Derived Data Platform for Planet-Scale Workloads
Venice is a derived data storage platform, providing the following characteristics:
- High throughput asynchronous ingestion from batch and streaming sources (e.g. Hadoop and Samza).
- Low latency online reads via remote queries or in-process caching.
- Active-active replication between regions with CRDT-based conflict resolution.
- Multi-cluster support within each region with operator-driven cluster assignment.
- Multi-tenancy, horizontal scalability and elasticity within each cluster.
The above makes Venice particularly suitable as the stateful component backing a Feature Store, such as Feathr. AI applications feed the output of their ML training jobs into Venice and then query the data for use during online inference workloads.
The Venice write path supports three granularities: full dataset swap, insertion of some rows into an existing dataset, and updates to some columns of some rows. Both Hadoop (batch) and Samza (streaming) support all three granularities, which yields the following matrix of supported operations:
| Write granularity | Batch source (Hadoop) | Stream source (Samza) |
|---|---|---|
| Full dataset swap | Full Push Job | Reprocessing Job |
| Insertion of some rows into an existing dataset | Incremental Push Job | Real-Time Job |
| Updates to some columns of some rows | Incremental Push Job doing Write Compute | Real-Time Job doing Write Compute |
Moreover, the three granularities of write operations can all be mixed within a single dataset. A dataset which gets full dataset swaps in addition to row insertion or row updates is called hybrid.
When configuring a store to be hybrid, an important concept is the rewind time, which defines how far back recent real-time writes are rewound and re-applied on top of the new generation of the dataset being swapped in.
This mechanism makes it possible to overlay the output of a stream processing job on top of that of a batch job. With partial updates, some columns can be updated in real time and others in batch, and these two sets of columns can either overlap or be disjoint, as desired.
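The rewind behavior described above can be sketched in a few lines of Python. This is only a conceptual illustration, not Venice's implementation: the names `RealTimeWrite`, `build_new_version`, `swap_time` and `rewind_seconds` are hypothetical, and the store is modeled as a plain dictionary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RealTimeWrite:
    """A single real-time write (hypothetical model, not a Venice type)."""
    timestamp: float  # seconds, on the same clock as swap_time
    key: str
    value: dict


def build_new_version(batch_snapshot, real_time_log, swap_time, rewind_seconds):
    """Overlay recent real-time writes on top of a freshly pushed batch snapshot.

    Only writes newer than (swap_time - rewind_seconds) are replayed; older
    writes are assumed to be already reflected in the batch snapshot.
    """
    new_version = {key: dict(value) for key, value in batch_snapshot.items()}
    cutoff = swap_time - rewind_seconds
    # Replay in timestamp order so that the latest write wins for each key.
    for write in sorted(real_time_log, key=lambda w: w.timestamp):
        if write.timestamp >= cutoff:
            new_version[write.key] = dict(write.value)
    return new_version


batch = {"u1": {"score": 1}, "u2": {"score": 2}}
log = [
    RealTimeWrite(timestamp=50, key="u1", value={"score": 10}),   # older than the rewind window
    RealTimeWrite(timestamp=950, key="u2", value={"score": 20}),  # inside the rewind window
]
result = build_new_version(batch, log, swap_time=1000, rewind_seconds=100)
```

In this sketch, the write to `u1` falls outside the rewind window and is dropped, while the write to `u2` is re-applied on top of the new batch data. A longer rewind time replays more real-time history, at the cost of more work during each swap.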
Write Compute includes two kinds of operations, which can be performed on the value associated with a given key:
- Partial update: set the content of a field within the value.
- Collection merging: add or remove entries in a set or map.
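To make the two kinds of operations concrete, here is a minimal Python sketch that models a value as a dictionary. The function names (`partial_update`, `merge_set`, `merge_map`) and their parameters are hypothetical and do not correspond to the Venice API. The sketch also assumes that additions are applied before removals, which is an illustrative choice only.

```python
def partial_update(value, field, new_content):
    """Partial update: set the content of one field, leaving the others as-is."""
    updated = dict(value)
    updated[field] = new_content
    return updated


def merge_set(value, field, add=(), remove=()):
    """Collection merging on a set field: add entries, then remove entries."""
    updated = dict(value)
    current = set(updated.get(field, ()))
    updated[field] = (current | set(add)) - set(remove)
    return updated


def merge_map(value, field, put=None, remove_keys=()):
    """Collection merging on a map field: put entries, then remove keys."""
    updated = dict(value)
    current = dict(updated.get(field, {}))
    current.update(put or {})
    for key in remove_keys:
        current.pop(key, None)
    updated[field] = current
    return updated


profile = {"name": "a", "tags": {"x", "y"}, "scores": {"m1": 0.5}}
step1 = partial_update(profile, "name", "b")
step2 = merge_set(step1, "tags", add={"z"}, remove={"x"})
step3 = merge_map(step2, "scores", put={"m2": 0.9}, remove_keys=["m1"])
```

Each function returns a new value and leaves its input untouched, so the writer only needs to describe the change rather than read, modify and rewrite the whole value.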
N.B.: Currently, write compute is only supported in conjunction with active-passive replication. Support for active-active replication is under development.
Venice supports the following read APIs:
- Single get: get the value associated with a single key.
- Batch get: get the values associated with a set of keys.
- Read compute: project some fields and/or compute some function on the fields of values associated with a set of keys.
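The shape of these three read APIs can be illustrated with a small Python sketch over an in-memory dictionary. The function names and the choice to omit missing keys from batch results are assumptions made for illustration; they are not the Venice client API.

```python
def single_get(store, key):
    """Single get: return the value for one key, or None if the key is absent."""
    return store.get(key)


def batch_get(store, keys):
    """Batch get: return the values for a set of keys (absent keys are omitted)."""
    return {key: store[key] for key in keys if key in store}


def project(store, keys, fields):
    """Read compute (projection only): keep just the requested fields."""
    return {
        key: {field: value[field] for field in fields if field in value}
        for key, value in batch_get(store, keys).items()
    }


store = {
    "m1": {"name": "alice", "embedding": [1.0, 2.0], "country": "US"},
    "m2": {"name": "bob", "embedding": [3.0, 4.0], "country": "FR"},
}
names_only = project(store, ["m1", "m2", "m3"], ["name"])
```

Projection reduces the amount of data returned per key, which matters most when values are large and only a few fields are needed.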
When using the read compute DSL, the following functions are currently supported:
- Dot product: compute the dot product of the float vector stored in a given field and another float vector provided as a query parameter, and return the resulting scalar.
- Cosine similarity: compute the cosine similarity between the float vector stored in a given field and another float vector provided as a query parameter, and return the resulting scalar.
- Hadamard product: compute the Hadamard (element-wise) product of the float vector stored in a given field and another float vector provided as a query parameter, and return the resulting vector.
- Collection count: return the number of items in the collection stored in a given field.
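For reference, the mathematical definitions of these four functions can be written in plain Python as follows. This is a sketch of the math only; it does not use the Venice read compute DSL, and the function names are hypothetical.

```python
import math


def dot_product(stored, query):
    """Sum of element-wise products; returns a scalar."""
    if len(stored) != len(query):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(stored, query))


def cosine_similarity(stored, query):
    """Dot product divided by the product of the two Euclidean norms."""
    norms = math.sqrt(dot_product(stored, stored)) * math.sqrt(dot_product(query, query))
    if norms == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot_product(stored, query) / norms


def hadamard_product(stored, query):
    """Element-wise product; returns a vector of the same length."""
    if len(stored) != len(query):
        raise ValueError("vectors must have the same length")
    return [x * y for x, y in zip(stored, query)]


def collection_count(collection):
    """Number of items in a list, set or map."""
    return len(collection)


stored_vector = [1.0, 2.0, 3.0]
query_vector = [4.0, 5.0, 6.0]
scalar = dot_product(stored_vector, query_vector)
vector = hadamard_product(stored_vector, query_vector)
```

Note that the first two functions return a scalar per key, while the Hadamard product returns a vector of the same length as its inputs.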
There are two main client modes for accessing Venice data:
- Classical Venice: perform remote queries against Venice’s distributed backend service. In this mode, read compute queries are pushed down to the backend and only the computation results are returned to the client.
- Da Vinci: eagerly load some or all partitions of the dataset and perform queries against the resulting local cache. Future updates to the data continue to be streamed in and applied to the local cache.
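The difference between the two modes can be sketched with a toy model in Python. Everything here is hypothetical (the `Backend`, `RemoteClient` and `LocalCacheClient` classes, the partition count, and the CRC32-based partitioning); it only illustrates where the data lives and where the computation runs in each mode.

```python
import zlib

NUM_PARTITIONS = 4  # hypothetical partition count


def partition_of(key):
    """Map a key to a partition (illustrative hashing scheme)."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


class Backend:
    """Stand-in for the distributed backend service."""

    def __init__(self, rows):
        self.partitions = [dict() for _ in range(NUM_PARTITIONS)]
        for key, value in rows.items():
            self.partitions[partition_of(key)][key] = value

    def get(self, key):
        return self.partitions[partition_of(key)].get(key)

    def compute(self, keys, fn):
        # The function runs next to the data; only its results are returned.
        return {k: fn(self.get(k)) for k in keys if self.get(k) is not None}


class RemoteClient:
    """Remote-query mode: every read is a call to the backend."""

    def __init__(self, backend):
        self.backend = backend

    def get(self, key):
        return self.backend.get(key)

    def compute(self, keys, fn):
        return self.backend.compute(keys, fn)


class LocalCacheClient:
    """Eager-loading mode: copy the subscribed partitions, then read locally."""

    def __init__(self, backend, partitions):
        self.subscribed = set(partitions)
        self.cache = {}
        for p in self.subscribed:
            self.cache.update(backend.partitions[p])

    def apply_update(self, key, value):
        # Streamed updates are applied only for subscribed partitions.
        if partition_of(key) in self.subscribed:
            self.cache[key] = value

    def get(self, key):
        return self.cache.get(key)


backend = Backend({"a": {"v": [1.0, 2.0]}, "b": {"v": [3.0, 4.0]}})
remote = RemoteClient(backend)
sums = remote.compute(["a", "b"], lambda value: sum(value["v"]))
local = LocalCacheClient(backend, range(NUM_PARTITIONS))
local.apply_update("c", {"v": [5.0]})
```

The trade-off shown here is the usual one: remote queries keep the client light and push work to the backend, while eager loading spends client memory in exchange for local reads.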
Refer to the Venice quickstart to create your own Venice cluster and play around with some features like creating a data store, batch push, incremental push, and single get. We recommend sticking to our latest stable release.
The following blog posts have previously been published about Venice:
- 2015: Prototyping Venice: Derived Data Platform
- 2017: Building Venice with Apache Helix
- 2017: Building Venice: A Production Software Case Study
- 2017: Venice Hybrid: Doing Lambda Better
- 2018: Venice Performance Optimization
- 2021: Taming memory fragmentation in Venice with Jemalloc
- 2022: Supporting large fanout use cases at scale in Venice
- 2022: Open Sourcing Venice – LinkedIn’s Derived Data Platform
The following talks have been given about Venice:
- 2018: Venice with Apache Kafka & Samza
- 2019: People You May Know: Fast Recommendations over Massive Data
- 2019: Enabling next generation models for PYMK Scale
- 2022: Open Sourcing Venice
- 2023: What is Derived Data? (and Do You Already Have Any?)
- 2023: Partial Updates in Venice
- 2023: When Only the Last Writer Wins We All Lose: Active-Active Geo-Replication in Venice
Keep in mind that older content reflects an earlier phase of the project and may not be entirely correct anymore.
Feel free to engage with the community using our:
- Slack workspace (archived and publicly searchable on Linen)
- LinkedIn group
- GitHub issues
- Contributor’s guide
Follow us to hear more about the progress of the Venice project and community: