The Push Job takes data from a Hadoop grid and writes it to Venice.
There are two modes the Push Job can run in:
- Full Push (default)
- Incremental Push
When performing a Full Push, the user takes advantage of the fact that Venice’s datasets are versioned. A Full Push triggers the dynamic creation of a new dataset version, and then loads data into it in the background. The new dataset version is called a “future” version as long as data is still loading, and during that time, no online read traffic will be served from it. When the loading is determined to have successfully completed in a given datacenter, the new dataset version transitions from “future” to “current”, whereas the old dataset version transitions from “current” to “backup”. When a dataset version becomes current, online read traffic gets routed to it.
When performing an Incremental Push, no new dataset versions are created, and data gets added into all existing versions of the dataset. This leverages the same mechanism as Streaming Writes, and requires that the store be configured as Hybrid.
The Push Job is designed to require as few configs as possible. The following configs are mandatory; they are specific to each use case and must be set by the user:
- `venice.store.name`: The name of the Venice store to push into.
- `input.path`: The HDFS path containing the data to be pushed, populated with one or many Avro files, where each file contains a sequence of records, and where each record has a `key` field and a `value` field.
In addition to the use-case-specific configs, some necessary configs are typically identical across all use cases in a given environment (e.g., one value for production, another for staging, etc.). The operator can therefore configure the following globally, in order to make the Push Job even easier for users to leverage:
- `venice.discover.urls`: The URL of the Venice controller.
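Putting the mandatory and environment-level configs together, a minimal Push Job properties file might look like the sketch below; the store name, HDFS path, and controller URL are all illustrative placeholders:

```properties
# Minimal Full Push configuration (illustrative values only)

# Set by the user, unique to each use case:
venice.store.name=my-store
input.path=/user/my-team/venice-input/daily-snapshot

# Typically set globally by the operator, once per environment:
venice.discover.urls=http://venice-controller.example.com:5555
```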
The user may choose to specify the following configs:
- `incremental.push`: Whether to run the job in incremental mode. Default: `false`.
- `key.field`: The name of the key field within the input records. Default: `key`.
- `value.field`: The name of the value field within the input records. Default: `value`.
- `allow.duplicate.key`: Whether to let the Push Job proceed even if it detects that the input contains multiple records having the same key but distinct values. If set to `true`, then the Push Job picks one of the values to be written in a non-deterministic fashion. Default: `false`.
- `extended.schema.validity.check.enabled`: Whether to perform extended schema validation on the input (equivalent to the `STRICT` mode in avro-util's `SchemaParseConfiguration`). If set to `false`, it becomes equivalent to avro-util's `LOOSE` mode. Default: `true`.
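As a sketch, the optional configs above can be layered on top of the mandatory ones, for example to run an Incremental Push against input whose key and value live under non-default field names. All values below are illustrative:

```properties
# Illustrative overrides of the optional configs
incremental.push=true
key.field=memberId
value.field=profile
allow.duplicate.key=true
extended.schema.validity.check.enabled=false
```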
The Push Job also supports using D2 URLs for automated controller service discovery. To use this, the user or operator must specify the following configs:

- `multi.region`: Must be set to `true` if the Venice cluster is deployed in multi-region mode; `false` otherwise.
- `venice.discover.urls`: The D2 URL of the Venice controller. It must be of the form `d2://<D2 service name of the controller>`.
- `d2.zk.hosts.<region name>`: The ZooKeeper addresses where the components in the specified region are announcing themselves to D2.

If the Venice cluster is deployed in multi-region mode:

- `venice.discover.urls` must use the D2 service name of the parent controller
- `parent.controller.region.name` must denote the name of the datacenter where the parent controller is deployed
- `d2.zk.hosts.<parent controller region>` is a mandatory config

Otherwise:

- `venice.discover.urls` must use the D2 service name of the child controller
- `source.grid.fabric` must denote the name of the datacenter where Venice is deployed
- `d2.zk.hosts.<source grid fabric>` is a mandatory config
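For example, a D2-based configuration for a multi-region deployment might look like the sketch below. The D2 service name, region name, and ZooKeeper addresses are placeholders, and `multi.region` is assumed here to be the config that toggles multi-region mode:

```properties
# Illustrative D2 service discovery configs (multi-region case)
multi.region=true
venice.discover.urls=d2://VeniceParentController
parent.controller.region.name=dc-0
d2.zk.hosts.dc-0=zk.dc-0.example.com:2181
```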
The user or operator may want to specify the following security-related configs: