Class VTConsistencyCheckerJob

java.lang.Object
com.linkedin.venice.spark.consistency.VTConsistencyCheckerJob

public class VTConsistencyCheckerJob extends Object
Spark job that checks version topic (VT) consistency between two datacenters (DCs) using the Lily Pad algorithm.

Parallelizes across VT partitions: each Spark task handles one partition, consuming directly from both DCs' PubSub brokers and running the Lily Pad algorithm to find keys where the two leaders disagreed despite having full information.
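The per-partition comparison can be sketched as follows. This is a simplified illustration only, not the actual Lily Pad implementation: it assumes each DC's consumed state for one partition has already been reduced to a map of key hash to latest value hash, and it reports every key where the two maps disagree.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Simplified sketch of the per-partition check; not the real Lily Pad algorithm. */
public final class PartitionComparator {
  /**
   * Given each DC's reduced state for one VT partition (key hash -> latest
   * value hash), return the key hashes whose latest values disagree.
   */
  public static Set<String> findMismatches(Map<String, String> dc0, Map<String, String> dc1) {
    Set<String> mismatches = new HashSet<>();
    Set<String> allKeys = new HashSet<>(dc0.keySet());
    allKeys.addAll(dc1.keySet());
    for (String key : allKeys) {
      String v0 = dc0.get(key);
      String v1 = dc1.get(key);
      // A key missing from one DC, or present with a different value hash, is a mismatch.
      if (v0 == null || !v0.equals(v1)) {
        mismatches.add(key);
      }
    }
    return mismatches;
  }
}
```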

Output is written as Parquet. Each row is one detected inconsistency with full forensic context (key hash, value hashes from both DCs, position vectors, high watermarks).
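As an illustration of the forensic context carried per row, one row could be modeled like this. The field names here are assumptions for readability; the job's actual Parquet schema may differ.

```java
/** Illustrative shape of one output row; field names are assumptions, not the job's schema. */
public record InconsistencyRow(
    String keyHash,            // hash of the disagreeing key
    String dc0ValueHash,       // value hash observed in DC-0
    String dc1ValueHash,       // value hash observed in DC-1
    String dc0PositionVector,  // serialized position vector from DC-0
    String dc1PositionVector,  // serialized position vector from DC-1
    long dc0HighWatermark,
    long dc1HighWatermark) {}
```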

Example invocation:

   java -cp venice-push-job-all.jar \
     com.linkedin.venice.spark.consistency.VTConsistencyCheckerJob \
     /path/to/vt-consistency.properties

Required config keys:

  • "dc0.broker.url" — PubSub broker address for DC-0
  • "dc1.broker.url" — PubSub broker address for DC-1
  • "store.name" — store name; the version topic is resolved to the store's current version via the controller
  • "venice.discover.urls" — comma-separated controller URLs used to look up the current version; D2-based discovery is not supported here, so plain HTTP URLs are required
  • "output.path" — output path to write Parquet results
  • "number.of.regions" — total number of regions in the AA topology

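Putting the required keys together, a minimal properties file might look like this (host names, port numbers, and paths below are placeholders):

```properties
dc0.broker.url=pubsub-dc0.example.com:9092
dc1.broker.url=pubsub-dc1.example.com:9092
store.name=my-store
venice.discover.urls=http://controller-a.example.com:1576,http://controller-b.example.com:1576
output.path=hdfs:///tmp/vt-consistency-output
number.of.regions=2
```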
Spark settings can also be supplied through the job properties; Spark session configs use the venice.spark.session.conf. prefix:

   venice.spark.cluster=yarn
   venice.spark.session.conf.spark.executor.memory=20g
   venice.spark.session.conf.spark.executor.instances=20
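Before submitting, the properties file can be sanity-checked for the required keys. The helper below is a hypothetical pre-flight sketch (not part of the job); the key names are taken from the list above.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;
import java.util.stream.Collectors;

/** Hypothetical pre-flight check for the job's required config keys. */
public final class ConfigCheck {
  static final List<String> REQUIRED_KEYS = Arrays.asList(
      "dc0.broker.url", "dc1.broker.url", "store.name",
      "venice.discover.urls", "output.path", "number.of.regions");

  /** Returns the required keys missing from the given properties source, in declaration order. */
  public static List<String> missingKeys(Reader source) throws IOException {
    Properties props = new Properties();
    props.load(source);
    return REQUIRED_KEYS.stream()
        .filter(key -> !props.containsKey(key))
        .collect(Collectors.toList());
  }

  public static void main(String[] args) throws IOException {
    String sample = "dc0.broker.url=broker0:9092\nstore.name=my-store\n";
    // Prints the keys still needed before the job can run.
    System.out.println(missingKeys(new StringReader(sample)));
  }
}
```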