Class VenicePushJobConstants

java.lang.Object
com.linkedin.venice.vpj.VenicePushJobConstants

public final class VenicePushJobConstants extends Object
  • Field Details

    • LEGACY_AVRO_KEY_FIELD_PROP

      public static final String LEGACY_AVRO_KEY_FIELD_PROP
      See Also:
    • LEGACY_AVRO_VALUE_FIELD_PROP

      public static final String LEGACY_AVRO_VALUE_FIELD_PROP
      See Also:
    • KEY_FIELD_PROP

      public static final String KEY_FIELD_PROP
      See Also:
    • VALUE_FIELD_PROP

      public static final String VALUE_FIELD_PROP
      See Also:
    • DEFAULT_KEY_FIELD_PROP

      public static final String DEFAULT_KEY_FIELD_PROP
      See Also:
    • DEFAULT_VALUE_FIELD_PROP

      public static final String DEFAULT_VALUE_FIELD_PROP
      See Also:
    • DEFAULT_SSL_ENABLED

      public static final boolean DEFAULT_SSL_ENABLED
      See Also:
    • SCHEMA_STRING_PROP

      public static final String SCHEMA_STRING_PROP
      See Also:
    • KAFKA_SOURCE_KEY_SCHEMA_STRING_PROP

      public static final String KAFKA_SOURCE_KEY_SCHEMA_STRING_PROP
      See Also:
    • EXTENDED_SCHEMA_VALIDITY_CHECK_ENABLED

      public static final String EXTENDED_SCHEMA_VALIDITY_CHECK_ENABLED
      See Also:
    • DEFAULT_EXTENDED_SCHEMA_VALIDITY_CHECK_ENABLED

      public static final boolean DEFAULT_EXTENDED_SCHEMA_VALIDITY_CHECK_ENABLED
      See Also:
    • UPDATE_SCHEMA_STRING_PROP

      public static final String UPDATE_SCHEMA_STRING_PROP
      See Also:
    • SPARK_NATIVE_INPUT_FORMAT_ENABLED

      public static final String SPARK_NATIVE_INPUT_FORMAT_ENABLED
      See Also:
    • FILE_KEY_SCHEMA

      public static final String FILE_KEY_SCHEMA
      See Also:
    • FILE_VALUE_SCHEMA

      public static final String FILE_VALUE_SCHEMA
      See Also:
    • INCREMENTAL_PUSH

      public static final String INCREMENTAL_PUSH
      See Also:
    • GENERATE_PARTIAL_UPDATE_RECORD_FROM_INPUT

      public static final String GENERATE_PARTIAL_UPDATE_RECORD_FROM_INPUT
      See Also:
    • PARTITION_COUNT

      public static final String PARTITION_COUNT
      See Also:
    • ALLOW_DUPLICATE_KEY

      public static final String ALLOW_DUPLICATE_KEY
      See Also:
    • POLL_STATUS_RETRY_ATTEMPTS

      public static final String POLL_STATUS_RETRY_ATTEMPTS
      See Also:
    • CONTROLLER_REQUEST_RETRY_ATTEMPTS

      public static final String CONTROLLER_REQUEST_RETRY_ATTEMPTS
      See Also:
    • POLL_JOB_STATUS_INTERVAL_MS

      public static final String POLL_JOB_STATUS_INTERVAL_MS
      See Also:
    • JOB_STATUS_IN_UNKNOWN_STATE_TIMEOUT_MS

      public static final String JOB_STATUS_IN_UNKNOWN_STATE_TIMEOUT_MS
      See Also:
    • SEND_CONTROL_MESSAGES_DIRECTLY

      public static final String SEND_CONTROL_MESSAGES_DIRECTLY
      See Also:
    • SOURCE_ETL

      public static final String SOURCE_ETL
      See Also:
    • ETL_VALUE_SCHEMA_TRANSFORMATION

      public static final String ETL_VALUE_SCHEMA_TRANSFORMATION
      See Also:
    • SYSTEM_SCHEMA_READER_ENABLED

      public static final String SYSTEM_SCHEMA_READER_ENABLED
      See Also:
    • SYSTEM_SCHEMA_CLUSTER_D2_SERVICE_NAME

      public static final String SYSTEM_SCHEMA_CLUSTER_D2_SERVICE_NAME
      See Also:
    • SYSTEM_SCHEMA_CLUSTER_D2_ZK_HOST

      public static final String SYSTEM_SCHEMA_CLUSTER_D2_ZK_HOST
      See Also:
    • COMPRESSION_METRIC_COLLECTION_ENABLED

      public static final String COMPRESSION_METRIC_COLLECTION_ENABLED
      Config to enable/disable the feature to collect extra metrics wrt compression. Enabling this collects metrics for all compression strategies regardless of the configured compression strategy. This means: zstd dictionary will be created even if CompressionStrategy.ZSTD_WITH_DICT is not the configured store compression strategy (refer VenicePushJob.shouldBuildZstdCompressionDictionary(com.linkedin.venice.hadoop.PushJobSetting, boolean))

      This config also gets evaluated in VenicePushJob.evaluateCompressionMetricCollectionEnabled(com.linkedin.venice.hadoop.PushJobSetting, boolean)

      See Also:
    • DEFAULT_COMPRESSION_METRIC_COLLECTION_ENABLED

      public static final boolean DEFAULT_COMPRESSION_METRIC_COLLECTION_ENABLED
      See Also:
    • USE_MAPPER_TO_BUILD_DICTIONARY

      public static final String USE_MAPPER_TO_BUILD_DICTIONARY
      Config to enable/disable using mapper to do the below which are currently done in VPJ driver
      1. validate schema,
      2. collect the input data size
      3. build dictionary (if needed: refer VenicePushJob.shouldBuildZstdCompressionDictionary(com.linkedin.venice.hadoop.PushJobSetting, boolean))

      This new mapper was added because the sample collection for Zstd dictionary is currently in-memory and to help play around with the sample size and also to support future enhancements if needed.

      Currently, the plan is to only have it available for MapReduce compute engine and remove it eventually as we remove the MapReduce-related codes.
      See Also:
    • DEFAULT_USE_MAPPER_TO_BUILD_DICTIONARY

      public static final boolean DEFAULT_USE_MAPPER_TO_BUILD_DICTIONARY
      See Also:
    • MINIMUM_NUMBER_OF_SAMPLES_REQUIRED_TO_BUILD_ZSTD_DICTIONARY

      public static final int MINIMUM_NUMBER_OF_SAMPLES_REQUIRED_TO_BUILD_ZSTD_DICTIONARY
      Known zstd lib issue which crashes if the input sample is too small. So adding a preventive check to skip training the dictionary in such cases using a minimum limit of 20. Keeping it simple and hard coding it as if this check doesn't prevent some edge cases then we can disable the feature itself
      See Also:
    • ZSTD_DICTIONARY_CREATION_REQUIRED

      public static final String ZSTD_DICTIONARY_CREATION_REQUIRED
      Configs to pass to AbstractVeniceMapper based on the input configs and Dictionary training status
      See Also:
    • ZSTD_DICTIONARY_CREATION_SUCCESS

      public static final String ZSTD_DICTIONARY_CREATION_SUCCESS
      See Also:
    • VALIDATE_SCHEMA_AND_BUILD_DICT_MAPPER_OUTPUT_DIRECTORY

      public static final String VALIDATE_SCHEMA_AND_BUILD_DICT_MAPPER_OUTPUT_DIRECTORY
      Location and key to store the output of ValidateSchemaAndBuildDictMapper and retrieve it back when USE_MAPPER_TO_BUILD_DICTIONARY is enabled
      See Also:
    • VALIDATE_SCHEMA_AND_BUILD_DICTIONARY_MAPPER_OUTPUT_FILE_PREFIX

      public static final String VALIDATE_SCHEMA_AND_BUILD_DICTIONARY_MAPPER_OUTPUT_FILE_PREFIX
      See Also:
    • VALIDATE_SCHEMA_AND_BUILD_DICTIONARY_MAPPER_OUTPUT_FILE_EXTENSION

      public static final String VALIDATE_SCHEMA_AND_BUILD_DICTIONARY_MAPPER_OUTPUT_FILE_EXTENSION
      See Also:
    • KEY_ZSTD_COMPRESSION_DICTIONARY

      public static final String KEY_ZSTD_COMPRESSION_DICTIONARY
      See Also:
    • KEY_INPUT_FILE_DATA_SIZE

      public static final String KEY_INPUT_FILE_DATA_SIZE
      See Also:
    • SOURCE_KAFKA

      public static final String SOURCE_KAFKA
      Configs used to enable Kafka Input.
      See Also:
    • KAFKA_INPUT_TOPIC

      public static final String KAFKA_INPUT_TOPIC
      TODO: consider to automatically discover the source topic for the specified store. We need to be careful in the following scenarios: 1. Not all the prod colos are using the same current version if the previous push experiences a partial failure. 2. We might want to re-push from a backup version, which should be unlikely.
      See Also:
    • KAFKA_INPUT_FABRIC

      public static final String KAFKA_INPUT_FABRIC
      See Also:
    • KAFKA_INPUT_BROKER_URL

      public static final String KAFKA_INPUT_BROKER_URL
      See Also:
    • KAFKA_INPUT_MAX_RECORDS_PER_MAPPER

      public static final String KAFKA_INPUT_MAX_RECORDS_PER_MAPPER
      See Also:
    • KAFKA_INPUT_COMBINER_ENABLED

      public static final String KAFKA_INPUT_COMBINER_ENABLED
      See Also:
    • KAFKA_INPUT_COMPRESSION_BUILD_NEW_DICT_ENABLED

      public static final String KAFKA_INPUT_COMPRESSION_BUILD_NEW_DICT_ENABLED
      See Also:
    • KAFKA_INPUT_SOURCE_TOPIC_CHUNKING_ENABLED

      public static final String KAFKA_INPUT_SOURCE_TOPIC_CHUNKING_ENABLED
      See Also:
    • REWIND_TIME_IN_SECONDS_OVERRIDE

      public static final String REWIND_TIME_IN_SECONDS_OVERRIDE
      Optional. If we want to use a different rewind time from the default store-level rewind time config for Kafka Input re-push, the following property needs to specified explicitly. This property comes to play when the default rewind time configured in store-level is too short or too long. 1. If the default rewind time config is too short (for example 0 or several mins), it could cause data gap with re-push since the push job itself could take several hours, and we would like to make sure the re-pushed version will contain the same dataset as the source version. 2. If the default rewind time config is too long (such as 28 days), it will be a big waste to rewind so much time since the time gap between the source version and the re-push version should be comparable to the re-push time if the whole ingestion pipeline is not lagging. There are some challenges to automatically detect the right rewind time for re-push because of the following reasons: 1. For Venice Aggregate use case, some colo could be lagging behind other prod colos, so if the re-push source is from a fast colo, too short rewind time could cause a data gap in the slower colos. Ideally, it is good to use the slowest colo as the re-push source. 2. For Venice non-Aggregate use case, the ingestion pipeline will include the following several phases: 2.1 Customer's Kafka aggregation and mirroring pipeline to replicate the same data to all prod colos. 2.2 Venice Ingestion pipeline to consume the local real-time topic. We have visibility to 2.2, but not 2.1, so we may need to work with customer to understand how 2.1 can be measured or use a long enough rewind time to mitigate all the potential issues. Make this property available in generic since it should be useful for ETL+VPJ use case as well.
      See Also:
    • REWIND_EPOCH_TIME_IN_SECONDS_OVERRIDE

      public static final String REWIND_EPOCH_TIME_IN_SECONDS_OVERRIDE
      A time stamp specified to rewind to before replaying data. This config is ignored if rewind.time.in.seconds.override is provided. This config at time of push will be leveraged to fill in the rewind.time.in.seconds.override by taking System.currentTime - rewind.epoch.time.in.seconds.override and storing the result in rewind.time.in.seconds.override. With this in mind, a push policy of REWIND_FROM_SOP should be used in order to get a behavior that makes sense to a user. A timestamp that is in the future is not valid and will result in an exception.
      See Also:
    • SUPPRESS_END_OF_PUSH_MESSAGE

      public static final String SUPPRESS_END_OF_PUSH_MESSAGE
      This config is a boolean which suppresses submitting the end of push message after data has been sent and does not poll for the status of the job to complete. Using this flag means that a user must manually mark the job success or failed.
      See Also:
    • DEFER_VERSION_SWAP

      public static final String DEFER_VERSION_SWAP
      This config is a boolean which waits for an external signal to trigger version swap after buffer replay is complete.
      See Also:
    • D2_ZK_HOSTS_PREFIX

      public static final String D2_ZK_HOSTS_PREFIX
      This config specifies the prefix for d2 zk hosts config. Configs of type <prefix>.<regionName> are expected to be defined.
      See Also:
    • PARENT_CONTROLLER_REGION_NAME

      public static final String PARENT_CONTROLLER_REGION_NAME
      This config specifies the region identifier where parent controller is running
      See Also:
    • REWIND_EPOCH_TIME_BUFFER_IN_SECONDS_OVERRIDE

      public static final String REWIND_EPOCH_TIME_BUFFER_IN_SECONDS_OVERRIDE
      Relates to the above argument. An overridable amount of buffer to be applied to the epoch (as the rewind isn't perfectly instantaneous). Defaults to 1 minute.
      See Also:
    • VENICE_DISCOVER_URL_PROP

      public static final String VENICE_DISCOVER_URL_PROP
      In single-region mode, this must be a comma-separated list of child controller URLs or d2://<d2ServiceNameForChildController> In multi-region mode, it must be a comma-separated list of parent controller URLs or d2://<d2ServiceNameForParentController>
      See Also:
    • SOURCE_GRID_FABRIC

      public static final String SOURCE_GRID_FABRIC
      An identifier of the data center which is used to determine the Kafka URL and child controllers that push jobs communicate with
      See Also:
    • ENABLE_WRITE_COMPUTE

      public static final String ENABLE_WRITE_COMPUTE
      See Also:
    • ENABLE_SSL

      public static final String ENABLE_SSL
      See Also:
    • VENICE_STORE_NAME_PROP

      public static final String VENICE_STORE_NAME_PROP
      See Also:
    • INPUT_PATH_PROP

      public static final String INPUT_PATH_PROP
      See Also:
    • INPUT_PATH_LAST_MODIFIED_TIME

      public static final String INPUT_PATH_LAST_MODIFIED_TIME
      See Also:
    • BATCH_NUM_BYTES_PROP

      public static final String BATCH_NUM_BYTES_PROP
      See Also:
    • PATH_FILTER

      public static final org.apache.hadoop.fs.PathFilter PATH_FILTER
      ignore hdfs files with prefix "_" and "."
    • GLOB_FILTER_PATTERN

      public static final String GLOB_FILTER_PATTERN
      See Also:
    • HADOOP_TMP_DIR

      public static final String HADOOP_TMP_DIR
      See Also:
    • TEMP_DIR_PREFIX

      public static final String TEMP_DIR_PREFIX
      See Also:
    • PERMISSION_777

      public static final org.apache.hadoop.fs.permission.FsPermission PERMISSION_777
    • PERMISSION_700

      public static final org.apache.hadoop.fs.permission.FsPermission PERMISSION_700
    • VALUE_SCHEMA_ID_PROP

      public static final String VALUE_SCHEMA_ID_PROP
      See Also:
    • DERIVED_SCHEMA_ID_PROP

      public static final String DERIVED_SCHEMA_ID_PROP
      See Also:
    • TOPIC_PROP

      public static final String TOPIC_PROP
      See Also:
    • HADOOP_VALIDATE_SCHEMA_AND_BUILD_DICT_PREFIX

      public static final String HADOOP_VALIDATE_SCHEMA_AND_BUILD_DICT_PREFIX
      See Also:
    • SSL_PREFIX

      public static final String SSL_PREFIX
      See Also:
    • STORAGE_QUOTA_PROP

      public static final String STORAGE_QUOTA_PROP
      See Also:
    • STORAGE_ENGINE_OVERHEAD_RATIO

      public static final String STORAGE_ENGINE_OVERHEAD_RATIO
      See Also:
    • VSON_PUSH

      @Deprecated public static final String VSON_PUSH
      Deprecated.
      See Also:
    • KAFKA_SECURITY_PROTOCOL

      public static final String KAFKA_SECURITY_PROTOCOL
      See Also:
    • COMPRESSION_STRATEGY

      public static final String COMPRESSION_STRATEGY
      See Also:
    • KAFKA_INPUT_SOURCE_COMPRESSION_STRATEGY

      public static final String KAFKA_INPUT_SOURCE_COMPRESSION_STRATEGY
      See Also:
    • SSL_CONFIGURATOR_CLASS_CONFIG

      public static final String SSL_CONFIGURATOR_CLASS_CONFIG
      See Also:
    • SSL_KEY_STORE_PROPERTY_NAME

      public static final String SSL_KEY_STORE_PROPERTY_NAME
      See Also:
    • SSL_TRUST_STORE_PROPERTY_NAME

      public static final String SSL_TRUST_STORE_PROPERTY_NAME
      See Also:
    • SSL_KEY_STORE_PASSWORD_PROPERTY_NAME

      public static final String SSL_KEY_STORE_PASSWORD_PROPERTY_NAME
      See Also:
    • SSL_KEY_PASSWORD_PROPERTY_NAME

      public static final String SSL_KEY_PASSWORD_PROPERTY_NAME
      See Also:
    • JOB_EXEC_URL

      public static final String JOB_EXEC_URL
      This will define the execution servers url for easy access to the job execution during debugging. This can be any string that is meaningful to the execution environment.
      See Also:
    • JOB_SERVER_NAME

      public static final String JOB_SERVER_NAME
      The short name of the server where the job runs
      See Also:
    • JOB_EXEC_ID

      public static final String JOB_EXEC_ID
      The execution ID of the execution if this job is a part of a multi-step flow. Each step in the flow can have the same execution id
      See Also:
    • PUSH_JOB_STATUS_UPLOAD_ENABLE

      public static final String PUSH_JOB_STATUS_UPLOAD_ENABLE
      Config to enable the service that uploads push job statuses to the controller using ControllerClient.uploadPushJobStatus(), the job status is then packaged and sent to dedicated Kafka channel.
      See Also:
    • REDUCER_SPECULATIVE_EXECUTION_ENABLE

      public static final String REDUCER_SPECULATIVE_EXECUTION_ENABLE
      See Also:
    • TELEMETRY_MESSAGE_INTERVAL

      public static final String TELEMETRY_MESSAGE_INTERVAL
      The interval of number of messages upon which certain info is printed in the reducer logs.
      See Also:
    • ZSTD_COMPRESSION_LEVEL

      public static final String ZSTD_COMPRESSION_LEVEL
      Config to control the Compression Level for ZSTD Dictionary Compression.
      See Also:
    • DEFAULT_BATCH_BYTES_SIZE

      public static final int DEFAULT_BATCH_BYTES_SIZE
      See Also:
    • SORTED

      public static final boolean SORTED
      See Also:
    • DEFAULT_RE_PUSH_REWIND_IN_SECONDS_OVERRIDE

      public static final long DEFAULT_RE_PUSH_REWIND_IN_SECONDS_OVERRIDE
      The rewind override when performing re-push to prevent data loss; if the store has higher rewind config setting than 1 days, adopt the store config instead; otherwise, override the rewind config to 1 day if push job config doesn't try to override it.
      See Also:
    • REPUSH_TTL_ENABLE

      public static final String REPUSH_TTL_ENABLE
      Config to control the TTL behaviors in repush.
      See Also:
    • REPUSH_TTL_POLICY

      public static final String REPUSH_TTL_POLICY
      See Also:
    • REPUSH_TTL_SECONDS

      public static final String REPUSH_TTL_SECONDS
      See Also:
    • REPUSH_TTL_START_TIMESTAMP

      public static final String REPUSH_TTL_START_TIMESTAMP
      See Also:
    • RMD_SCHEMA_DIR

      public static final String RMD_SCHEMA_DIR
      See Also:
    • VALUE_SCHEMA_DIR

      public static final String VALUE_SCHEMA_DIR
      See Also:
    • NOT_SET

      public static final int NOT_SET
      See Also:
    • TARGETED_REGION_PUSH_ENABLED

      public static final String TARGETED_REGION_PUSH_ENABLED
      Config to enable single targeted region push mode in VPJ. In this mode, the VPJ will only push data to a single region. The single region is decided by the store config in StoreInfo.getNativeReplicationSourceFabric()}. For multiple targeted regions push, may use the advanced mode. See TARGETED_REGION_PUSH_LIST.
      See Also:
    • TARGETED_REGION_PUSH_LIST

      public static final String TARGETED_REGION_PUSH_LIST
      This is experimental config to specify a list of regions used for targeted region push in VPJ. TARGETED_REGION_PUSH_ENABLED has to be enabled to use this config. In this mode, the VPJ will only push data to the provided regions. The input is comma separated list of region names, e.g. "dc-0,dc-1,dc-2". For single targeted region push, see TARGETED_REGION_PUSH_ENABLED.
      See Also:
    • TARGETED_REGION_PUSH_WITH_DEFERRED_SWAP

      public static final String TARGETED_REGION_PUSH_WITH_DEFERRED_SWAP
      Config to enable target region push with deferred version swap In this mode, the VPJ will push data to all regions and only switch to the new version in a single region. The single region is decided by the store config in StoreInfo.getNativeReplicationSourceFabric()} unless a list of regions is passed in from targeted.region.push.list After a specified wait time (default 1h), the remaining regions will switch to the new version.
      See Also:
    • DEFAULT_IS_DUPLICATED_KEY_ALLOWED

      public static final boolean DEFAULT_IS_DUPLICATED_KEY_ALLOWED
      See Also:
    • MAP_REDUCE_PARTITIONER_CLASS_CONFIG

      public static final String MAP_REDUCE_PARTITIONER_CLASS_CONFIG
      Config used only for tests. Should not be used at regular runtime.
      See Also:
    • UNCREATED_VERSION_NUMBER

      public static final int UNCREATED_VERSION_NUMBER
      Placeholder for version number that is yet to be created.
      See Also:
    • DEFAULT_POLL_STATUS_INTERVAL_MS

      public static final long DEFAULT_POLL_STATUS_INTERVAL_MS
      See Also:
    • DEFAULT_JOB_STATUS_IN_UNKNOWN_STATE_TIMEOUT_MS

      public static final long DEFAULT_JOB_STATUS_IN_UNKNOWN_STATE_TIMEOUT_MS
      The default total time we wait before failing a job if the job status stays in UNKNOWN state.
      See Also:
    • NON_CRITICAL_EXCEPTION

      public static final String NON_CRITICAL_EXCEPTION
      See Also:
    • COMPRESSION_DICTIONARY_SAMPLE_SIZE

      public static final String COMPRESSION_DICTIONARY_SAMPLE_SIZE
      Sample size to collect for building dictionary: Can be assigned a max of 2GB as ZstdDictTrainer in ZSTD library takes in sample size as int
      See Also:
    • DEFAULT_COMPRESSION_DICTIONARY_SAMPLE_SIZE

      public static final int DEFAULT_COMPRESSION_DICTIONARY_SAMPLE_SIZE
      See Also:
    • COMPRESSION_DICTIONARY_SIZE_LIMIT

      public static final String COMPRESSION_DICTIONARY_SIZE_LIMIT
      Maximum final dictionary size TODO add more details about the current limits
      See Also:
    • DATA_WRITER_COMPUTE_JOB_CLASS

      public static final String DATA_WRITER_COMPUTE_JOB_CLASS
      Config to set the class for the DataWriter job. When using KIF, we currently will continue to fall back to MR mode. The class must extend DataWriterComputeJob and have a zero-arg constructor.
      See Also:
    • PUSH_TO_SEPARATE_REALTIME_TOPIC

      public static final String PUSH_TO_SEPARATE_REALTIME_TOPIC
      See Also: