Class VenicePushJobConstants
-
Field Details
-
LEGACY_AVRO_KEY_FIELD_PROP
- See Also:
-
LEGACY_AVRO_VALUE_FIELD_PROP
- See Also:
-
KEY_FIELD_PROP
- See Also:
-
VALUE_FIELD_PROP
- See Also:
-
TIMESTAMP_FIELD_PROP
- See Also:
-
DEFAULT_KEY_FIELD_PROP
- See Also:
-
DEFAULT_VALUE_FIELD_PROP
- See Also:
-
DEFAULT_TIMESTAMP_FIELD_PROP
- See Also:
-
DEFAULT_SSL_ENABLED
public static final boolean DEFAULT_SSL_ENABLED
- See Also:
-
SCHEMA_STRING_PROP
- See Also:
-
KAFKA_SOURCE_KEY_SCHEMA_STRING_PROP
- See Also:
-
EXTENDED_SCHEMA_VALIDITY_CHECK_ENABLED
- See Also:
-
DEFAULT_EXTENDED_SCHEMA_VALIDITY_CHECK_ENABLED
public static final boolean DEFAULT_EXTENDED_SCHEMA_VALIDITY_CHECK_ENABLED
- See Also:
-
UPDATE_SCHEMA_STRING_PROP
- See Also:
-
RMD_SCHEMA_PROP
- See Also:
-
SPARK_NATIVE_INPUT_FORMAT_ENABLED
- See Also:
-
FILE_KEY_SCHEMA
- See Also:
-
FILE_VALUE_SCHEMA
- See Also:
-
INCREMENTAL_PUSH
- See Also:
-
GENERATE_PARTIAL_UPDATE_RECORD_FROM_INPUT
- See Also:
-
PARTITION_COUNT
- See Also:
-
ALLOW_DUPLICATE_KEY
- See Also:
-
POLL_STATUS_RETRY_ATTEMPTS
- See Also:
-
CONTROLLER_REQUEST_RETRY_ATTEMPTS
- See Also:
-
POLL_JOB_STATUS_INTERVAL_MS
- See Also:
-
JOB_STATUS_IN_UNKNOWN_STATE_TIMEOUT_MS
- See Also:
-
PUSH_JOB_TIMEOUT_OVERRIDE_MS
- See Also:
-
SEND_CONTROL_MESSAGES_DIRECTLY
- See Also:
-
SOURCE_ETL
- See Also:
-
ETL_VALUE_SCHEMA_TRANSFORMATION
- See Also:
-
SYSTEM_SCHEMA_READER_ENABLED
- See Also:
-
SYSTEM_SCHEMA_CLUSTER_D2_SERVICE_NAME
- See Also:
-
SYSTEM_SCHEMA_CLUSTER_D2_ZK_HOST
- See Also:
-
COMPRESSION_METRIC_COLLECTION_ENABLED
Config to enable/disable collecting extra metrics related to compression. Enabling this collects metrics for all compression strategies regardless of the configured one. This means a zstd dictionary will be created even if
CompressionStrategy.ZSTD_WITH_DICT
is not the configured store compression strategy (refer to
VenicePushJob.shouldBuildZstdCompressionDictionary(com.linkedin.venice.hadoop.PushJobSetting, boolean)
). This config also gets evaluated in
VenicePushJob.evaluateCompressionMetricCollectionEnabled(com.linkedin.venice.hadoop.PushJobSetting, boolean)
.
- See Also:
-
DEFAULT_COMPRESSION_METRIC_COLLECTION_ENABLED
public static final boolean DEFAULT_COMPRESSION_METRIC_COLLECTION_ENABLED
- See Also:
-
MINIMUM_NUMBER_OF_SAMPLES_REQUIRED_TO_BUILD_ZSTD_DICTIONARY
public static final int MINIMUM_NUMBER_OF_SAMPLES_REQUIRED_TO_BUILD_ZSTD_DICTIONARY
A known zstd library issue causes a crash when the input sample set is too small, so a preventive check skips dictionary training when fewer than 20 samples are available. This is kept simple and hard-coded; if the check turns out not to cover some edge case, the feature itself can be disabled.
- See Also:
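A minimal sketch of the guard described here; the sample-list type and method are illustrative stand-ins, not Venice's actual code:

import java.util.List;

public final class DictionaryTrainingGuard {
  // Mirrors the hard-coded minimum of 20 samples described above.
  static final int MINIMUM_SAMPLES_FOR_ZSTD_DICTIONARY = 20;

  // Returns true only when enough samples were collected to train safely,
  // avoiding the known zstd crash on very small sample sets.
  static boolean shouldTrainDictionary(List<byte[]> collectedSamples) {
    return collectedSamples.size() >= MINIMUM_SAMPLES_FOR_ZSTD_DICTIONARY;
  }
}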
-
ZSTD_DICTIONARY_CREATION_REQUIRED
Configs to pass to
AbstractVeniceMapper
based on the input configs and dictionary training status.
- See Also:
-
ZSTD_DICTIONARY_CREATION_SUCCESS
- See Also:
-
KEY_ZSTD_COMPRESSION_DICTIONARY
- See Also:
-
KEY_INPUT_FILE_DATA_SIZE
- See Also:
-
SOURCE_KAFKA
Configs used to enable Kafka Input.
- See Also:
-
KAFKA_INPUT_TOPIC
TODO: consider automatically discovering the source topic for the specified store. We need to be careful in the following scenarios:
1. Not all prod colos may be using the same current version if the previous push experienced a partial failure.
2. We might want to re-push from a backup version, which should be unlikely.
- See Also:
-
KAFKA_INPUT_FABRIC
- See Also:
-
KAFKA_INPUT_BROKER_URL
- See Also:
-
KAFKA_INPUT_MAX_RECORDS_PER_MAPPER
- See Also:
-
KIF_RECORD_READER_KAFKA_CONFIG_PREFIX
- See Also:
-
DEFAULT_PUBSUB_INPUT_MAX_RECORDS_PER_MAPPER
public static final long DEFAULT_PUBSUB_INPUT_MAX_RECORDS_PER_MAPPER
The default maximum number of records per mapper; if a topic partition contains more records than this, it will be consumed by multiple mappers in parallel. Note that this calculation is not exact, since it is based purely on offsets and the consumed topic could have log compaction enabled.
- See Also:
-
PUBSUB_INPUT_SECONDARY_COMPARATOR_USE_LOCAL_LOGICAL_INDEX
Use a locally generated logical index as the secondary comparator after comparing keys in repush mappers. Both strategies order records latest first:
- Disabled: use the PubSub position/offset, higher position first.
- Enabled: use a local logical index (assigned after the poll call), higher index first.
This remains configurable because the local index may misorder records in rare cases: for example, if log compaction occurs during repush consumption and a split fails or runs speculatively, a newer record could get a lower logical index. Offsets avoid that risk. Once we no longer depend on PubSub log compaction, the logical index alone will be sufficient. Default: false.
- See Also:
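To make the two orderings concrete, here is a minimal comparator sketch; ConsumedRecord and its fields are hypothetical stand-ins for whatever record wrapper the repush mappers actually use:

import java.nio.ByteBuffer;
import java.util.Comparator;

// Hypothetical record wrapper, for illustration only.
class ConsumedRecord {
  byte[] key;
  long pubSubOffset; // position/offset reported by the PubSub system
  long localIndex;   // logical index assigned locally after each poll call
}

class RepushOrderings {
  // Primary ordering: lexicographic comparison of the serialized keys.
  static final Comparator<ConsumedRecord> BY_KEY =
      Comparator.comparing((ConsumedRecord r) -> ByteBuffer.wrap(r.key));

  // Config disabled: break key ties by PubSub offset, newest (highest) first.
  static final Comparator<ConsumedRecord> OFFSET_TIE_BREAK =
      BY_KEY.thenComparing(Comparator.comparingLong((ConsumedRecord r) -> r.pubSubOffset).reversed());

  // Config enabled: break key ties by the local logical index, highest first.
  static final Comparator<ConsumedRecord> LOCAL_INDEX_TIE_BREAK =
      BY_KEY.thenComparing(Comparator.comparingLong((ConsumedRecord r) -> r.localIndex).reversed());
}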
-
DEFAULT_PUBSUB_INPUT_SECONDARY_COMPARATOR_USE_LOCAL_LOGICAL_INDEX
public static final boolean DEFAULT_PUBSUB_INPUT_SECONDARY_COMPARATOR_USE_LOCAL_LOGICAL_INDEX
- See Also:
-
PUBSUB_INPUT_SPLIT_STRATEGY
Configuration key for specifying the PubSub input split strategy. The split strategy determines how input splits are generated for processing PubSub records. Supported values are those of
PartitionSplitStrategy
.
- See Also:
-
DEFAULT_PUBSUB_INPUT_SPLIT_STRATEGY
Default split type for PubSub input. This value is derived from
PartitionSplitStrategy.FIXED_RECORD_COUNT
, which generates splits containing a fixed number of records.
-
PUBSUB_INPUT_MAX_SPLITS_PER_PARTITION
Configuration key for the maximum number of splits to create per PubSub topic-partition. This setting limits the total number of input splits generated for large partitions. The default value is
DEFAULT_MAX_SPLITS_PER_PARTITION
.
- See Also:
-
DEFAULT_MAX_SPLITS_PER_PARTITION
public static final int DEFAULT_MAX_SPLITS_PER_PARTITION
Default maximum number of splits to create per PubSub topic-partition. This value is taken from
SplitRequest.DEFAULT_MAX_SPLITS
.
- See Also:
-
PUBSUB_INPUT_SPLIT_TIME_WINDOW_IN_MINUTES
Configuration key for the time window (in minutes) used to split a PubSub input topic-partition. Splits are generated so that each covers at most the configured time window. The default value is
DEFAULT_PUBSUB_INPUT_TIME_WINDOW_IN_MINUTES
.
- See Also:
-
DEFAULT_PUBSUB_INPUT_TIME_WINDOW_IN_MINUTES
public static final long DEFAULT_PUBSUB_INPUT_TIME_WINDOW_IN_MINUTES
Default time window (in minutes) for splitting a PubSub input topic-partition. This value is derived from
SplitRequest.DEFAULT_TIME_WINDOW_MS
and converted from milliseconds to minutes.
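Taken together, the three split-related configs above could be tuned through job properties as in this sketch; the package of VenicePushJobConstants is assumed, and the values shown are arbitrary:

import java.util.Properties;

import com.linkedin.venice.vpj.VenicePushJobConstants; // package assumed

public final class PubSubSplitConfigExample {
  public static Properties splitProps() {
    Properties props = new Properties();
    // Strategy name assumed to match a PartitionSplitStrategy enum constant.
    props.setProperty(VenicePushJobConstants.PUBSUB_INPUT_SPLIT_STRATEGY, "FIXED_RECORD_COUNT");
    props.setProperty(VenicePushJobConstants.PUBSUB_INPUT_MAX_SPLITS_PER_PARTITION, "4");
    props.setProperty(VenicePushJobConstants.PUBSUB_INPUT_SPLIT_TIME_WINDOW_IN_MINUTES, "60");
    return props;
  }
}
-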
KAFKA_INPUT_COMBINER_ENABLED
- See Also:
-
KAFKA_INPUT_COMPRESSION_BUILD_NEW_DICT_ENABLED
- See Also:
-
KAFKA_INPUT_SOURCE_TOPIC_CHUNKING_ENABLED
- See Also:
-
REWIND_TIME_IN_SECONDS_OVERRIDE
Optional. To use a rewind time different from the default store-level rewind time config for Kafka Input re-push, this property must be specified explicitly. It comes into play when the store-level default rewind time is too short or too long:
1. If the default rewind time config is too short (for example, 0 or several minutes), it could cause a data gap with re-push, since the push job itself could take several hours, and we would like to make sure the re-pushed version contains the same dataset as the source version.
2. If the default rewind time config is too long (such as 28 days), rewinding that far is very wasteful, since the time gap between the source version and the re-push version should be comparable to the re-push time if the whole ingestion pipeline is not lagging.
Automatically detecting the right rewind time for re-push is challenging for the following reasons:
1. For the Venice Aggregate use case, some colo could be lagging behind other prod colos, so if the re-push source is a fast colo, too short a rewind time could cause a data gap in the slower colos. Ideally, the slowest colo should be used as the re-push source.
2. For the Venice non-Aggregate use case, the ingestion pipeline includes the following phases:
2.1 The customer's Kafka aggregation and mirroring pipeline, which replicates the same data to all prod colos.
2.2 The Venice ingestion pipeline, which consumes the local real-time topic.
We have visibility into 2.2 but not 2.1, so we may need to work with the customer to understand how 2.1 can be measured, or use a long enough rewind time to mitigate all the potential issues. This property is available generically since it should be useful for the ETL+VPJ use case as well.
- See Also:
-
REWIND_EPOCH_TIME_IN_SECONDS_OVERRIDE
A timestamp to rewind to before replaying data. This config is ignored if rewind.time.in.seconds.override is provided. At push time, this config is used to fill in rewind.time.in.seconds.override by computing the current time minus rewind.epoch.time.in.seconds.override and storing the result there. With this in mind, a push policy of REWIND_FROM_SOP should be used in order to get behavior that makes sense to a user. A timestamp in the future is not valid and will result in an exception.
- See Also:
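A minimal sketch of the conversion described above, with the buffer from REWIND_EPOCH_TIME_BUFFER_IN_SECONDS_OVERRIDE (documented below) included; the method and variable names are illustrative, not Venice's actual code:

public final class RewindOverrideMath {
  // Converts an absolute epoch rewind point into the relative
  // rewind.time.in.seconds.override value, plus a safety buffer.
  static long toRewindSecondsOverride(long rewindEpochTimeInSeconds, long bufferInSeconds) {
    long nowInSeconds = System.currentTimeMillis() / 1000L;
    if (rewindEpochTimeInSeconds > nowInSeconds) {
      // A timestamp in the future is not a valid rewind point.
      throw new IllegalArgumentException("rewind epoch timestamp is in the future");
    }
    return (nowInSeconds - rewindEpochTimeInSeconds) + bufferInSeconds;
  }
}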
-
SUPPRESS_END_OF_PUSH_MESSAGE
A boolean config which, when enabled, suppresses sending the end-of-push message after data has been sent and skips polling for the job status to complete. Using this flag means a user must manually mark the job succeeded or failed.
- See Also:
-
DEFER_VERSION_SWAP
A boolean config which, when enabled, makes the push job wait for an external signal to trigger the version swap after buffer replay is complete.
- See Also:
-
D2_ZK_HOSTS_PREFIX
This config specifies the prefix for the d2 ZK hosts config. Configs of the form <prefix>.<regionName> are expected to be defined.
- See Also:
-
PARENT_CONTROLLER_REGION_NAME
This config specifies the region identifier where the parent controller is running.
- See Also:
-
REWIND_EPOCH_TIME_BUFFER_IN_SECONDS_OVERRIDE
Relates to REWIND_EPOCH_TIME_IN_SECONDS_OVERRIDE above. An overridable amount of buffer applied to the epoch (as the rewind isn't perfectly instantaneous). Defaults to 1 minute.
- See Also:
-
VENICE_DISCOVER_URL_PROP
In single-region mode, this must be a comma-separated list of child controller URLs or d2://<d2ServiceNameForChildController>. In multi-region mode, it must be a comma-separated list of parent controller URLs or d2://<d2ServiceNameForParentController>.
- See Also:
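For illustration, the minimal discovery-related properties of a push job might look like the sketch below; the d2 service name, store name, and input path are placeholders, and the package of VenicePushJobConstants is assumed:

import java.util.Properties;

import com.linkedin.venice.vpj.VenicePushJobConstants; // package assumed

public final class BasicPushPropsExample {
  public static Properties basicProps() {
    Properties props = new Properties();
    // Multi-region mode: point at the parent controllers via d2.
    props.setProperty(VenicePushJobConstants.VENICE_DISCOVER_URL_PROP, "d2://VeniceParentController");
    props.setProperty(VenicePushJobConstants.VENICE_STORE_NAME_PROP, "test-store");
    props.setProperty(VenicePushJobConstants.INPUT_PATH_PROP, "/user/test/input");
    return props;
  }
}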
-
SOURCE_GRID_FABRIC
An identifier of the data center, used to determine the Kafka URL and the child controllers that push jobs communicate with.
- See Also:
-
ENABLE_WRITE_COMPUTE
- See Also:
-
ENABLE_SSL
- See Also:
-
VENICE_STORE_NAME_PROP
- See Also:
-
INPUT_PATH_PROP
- See Also:
-
INPUT_PATH_LAST_MODIFIED_TIME
- See Also:
-
BATCH_NUM_BYTES_PROP
- See Also:
-
PATH_FILTER
public static final org.apache.hadoop.fs.PathFilter PATH_FILTER
Ignores HDFS files with the prefixes "_" and ".".
-
GLOB_FILTER_PATTERN
- See Also:
-
HADOOP_TMP_DIR
- See Also:
-
TEMP_DIR_PREFIX
- See Also:
-
PERMISSION_777
public static final org.apache.hadoop.fs.permission.FsPermission PERMISSION_777
-
PERMISSION_700
public static final org.apache.hadoop.fs.permission.FsPermission PERMISSION_700
-
VALUE_SCHEMA_ID_PROP
- See Also:
-
DERIVED_SCHEMA_ID_PROP
- See Also:
-
TOPIC_PROP
- See Also:
-
HADOOP_VALIDATE_SCHEMA_AND_BUILD_DICT_PREFIX
- See Also:
-
SSL_PREFIX
- See Also:
-
STORAGE_QUOTA_PROP
- See Also:
-
STORAGE_ENGINE_OVERHEAD_RATIO
- See Also:
-
VSON_PUSH
Deprecated.
- See Also:
-
COMPRESSION_STRATEGY
- See Also:
-
KAFKA_INPUT_SOURCE_COMPRESSION_STRATEGY
- See Also:
-
SSL_CONFIGURATOR_CLASS_CONFIG
- See Also:
-
SSL_KEY_STORE_PROPERTY_NAME
- See Also:
-
SSL_TRUST_STORE_PROPERTY_NAME
- See Also:
-
SSL_KEY_STORE_PASSWORD_PROPERTY_NAME
- See Also:
-
SSL_KEY_PASSWORD_PROPERTY_NAME
- See Also:
-
JOB_EXEC_URL
Defines the execution server's URL for easy access to the job execution during debugging. This can be any string that is meaningful to the execution environment.
- See Also:
-
JOB_SERVER_NAME
The short name of the server where the job runs.
- See Also:
-
JOB_EXEC_ID
The execution ID of this job, if it is part of a multi-step flow. Each step in the flow can have the same execution ID.
- See Also:
-
REDUCER_SPECULATIVE_EXECUTION_ENABLE
- See Also:
-
TELEMETRY_MESSAGE_INTERVAL
The interval, in number of messages, at which certain info is printed in the reducer logs.
- See Also:
-
ZSTD_COMPRESSION_LEVEL
Config to control the compression level for ZSTD dictionary compression.
- See Also:
-
DEFAULT_BATCH_BYTES_SIZE
public static final int DEFAULT_BATCH_BYTES_SIZE
- See Also:
-
DEFAULT_RE_PUSH_REWIND_IN_SECONDS_OVERRIDE
public static final long DEFAULT_RE_PUSH_REWIND_IN_SECONDS_OVERRIDE
The rewind override applied when performing a re-push, to prevent data loss. If the store's rewind config is set higher than 1 day, the store config is adopted instead; otherwise, the rewind config is overridden to 1 day unless the push job config overrides it.
- See Also:
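A sketch of the fallback logic this describes; the method is illustrative rather than Venice's actual code:

public final class RePushRewindDefaults {
  static final long ONE_DAY_IN_SECONDS = 24L * 60 * 60;

  // jobOverrideSeconds is null when the push job config does not override it.
  static long effectiveRewindSeconds(long storeRewindSeconds, Long jobOverrideSeconds) {
    if (jobOverrideSeconds != null) {
      return jobOverrideSeconds; // explicit push job override wins
    }
    // Adopt the store config when it is higher than 1 day; otherwise use 1 day.
    return Math.max(storeRewindSeconds, ONE_DAY_IN_SECONDS);
  }
}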
-
REPUSH_TTL_ENABLE
Config to control the TTL behaviors in repush.
- See Also:
-
REPUSH_TTL_POLICY
- See Also:
-
REPUSH_TTL_SECONDS
- See Also:
-
REPUSH_TTL_START_TIMESTAMP
- See Also:
-
RMD_SCHEMA_DIR
- See Also:
-
VALUE_SCHEMA_DIR
- See Also:
-
NOT_SET
public static final int NOT_SET
- See Also:
-
TARGETED_REGION_PUSH_ENABLED
Config to enable single targeted region push mode in VPJ. In this mode, the VPJ will only push data to a single region. The region is decided by the store config in
StoreInfo.getNativeReplicationSourceFabric()
. For pushing to multiple targeted regions, use the advanced mode; see
TARGETED_REGION_PUSH_LIST
.
- See Also:
-
HYBRID_BATCH_WRITE_OPTIMIZATION_ENABLED
Config to enable memtable-based ingestion for hybrid store batch pushes. In this mode, servers will not use the SST table writer to ingest batch data for hybrid stores. This helps prevent log compaction of control messages from speculative producers.
- See Also:
-
TARGETED_REGION_PUSH_LIST
This is an experimental config to specify a list of regions used for targeted region push in VPJ.
TARGETED_REGION_PUSH_ENABLED
has to be enabled to use this config. In this mode, the VPJ will only push data to the provided regions. The input is a comma-separated list of region names, e.g. "dc-0,dc-1,dc-2". For single targeted region push, see
TARGETED_REGION_PUSH_ENABLED
.
- See Also:
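A sketch of enabling this experimental multi-region mode via job properties; the region names are placeholders and the package of VenicePushJobConstants is assumed:

import java.util.Properties;

import com.linkedin.venice.vpj.VenicePushJobConstants; // package assumed

public final class TargetedRegionPushExample {
  public static Properties targetedRegionProps() {
    Properties props = new Properties();
    props.setProperty(VenicePushJobConstants.TARGETED_REGION_PUSH_ENABLED, "true");
    // Comma-separated region list, as described above.
    props.setProperty(VenicePushJobConstants.TARGETED_REGION_PUSH_LIST, "dc-0,dc-1");
    return props;
  }
}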
-
TARGETED_REGION_PUSH_WITH_DEFERRED_SWAP
Config to enable target region push with deferred version swap. In this mode, the VPJ will push data to all regions but only switch to the new version in a single region. The region is decided by the store config in
StoreInfo.getNativeReplicationSourceFabric()
unless a list of regions is passed in via targeted.region.push.list. After a specified wait time (default 1h), the remaining regions will switch to the new version.
- See Also:
-
DEFAULT_IS_DUPLICATED_KEY_ALLOWED
public static final boolean DEFAULT_IS_DUPLICATED_KEY_ALLOWED
- See Also:
-
MAP_REDUCE_PARTITIONER_CLASS_CONFIG
Config used only for tests. Should not be used at regular runtime.
- See Also:
-
UNCREATED_VERSION_NUMBER
public static final int UNCREATED_VERSION_NUMBER
Placeholder for a version number that is yet to be created.
- See Also:
-
DEFAULT_POLL_STATUS_INTERVAL_MS
public static final long DEFAULT_POLL_STATUS_INTERVAL_MS
- See Also:
-
DEFAULT_JOB_STATUS_IN_UNKNOWN_STATE_TIMEOUT_MS
public static final long DEFAULT_JOB_STATUS_IN_UNKNOWN_STATE_TIMEOUT_MS
The default total time we wait before failing a job if the job status stays in the UNKNOWN state.
- See Also:
-
NON_CRITICAL_EXCEPTION
- See Also:
-
COMPRESSION_DICTIONARY_SAMPLE_SIZE
Sample size to collect for building the dictionary. It can be assigned a max of 2GB, since
ZstdDictTrainer
in the ZSTD library takes the sample size as an int.
- See Also:
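For context, a minimal sketch of constructing zstd-jni's ZstdDictTrainer; both the total sample allocation and the dictionary size are int parameters, which is where the 2GB ceiling comes from (the sizes shown here are arbitrary):

import com.github.luben.zstd.ZstdDictTrainer;

public final class DictTrainerSketch {
  public static byte[] train(Iterable<byte[]> samples) {
    // First arg: total sample buffer size in bytes (an int, hence <= ~2GB);
    // second arg: maximum dictionary size in bytes.
    ZstdDictTrainer trainer = new ZstdDictTrainer(200 * 1024 * 1024, 64 * 1024);
    for (byte[] sample : samples) {
      trainer.addSample(sample); // returns false once the sample buffer is full
    }
    return trainer.trainSamples();
  }
}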
-
DEFAULT_COMPRESSION_DICTIONARY_SAMPLE_SIZE
public static final int DEFAULT_COMPRESSION_DICTIONARY_SAMPLE_SIZE
- See Also:
-
COMPRESSION_DICTIONARY_SIZE_LIMIT
Maximum final dictionary size. TODO: add more details about the current limits.
- See Also:
-
DATA_WRITER_COMPUTE_JOB_CLASS
Config to set the class for the DataWriter job. When using KIF, we currently fall back to MR mode. The class must extend
DataWriterComputeJob
and have a zero-arg constructor.
- See Also:
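For example, a custom implementation might be selected through job properties as below; the class name is hypothetical and the package of VenicePushJobConstants is assumed:

import java.util.Properties;

import com.linkedin.venice.vpj.VenicePushJobConstants; // package assumed

public final class DataWriterClassExample {
  public static Properties dataWriterProps() {
    Properties props = new Properties();
    // Hypothetical DataWriterComputeJob subclass with a zero-arg constructor.
    props.setProperty(VenicePushJobConstants.DATA_WRITER_COMPUTE_JOB_CLASS,
        "com.example.MySparkDataWriterJob");
    return props;
  }
}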
-
PUSH_TO_SEPARATE_REALTIME_TOPIC
- See Also:
-
ALLOW_REGULAR_PUSH_WITH_TTL_REPUSH
Currently, regular batch pushes are not compatible with TTL re-push enabled stores, because a regular batch push does not provide any RMD to be used for TTL. You can use TIMESTAMP_FIELD_PROP to provide a record-level timestamp to perform a compatible batch push, or use this setting to override the batch push and TTL re-push check.
- See Also:
-