Package com.linkedin.venice.vpj
Class VenicePushJobConstants
java.lang.Object
com.linkedin.venice.vpj.VenicePushJobConstants
-
Field Summary
Modifier and TypeFieldDescriptionstatic final String
static final String
static final String
Sample size to collect for building dictionary: Can be assigned a max of 2GB asZstdDictTrainer
in ZSTD library takes in sample size as intstatic final String
Maximum final dictionary size TODO add more details about the current limitsstatic final String
Config to enable/disable the feature to collect extra metrics wrt compression.static final String
static final String
static final String
This config specifies the prefix for d2 zk hosts config.static final String
Config to set the class for the DataWriter job.static final int
static final int
static final boolean
static final boolean
static final boolean
static final long
The default total time we wait before failing a job if the job status stays in UNKNOWN state.static final String
static final long
static final long
The rewind override when performing re-push to prevent data loss; if the store has higher rewind config setting than 1 days, adopt the store config instead; otherwise, override the rewind config to 1 day if push job config doesn't try to override it.static final boolean
static final boolean
static final String
static final String
This config is a boolean which waits for an external signal to trigger version swap after buffer replay is complete.static final String
static final String
static final String
static final String
static final String
static final String
static final String
static final String
static final String
static final String
static final String
static final String
static final String
static final String
static final String
The execution ID of the execution if this job is a part of a multi-step flow.static final String
This will define the execution servers url for easy access to the job execution during debugging.static final String
The short name of the server where the job runsstatic final String
static final String
static final String
static final String
static final String
static final String
static final String
static final String
static final String
TODO: consider to automatically discover the source topic for the specified store.static final String
static final String
static final String
static final String
static final String
static final String
static final String
static final String
Config used only for tests.static final int
Known zstd lib issue which crashes if the input sample is too small.static final String
static final int
static final String
This config specifies the region identifier where parent controller is runningstatic final String
static final org.apache.hadoop.fs.PathFilter
ignore hdfs files with prefix "_" and "."static final org.apache.hadoop.fs.permission.FsPermission
static final org.apache.hadoop.fs.permission.FsPermission
static final String
static final String
static final String
Config to enable the service that uploads push job statuses to the controller usingControllerClient.uploadPushJobStatus()
, the job status is then packaged and sent to dedicated Kafka channel.static final String
static final String
static final String
Config to control the TTL behaviors in repush.static final String
static final String
static final String
static final String
Relates to the above argument.static final String
A time stamp specified to rewind to before replaying data.static final String
Optional.static final String
static final String
static final String
static final boolean
static final String
static final String
An identifier of the data center which is used to determine the Kafka URL and child controllers that push jobs communicate withstatic final String
Configs used to enable Kafka Input.static final String
static final String
static final String
static final String
static final String
static final String
static final String
static final String
static final String
static final String
This config is a boolean which suppresses submitting the end of push message after data has been sent and does not poll for the status of the job to complete.static final String
static final String
static final String
static final String
Config to enable single targeted region push mode in VPJ.static final String
This is experimental config to specify a list of regions used for targeted region push in VPJ.static final String
Config to enable target region push with deferred version swap In this mode, the VPJ will push data to all regions and only switch to the new version in a single region.static final String
The interval of number of messages upon which certain info is printed in the reducer logs.static final String
static final String
static final int
Placeholder for version number that is yet to be created.static final String
static final String
Config to enable/disable using mapper to do the below which are currently done in VPJ driver
1.static final String
Location and key to store the output ofValidateSchemaAndBuildDictMapper
and retrieve it back when USE_MAPPER_TO_BUILD_DICTIONARY is enabledstatic final String
static final String
static final String
static final String
static final String
static final String
In single-region mode, this must be a comma-separated list of child controller URLs or d2://<d2ServiceNameForChildController> In multi-region mode, it must be a comma-separated list of parent controller URLs or d2://<d2ServiceNameForParentController>static final String
static final String
Deprecated.static final String
Config to control the Compression Level for ZSTD Dictionary Compression.static final String
Configs to pass toAbstractVeniceMapper
based on the input configs and Dictionary training statusstatic final String
-
Method Summary
-
Field Details
-
LEGACY_AVRO_KEY_FIELD_PROP
- See Also:
-
LEGACY_AVRO_VALUE_FIELD_PROP
- See Also:
-
KEY_FIELD_PROP
- See Also:
-
VALUE_FIELD_PROP
- See Also:
-
DEFAULT_KEY_FIELD_PROP
- See Also:
-
DEFAULT_VALUE_FIELD_PROP
- See Also:
-
DEFAULT_SSL_ENABLED
public static final boolean DEFAULT_SSL_ENABLED- See Also:
-
SCHEMA_STRING_PROP
- See Also:
-
KAFKA_SOURCE_KEY_SCHEMA_STRING_PROP
- See Also:
-
EXTENDED_SCHEMA_VALIDITY_CHECK_ENABLED
- See Also:
-
DEFAULT_EXTENDED_SCHEMA_VALIDITY_CHECK_ENABLED
public static final boolean DEFAULT_EXTENDED_SCHEMA_VALIDITY_CHECK_ENABLED- See Also:
-
UPDATE_SCHEMA_STRING_PROP
- See Also:
-
SPARK_NATIVE_INPUT_FORMAT_ENABLED
- See Also:
-
FILE_KEY_SCHEMA
- See Also:
-
FILE_VALUE_SCHEMA
- See Also:
-
INCREMENTAL_PUSH
- See Also:
-
GENERATE_PARTIAL_UPDATE_RECORD_FROM_INPUT
- See Also:
-
PARTITION_COUNT
- See Also:
-
ALLOW_DUPLICATE_KEY
- See Also:
-
POLL_STATUS_RETRY_ATTEMPTS
- See Also:
-
CONTROLLER_REQUEST_RETRY_ATTEMPTS
- See Also:
-
POLL_JOB_STATUS_INTERVAL_MS
- See Also:
-
JOB_STATUS_IN_UNKNOWN_STATE_TIMEOUT_MS
- See Also:
-
SEND_CONTROL_MESSAGES_DIRECTLY
- See Also:
-
SOURCE_ETL
- See Also:
-
ETL_VALUE_SCHEMA_TRANSFORMATION
- See Also:
-
SYSTEM_SCHEMA_READER_ENABLED
- See Also:
-
SYSTEM_SCHEMA_CLUSTER_D2_SERVICE_NAME
- See Also:
-
SYSTEM_SCHEMA_CLUSTER_D2_ZK_HOST
- See Also:
-
COMPRESSION_METRIC_COLLECTION_ENABLED
Config to enable/disable the feature to collect extra metrics wrt compression. Enabling this collects metrics for all compression strategies regardless of the configured compression strategy. This means: zstd dictionary will be created even ifCompressionStrategy.ZSTD_WITH_DICT
is not the configured store compression strategy (referVenicePushJob.shouldBuildZstdCompressionDictionary(com.linkedin.venice.hadoop.PushJobSetting, boolean)
)
This config also gets evaluated inVenicePushJob.evaluateCompressionMetricCollectionEnabled(com.linkedin.venice.hadoop.PushJobSetting, boolean)
- See Also:
-
DEFAULT_COMPRESSION_METRIC_COLLECTION_ENABLED
public static final boolean DEFAULT_COMPRESSION_METRIC_COLLECTION_ENABLED- See Also:
-
USE_MAPPER_TO_BUILD_DICTIONARY
Config to enable/disable using mapper to do the below which are currently done in VPJ driver
1. validate schema,
2. collect the input data size
3. build dictionary (if needed: referVenicePushJob.shouldBuildZstdCompressionDictionary(com.linkedin.venice.hadoop.PushJobSetting, boolean)
)
This new mapper was added because the sample collection for Zstd dictionary is currently in-memory and to help play around with the sample size and also to support future enhancements if needed.
Currently, the plan is to only have it available for MapReduce compute engine and remove it eventually as we remove the MapReduce-related codes.- See Also:
-
DEFAULT_USE_MAPPER_TO_BUILD_DICTIONARY
public static final boolean DEFAULT_USE_MAPPER_TO_BUILD_DICTIONARY- See Also:
-
MINIMUM_NUMBER_OF_SAMPLES_REQUIRED_TO_BUILD_ZSTD_DICTIONARY
public static final int MINIMUM_NUMBER_OF_SAMPLES_REQUIRED_TO_BUILD_ZSTD_DICTIONARYKnown zstd lib issue which crashes if the input sample is too small. So adding a preventive check to skip training the dictionary in such cases using a minimum limit of 20. Keeping it simple and hard coding it as if this check doesn't prevent some edge cases then we can disable the feature itself- See Also:
-
ZSTD_DICTIONARY_CREATION_REQUIRED
Configs to pass toAbstractVeniceMapper
based on the input configs and Dictionary training status- See Also:
-
ZSTD_DICTIONARY_CREATION_SUCCESS
- See Also:
-
VALIDATE_SCHEMA_AND_BUILD_DICT_MAPPER_OUTPUT_DIRECTORY
Location and key to store the output ofValidateSchemaAndBuildDictMapper
and retrieve it back when USE_MAPPER_TO_BUILD_DICTIONARY is enabled- See Also:
-
VALIDATE_SCHEMA_AND_BUILD_DICTIONARY_MAPPER_OUTPUT_FILE_PREFIX
- See Also:
-
VALIDATE_SCHEMA_AND_BUILD_DICTIONARY_MAPPER_OUTPUT_FILE_EXTENSION
- See Also:
-
KEY_ZSTD_COMPRESSION_DICTIONARY
- See Also:
-
KEY_INPUT_FILE_DATA_SIZE
- See Also:
-
SOURCE_KAFKA
Configs used to enable Kafka Input.- See Also:
-
KAFKA_INPUT_TOPIC
TODO: consider to automatically discover the source topic for the specified store. We need to be careful in the following scenarios: 1. Not all the prod colos are using the same current version if the previous push experiences a partial failure. 2. We might want to re-push from a backup version, which should be unlikely.- See Also:
-
KAFKA_INPUT_FABRIC
- See Also:
-
KAFKA_INPUT_BROKER_URL
- See Also:
-
KAFKA_INPUT_MAX_RECORDS_PER_MAPPER
- See Also:
-
KAFKA_INPUT_COMBINER_ENABLED
- See Also:
-
KAFKA_INPUT_COMPRESSION_BUILD_NEW_DICT_ENABLED
- See Also:
-
KAFKA_INPUT_SOURCE_TOPIC_CHUNKING_ENABLED
- See Also:
-
REWIND_TIME_IN_SECONDS_OVERRIDE
Optional. If we want to use a different rewind time from the default store-level rewind time config for Kafka Input re-push, the following property needs to specified explicitly. This property comes to play when the default rewind time configured in store-level is too short or too long. 1. If the default rewind time config is too short (for example 0 or several mins), it could cause data gap with re-push since the push job itself could take several hours, and we would like to make sure the re-pushed version will contain the same dataset as the source version. 2. If the default rewind time config is too long (such as 28 days), it will be a big waste to rewind so much time since the time gap between the source version and the re-push version should be comparable to the re-push time if the whole ingestion pipeline is not lagging. There are some challenges to automatically detect the right rewind time for re-push because of the following reasons: 1. For Venice Aggregate use case, some colo could be lagging behind other prod colos, so if the re-push source is from a fast colo, too short rewind time could cause a data gap in the slower colos. Ideally, it is good to use the slowest colo as the re-push source. 2. For Venice non-Aggregate use case, the ingestion pipeline will include the following several phases: 2.1 Customer's Kafka aggregation and mirroring pipeline to replicate the same data to all prod colos. 2.2 Venice Ingestion pipeline to consume the local real-time topic. We have visibility to 2.2, but not 2.1, so we may need to work with customer to understand how 2.1 can be measured or use a long enough rewind time to mitigate all the potential issues. Make this property available in generic since it should be useful for ETL+VPJ use case as well.- See Also:
-
REWIND_EPOCH_TIME_IN_SECONDS_OVERRIDE
A time stamp specified to rewind to before replaying data. This config is ignored if rewind.time.in.seconds.override is provided. This config at time of push will be leveraged to fill in the rewind.time.in.seconds.override by taking System.currentTime - rewind.epoch.time.in.seconds.override and storing the result in rewind.time.in.seconds.override. With this in mind, a push policy of REWIND_FROM_SOP should be used in order to get a behavior that makes sense to a user. A timestamp that is in the future is not valid and will result in an exception.- See Also:
-
SUPPRESS_END_OF_PUSH_MESSAGE
This config is a boolean which suppresses submitting the end of push message after data has been sent and does not poll for the status of the job to complete. Using this flag means that a user must manually mark the job success or failed.- See Also:
-
DEFER_VERSION_SWAP
This config is a boolean which waits for an external signal to trigger version swap after buffer replay is complete.- See Also:
-
D2_ZK_HOSTS_PREFIX
This config specifies the prefix for d2 zk hosts config. Configs of type <prefix>.<regionName> are expected to be defined.- See Also:
-
PARENT_CONTROLLER_REGION_NAME
This config specifies the region identifier where parent controller is running- See Also:
-
REWIND_EPOCH_TIME_BUFFER_IN_SECONDS_OVERRIDE
Relates to the above argument. An overridable amount of buffer to be applied to the epoch (as the rewind isn't perfectly instantaneous). Defaults to 1 minute.- See Also:
-
VENICE_DISCOVER_URL_PROP
In single-region mode, this must be a comma-separated list of child controller URLs or d2://<d2ServiceNameForChildController> In multi-region mode, it must be a comma-separated list of parent controller URLs or d2://<d2ServiceNameForParentController>- See Also:
-
SOURCE_GRID_FABRIC
An identifier of the data center which is used to determine the Kafka URL and child controllers that push jobs communicate with- See Also:
-
ENABLE_WRITE_COMPUTE
- See Also:
-
ENABLE_SSL
- See Also:
-
VENICE_STORE_NAME_PROP
- See Also:
-
INPUT_PATH_PROP
- See Also:
-
INPUT_PATH_LAST_MODIFIED_TIME
- See Also:
-
BATCH_NUM_BYTES_PROP
- See Also:
-
PATH_FILTER
public static final org.apache.hadoop.fs.PathFilter PATH_FILTERignore hdfs files with prefix "_" and "." -
GLOB_FILTER_PATTERN
- See Also:
-
HADOOP_TMP_DIR
- See Also:
-
TEMP_DIR_PREFIX
- See Also:
-
PERMISSION_777
public static final org.apache.hadoop.fs.permission.FsPermission PERMISSION_777 -
PERMISSION_700
public static final org.apache.hadoop.fs.permission.FsPermission PERMISSION_700 -
VALUE_SCHEMA_ID_PROP
- See Also:
-
DERIVED_SCHEMA_ID_PROP
- See Also:
-
TOPIC_PROP
- See Also:
-
HADOOP_VALIDATE_SCHEMA_AND_BUILD_DICT_PREFIX
- See Also:
-
SSL_PREFIX
- See Also:
-
STORAGE_QUOTA_PROP
- See Also:
-
STORAGE_ENGINE_OVERHEAD_RATIO
- See Also:
-
VSON_PUSH
Deprecated.- See Also:
-
KAFKA_SECURITY_PROTOCOL
- See Also:
-
COMPRESSION_STRATEGY
- See Also:
-
KAFKA_INPUT_SOURCE_COMPRESSION_STRATEGY
- See Also:
-
SSL_CONFIGURATOR_CLASS_CONFIG
- See Also:
-
SSL_KEY_STORE_PROPERTY_NAME
- See Also:
-
SSL_TRUST_STORE_PROPERTY_NAME
- See Also:
-
SSL_KEY_STORE_PASSWORD_PROPERTY_NAME
- See Also:
-
SSL_KEY_PASSWORD_PROPERTY_NAME
- See Also:
-
JOB_EXEC_URL
This will define the execution servers url for easy access to the job execution during debugging. This can be any string that is meaningful to the execution environment.- See Also:
-
JOB_SERVER_NAME
The short name of the server where the job runs- See Also:
-
JOB_EXEC_ID
The execution ID of the execution if this job is a part of a multi-step flow. Each step in the flow can have the same execution id- See Also:
-
PUSH_JOB_STATUS_UPLOAD_ENABLE
Config to enable the service that uploads push job statuses to the controller usingControllerClient.uploadPushJobStatus()
, the job status is then packaged and sent to dedicated Kafka channel.- See Also:
-
REDUCER_SPECULATIVE_EXECUTION_ENABLE
- See Also:
-
TELEMETRY_MESSAGE_INTERVAL
The interval of number of messages upon which certain info is printed in the reducer logs.- See Also:
-
ZSTD_COMPRESSION_LEVEL
Config to control the Compression Level for ZSTD Dictionary Compression.- See Also:
-
DEFAULT_BATCH_BYTES_SIZE
public static final int DEFAULT_BATCH_BYTES_SIZE- See Also:
-
SORTED
public static final boolean SORTED- See Also:
-
DEFAULT_RE_PUSH_REWIND_IN_SECONDS_OVERRIDE
public static final long DEFAULT_RE_PUSH_REWIND_IN_SECONDS_OVERRIDEThe rewind override when performing re-push to prevent data loss; if the store has higher rewind config setting than 1 days, adopt the store config instead; otherwise, override the rewind config to 1 day if push job config doesn't try to override it.- See Also:
-
REPUSH_TTL_ENABLE
Config to control the TTL behaviors in repush.- See Also:
-
REPUSH_TTL_POLICY
- See Also:
-
REPUSH_TTL_SECONDS
- See Also:
-
REPUSH_TTL_START_TIMESTAMP
- See Also:
-
RMD_SCHEMA_DIR
- See Also:
-
VALUE_SCHEMA_DIR
- See Also:
-
NOT_SET
public static final int NOT_SET- See Also:
-
TARGETED_REGION_PUSH_ENABLED
Config to enable single targeted region push mode in VPJ. In this mode, the VPJ will only push data to a single region. The single region is decided by the store config inStoreInfo.getNativeReplicationSourceFabric()
}. For multiple targeted regions push, may use the advanced mode. SeeTARGETED_REGION_PUSH_LIST
.- See Also:
-
TARGETED_REGION_PUSH_LIST
This is experimental config to specify a list of regions used for targeted region push in VPJ.TARGETED_REGION_PUSH_ENABLED
has to be enabled to use this config. In this mode, the VPJ will only push data to the provided regions. The input is comma separated list of region names, e.g. "dc-0,dc-1,dc-2". For single targeted region push, seeTARGETED_REGION_PUSH_ENABLED
.- See Also:
-
TARGETED_REGION_PUSH_WITH_DEFERRED_SWAP
Config to enable target region push with deferred version swap In this mode, the VPJ will push data to all regions and only switch to the new version in a single region. The single region is decided by the store config inStoreInfo.getNativeReplicationSourceFabric()
} unless a list of regions is passed in from targeted.region.push.list After a specified wait time (default 1h), the remaining regions will switch to the new version.- See Also:
-
DEFAULT_IS_DUPLICATED_KEY_ALLOWED
public static final boolean DEFAULT_IS_DUPLICATED_KEY_ALLOWED- See Also:
-
MAP_REDUCE_PARTITIONER_CLASS_CONFIG
Config used only for tests. Should not be used at regular runtime.- See Also:
-
UNCREATED_VERSION_NUMBER
public static final int UNCREATED_VERSION_NUMBERPlaceholder for version number that is yet to be created.- See Also:
-
DEFAULT_POLL_STATUS_INTERVAL_MS
public static final long DEFAULT_POLL_STATUS_INTERVAL_MS- See Also:
-
DEFAULT_JOB_STATUS_IN_UNKNOWN_STATE_TIMEOUT_MS
public static final long DEFAULT_JOB_STATUS_IN_UNKNOWN_STATE_TIMEOUT_MSThe default total time we wait before failing a job if the job status stays in UNKNOWN state.- See Also:
-
NON_CRITICAL_EXCEPTION
- See Also:
-
COMPRESSION_DICTIONARY_SAMPLE_SIZE
Sample size to collect for building dictionary: Can be assigned a max of 2GB asZstdDictTrainer
in ZSTD library takes in sample size as int- See Also:
-
DEFAULT_COMPRESSION_DICTIONARY_SAMPLE_SIZE
public static final int DEFAULT_COMPRESSION_DICTIONARY_SAMPLE_SIZE- See Also:
-
COMPRESSION_DICTIONARY_SIZE_LIMIT
Maximum final dictionary size TODO add more details about the current limits- See Also:
-
DATA_WRITER_COMPUTE_JOB_CLASS
Config to set the class for the DataWriter job. When using KIF, we currently will continue to fall back to MR mode. The class must extendDataWriterComputeJob
and have a zero-arg constructor.- See Also:
-
PUSH_TO_SEPARATE_REALTIME_TOPIC
- See Also:
-