Class VeniceReducer
java.lang.Object
com.linkedin.venice.hadoop.task.datawriter.AbstractDataWriterTask
com.linkedin.venice.hadoop.task.datawriter.AbstractPartitionWriter
com.linkedin.venice.hadoop.mapreduce.datawriter.reduce.VeniceReducer
- All Implemented Interfaces:
Closeable, AutoCloseable, org.apache.hadoop.io.Closeable, org.apache.hadoop.mapred.JobConfigurable, org.apache.hadoop.mapred.Reducer<org.apache.hadoop.io.BytesWritable, org.apache.hadoop.io.BytesWritable, org.apache.hadoop.io.BytesWritable, org.apache.hadoop.io.BytesWritable>
- Direct Known Subclasses:
VeniceKafkaInputReducer
public class VeniceReducer
extends AbstractPartitionWriter
implements org.apache.hadoop.mapred.Reducer<org.apache.hadoop.io.BytesWritable, org.apache.hadoop.io.BytesWritable, org.apache.hadoop.io.BytesWritable, org.apache.hadoop.io.BytesWritable>
VeniceReducer is in charge of producing the messages to the Kafka broker.
Since VeniceMRPartitioner uses the same logic as DefaultVenicePartitioner,
all the messages in the same reducer belong to the same topic partition.
The reason to introduce a reduce phase is that BDB-JE benefits from sorted input in the following ways:
1. BDB-JE won't generate as many BINDeltas, since it won't touch a lot of BINs at a time;
2. The overall BDB-JE insert rate will improve a lot, since disk usage will be greatly reduced (each BINDelta will be much smaller than before).
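The partition-per-reducer property above can be illustrated with a minimal, self-contained sketch. `PartitionSketch.getPartitionId` is a hypothetical stand-in, not the Venice API, for an MD5-based partitioner in the spirit of DefaultVenicePartitioner: hashing the serialized key deterministically means every message for a given key maps to the same partition, and hence to the same reducer.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class PartitionSketch {
    // Hypothetical stand-in for an MD5-based partitioner: hash the serialized
    // key and fold the first digest bytes into a partition id.
    static int getPartitionId(byte[] keyBytes, int numPartitions) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(keyBytes);
            int hash = 0;
            for (int i = 0; i < 4; i++) {
                hash = (hash << 8) | (digest[i] & 0xFF);
            }
            return Math.abs(hash % numPartitions);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 should always be available", e);
        }
    }

    public static void main(String[] args) {
        byte[] key = "user-123".getBytes(StandardCharsets.UTF_8);
        int p1 = getPartitionId(key, 16);
        int p2 = getPartitionId(key, 16);
        System.out.println(p1 == p2);           // same key always maps to the same partition
        System.out.println(p1 >= 0 && p1 < 16); // partition id stays within range
    }
}
```

Because the mapping is a pure function of the key bytes, a MapReduce shuffle keyed the same way delivers each topic partition's data, sorted by key, to exactly one reducer.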
Nested Class Summary
Nested classes/interfaces inherited from class com.linkedin.venice.hadoop.task.datawriter.AbstractPartitionWriter
AbstractPartitionWriter.ChildWriterProducerCallback, AbstractPartitionWriter.DuplicateKeyPrinter, AbstractPartitionWriter.PartitionWriterProducerCallback, AbstractPartitionWriter.VeniceRecordWithMetadata, AbstractPartitionWriter.VeniceWriterMessage
Field Summary
Fields
MAP_REDUCE_JOB_ID_PROP
Fields inherited from class com.linkedin.venice.hadoop.task.datawriter.AbstractDataWriterTask
TASK_ID_NOT_SET
Constructor Summary
Constructors
VeniceReducer()
Method Summary
Modifier and Type / Method / Description
void configure(org.apache.hadoop.mapred.JobConf job)
protected void configureTask(VeniceProperties props) - Allow implementations of this class to configure task-specific stuff.
protected AbstractVeniceWriter<byte[],byte[],byte[]> createBasicVeniceWriter
protected PubSubProducerCallback getCallback
protected DataWriterTaskTracker getDataWriterTaskTracker
protected boolean getExceedQuotaFlag()
protected org.apache.hadoop.mapred.JobConf getJobConf()
protected long getTotalIncomingDataSizeInBytes() - Return the size of serialized key and serialized value in bytes across the entire dataset.
protected boolean hasReportedFailure(DataWriterTaskTracker dataWriterTaskTracker, boolean isDuplicateKeyAllowed)
void reduce(org.apache.hadoop.io.BytesWritable key, Iterator<org.apache.hadoop.io.BytesWritable> values, org.apache.hadoop.mapred.OutputCollector<org.apache.hadoop.io.BytesWritable, org.apache.hadoop.io.BytesWritable> output, org.apache.hadoop.mapred.Reporter reporter)
protected void setExceedQuota(boolean exceedQuota)
protected void setHadoopJobClientProvider(HadoopJobClientProvider hadoopJobClientProvider)
protected void setVeniceWriter(AbstractVeniceWriter veniceWriter)
protected void setVeniceWriterFactory
Methods inherited from class com.linkedin.venice.hadoop.task.datawriter.AbstractPartitionWriter
close, extract, getDerivedValueSchemaId, getRmdSchema, getVeniceWriterFactory, initDuplicateKeyPrinter, isEnableWriteCompute, logMessageProgress, processValuesForKey, recordMessageErrored
Methods inherited from class com.linkedin.venice.hadoop.task.datawriter.AbstractDataWriterTask
configure, getEngineTaskConfigProvider, getPartitionCount, getTaskId, isChunkingEnabled, isRmdChunkingEnabled, setChunkingEnabled
-
Field Details
-
MAP_REDUCE_JOB_ID_PROP
- See Also:
-
-
Constructor Details
-
VeniceReducer
public VeniceReducer()
-
-
Method Details
-
getJobConf
protected org.apache.hadoop.mapred.JobConf getJobConf()
configure
public void configure(org.apache.hadoop.mapred.JobConf job)
- Specified by:
configure in interface org.apache.hadoop.mapred.JobConfigurable
-
reduce
public void reduce(org.apache.hadoop.io.BytesWritable key, Iterator<org.apache.hadoop.io.BytesWritable> values, org.apache.hadoop.mapred.OutputCollector<org.apache.hadoop.io.BytesWritable, org.apache.hadoop.io.BytesWritable> output, org.apache.hadoop.mapred.Reporter reporter)
- Specified by:
reduce in interface org.apache.hadoop.mapred.Reducer<org.apache.hadoop.io.BytesWritable, org.apache.hadoop.io.BytesWritable, org.apache.hadoop.io.BytesWritable, org.apache.hadoop.io.BytesWritable>
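The shape of this reduce contract can be mimicked in plain Java with no Hadoop dependency. `ReduceShapeSketch.reduceKey` is a hypothetical helper, not part of Venice: it assumes, purely for the sketch, that the first serialized value for a key is the one produced and any further values are counted as duplicates of that key.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class ReduceShapeSketch {
    // Hypothetical stand-in for the reduce(...) contract above: one call per
    // key, with an iterator over every serialized value emitted for that key.
    static byte[] reduceKey(byte[] key, Iterator<byte[]> values) {
        byte[] chosen = values.next(); // assume the first value wins
        int duplicates = 0;
        while (values.hasNext()) {     // any remaining values share the same key
            values.next();
            duplicates++;
        }
        System.out.println("key length=" + key.length + ", duplicate values=" + duplicates);
        return chosen;
    }

    public static void main(String[] args) {
        List<byte[]> values = Arrays.asList("v1".getBytes(), "v2".getBytes());
        byte[] chosen = reduceKey("k".getBytes(), values.iterator());
        System.out.println(new String(chosen)); // v1
    }
}
```

In the real reducer, each chosen key/value pair would be handed to the Venice writer for the partition rather than returned, and duplicate-key handling is governed by AbstractPartitionWriter (see DuplicateKeyPrinter).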
-
getDataWriterTaskTracker
- Overrides:
getDataWriterTaskTracker in class AbstractPartitionWriter
-
getExceedQuotaFlag
protected boolean getExceedQuotaFlag()
- Overrides:
getExceedQuotaFlag in class AbstractPartitionWriter
-
setVeniceWriter
- Overrides:
setVeniceWriter in class AbstractPartitionWriter
-
setExceedQuota
protected void setExceedQuota(boolean exceedQuota)
- Overrides:
setExceedQuota in class AbstractPartitionWriter
-
hasReportedFailure
protected boolean hasReportedFailure(DataWriterTaskTracker dataWriterTaskTracker, boolean isDuplicateKeyAllowed)
- Overrides:
hasReportedFailure in class AbstractPartitionWriter
-
getCallback
- Overrides:
getCallback in class AbstractPartitionWriter
-
configureTask
Description copied from class: AbstractDataWriterTask
Allow implementations of this class to configure task-specific stuff.
- Overrides:
configureTask in class AbstractPartitionWriter
- Parameters:
props - the job props that the task was configured with.
-
getTotalIncomingDataSizeInBytes
protected long getTotalIncomingDataSizeInBytes()
Description copied from class: AbstractPartitionWriter
Return the size of serialized key and serialized value in bytes across the entire dataset. This is an optimization to skip writing the data to Kafka and reduce the load on Kafka and Venice storage nodes. Not all engines can support fetching this information during the execution of the job (e.g. Spark), but we can live with it for now. The quota is checked again in the Driver after the completion of the DataWriter job, and it will kill the VenicePushJob soon after.
- Overrides:
getTotalIncomingDataSizeInBytes in class AbstractPartitionWriter
- Returns:
the size of serialized key and serialized value in bytes across the entire dataset
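The quota optimization described above can be sketched with hypothetical names (`exceedsQuota` and the byte figures are illustrative, not Venice APIs): if the total serialized key/value size already exceeds the store's quota, the writer can report failure early and skip producing to Kafka entirely.

```java
public class QuotaCheckSketch {
    // Hypothetical helper mirroring the early quota check: compare the total
    // serialized key+value size of the dataset against the store's quota.
    static boolean exceedsQuota(long totalIncomingDataSizeInBytes, long storageQuotaInBytes) {
        return totalIncomingDataSizeInBytes > storageQuotaInBytes;
    }

    public static void main(String[] args) {
        long total = 5L * 1024 * 1024 * 1024; // 5 GiB of serialized key/value data
        long quota = 4L * 1024 * 1024 * 1024; // 4 GiB store quota (illustrative)
        System.out.println(exceedsQuota(total, quota)); // true: skip writes, report failure
    }
}
```

This is only a fail-fast shortcut; as the description notes, the authoritative quota check still happens in the Driver after the DataWriter job completes.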
-
setHadoopJobClientProvider
-
createBasicVeniceWriter
- Overrides:
createBasicVeniceWriter in class AbstractPartitionWriter
-
setVeniceWriterFactory
- Overrides:
setVeniceWriterFactory in class AbstractPartitionWriter
-