Class VeniceReducer
java.lang.Object
  com.linkedin.venice.hadoop.task.datawriter.AbstractDataWriterTask
    com.linkedin.venice.hadoop.task.datawriter.AbstractPartitionWriter
      com.linkedin.venice.hadoop.mapreduce.datawriter.reduce.VeniceReducer

All Implemented Interfaces:
java.io.Closeable, java.lang.AutoCloseable, org.apache.hadoop.io.Closeable, org.apache.hadoop.mapred.JobConfigurable, org.apache.hadoop.mapred.Reducer<org.apache.hadoop.io.BytesWritable,org.apache.hadoop.io.BytesWritable,org.apache.hadoop.io.BytesWritable,org.apache.hadoop.io.BytesWritable>

Direct Known Subclasses:
VeniceKafkaInputReducer
public class VeniceReducer extends AbstractPartitionWriter implements org.apache.hadoop.mapred.Reducer<org.apache.hadoop.io.BytesWritable,org.apache.hadoop.io.BytesWritable,org.apache.hadoop.io.BytesWritable,org.apache.hadoop.io.BytesWritable>
VeniceReducer is in charge of producing the messages to the Kafka broker. Since VeniceMRPartitioner uses the same logic as DefaultVenicePartitioner, all the messages in the same reducer belong to the same topic partition. The reason to introduce a reduce phase is that BDB-JE benefits from sorted input in the following ways:
1. BDB-JE won't generate as many BINDeltas, since it won't touch a lot of BINs at a time.
2. The overall BDB-JE insert rate will improve a lot, since disk usage will be reduced significantly (BINDeltas will be much smaller than before).
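Example usage (illustrative only): in practice the DataWriter MapReduce job is assembled by VenicePushJob, so the sketch below merely shows how VeniceReducer slots into an old-style org.apache.hadoop.mapred job. The wrapper class name, job name, and reduce-task count are hypothetical, and the Venice-specific properties that configure(JobConf) expects are omitted.

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.mapred.JobConf;

import com.linkedin.venice.hadoop.mapreduce.datawriter.reduce.VeniceReducer;

public class VeniceReducerJobWiring {
  public static JobConf buildJobConf() {
    JobConf conf = new JobConf();
    conf.setJobName("venice-data-writer"); // hypothetical job name

    // The map phase emits serialized (key, value) byte pairs; the reduce phase
    // hands each sorted key group to VeniceReducer, which produces to the Kafka
    // topic partition chosen by VeniceMRPartitioner.
    conf.setMapOutputKeyClass(BytesWritable.class);
    conf.setMapOutputValueClass(BytesWritable.class);
    conf.setOutputKeyClass(BytesWritable.class);
    conf.setOutputValueClass(BytesWritable.class);

    conf.setReducerClass(VeniceReducer.class);
    // Hypothetical partition count: one reducer per Venice partition keeps each
    // reducer's output confined to a single topic partition.
    conf.setNumReduceTasks(16);
    return conf;
  }
}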
Nested Class Summary
Nested classes/interfaces inherited from class com.linkedin.venice.hadoop.task.datawriter.AbstractPartitionWriter
AbstractPartitionWriter.DuplicateKeyPrinter, AbstractPartitionWriter.PartitionWriterProducerCallback, AbstractPartitionWriter.VeniceWriterMessage
Field Summary
static java.lang.String MAP_REDUCE_JOB_ID_PROP

Fields inherited from class com.linkedin.venice.hadoop.task.datawriter.AbstractDataWriterTask
TASK_ID_NOT_SET
Constructor Summary
VeniceReducer()
Method Summary
void configure(org.apache.hadoop.mapred.JobConf job)
protected void configureTask(VeniceProperties props)
    Allow implementations of this class to configure task-specific stuff.
protected PubSubProducerCallback getCallback()
protected DataWriterTaskTracker getDataWriterTaskTracker()
protected boolean getExceedQuotaFlag()
protected org.apache.hadoop.mapred.JobConf getJobConf()
protected long getTotalIncomingDataSizeInBytes()
    Return the size of serialized key and serialized value in bytes across the entire dataset.
protected boolean hasReportedFailure(DataWriterTaskTracker dataWriterTaskTracker, boolean isDuplicateKeyAllowed)
void reduce(org.apache.hadoop.io.BytesWritable key, java.util.Iterator<org.apache.hadoop.io.BytesWritable> values, org.apache.hadoop.mapred.OutputCollector<org.apache.hadoop.io.BytesWritable,org.apache.hadoop.io.BytesWritable> output, org.apache.hadoop.mapred.Reporter reporter)
protected void setExceedQuota(boolean exceedQuota)
protected void setHadoopJobClientProvider(HadoopJobClientProvider hadoopJobClientProvider)
protected void setVeniceWriter(AbstractVeniceWriter veniceWriter)
Methods inherited from class com.linkedin.venice.hadoop.task.datawriter.AbstractPartitionWriter
close, extract, getDerivedValueSchemaId, initDuplicateKeyPrinter, isEnableWriteCompute, logMessageProgress, processValuesForKey, recordMessageErrored
Methods inherited from class com.linkedin.venice.hadoop.task.datawriter.AbstractDataWriterTask
configure, getEngineTaskConfigProvider, getPartitionCount, getTaskId, isChunkingEnabled, isRmdChunkingEnabled, setChunkingEnabled
Field Detail
MAP_REDUCE_JOB_ID_PROP
public static final java.lang.String MAP_REDUCE_JOB_ID_PROP
See Also:
Constant Field Values
Method Detail
getJobConf
protected org.apache.hadoop.mapred.JobConf getJobConf()

configure
public void configure(org.apache.hadoop.mapred.JobConf job)
Specified by:
configure in interface org.apache.hadoop.mapred.JobConfigurable

reduce
public void reduce(org.apache.hadoop.io.BytesWritable key, java.util.Iterator<org.apache.hadoop.io.BytesWritable> values, org.apache.hadoop.mapred.OutputCollector<org.apache.hadoop.io.BytesWritable,org.apache.hadoop.io.BytesWritable> output, org.apache.hadoop.mapred.Reporter reporter)
Specified by:
reduce in interface org.apache.hadoop.mapred.Reducer<org.apache.hadoop.io.BytesWritable,org.apache.hadoop.io.BytesWritable,org.apache.hadoop.io.BytesWritable,org.apache.hadoop.io.BytesWritable>
getDataWriterTaskTracker
protected DataWriterTaskTracker getDataWriterTaskTracker()
Overrides:
getDataWriterTaskTracker in class AbstractPartitionWriter

getExceedQuotaFlag
protected boolean getExceedQuotaFlag()
Overrides:
getExceedQuotaFlag in class AbstractPartitionWriter

setVeniceWriter
protected void setVeniceWriter(AbstractVeniceWriter veniceWriter)
Overrides:
setVeniceWriter in class AbstractPartitionWriter

setExceedQuota
protected void setExceedQuota(boolean exceedQuota)
Overrides:
setExceedQuota in class AbstractPartitionWriter

hasReportedFailure
protected boolean hasReportedFailure(DataWriterTaskTracker dataWriterTaskTracker, boolean isDuplicateKeyAllowed)
Overrides:
hasReportedFailure in class AbstractPartitionWriter

getCallback
protected PubSubProducerCallback getCallback()
Overrides:
getCallback in class AbstractPartitionWriter
configureTask
protected void configureTask(VeniceProperties props)
Description copied from class: AbstractDataWriterTask
Allow implementations of this class to configure task-specific stuff.
Overrides:
configureTask in class AbstractPartitionWriter
Parameters:
props - the job props that the task was configured with.
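As a minimal sketch of this hook (not taken from the Venice codebase), a hypothetical subclass could let the parent finish its setup and then read its own property. The class name and property key below are made up, and it is assumed that VeniceProperties (com.linkedin.venice.utils) exposes a getString(key, defaultValue) accessor.

import com.linkedin.venice.hadoop.mapreduce.datawriter.reduce.VeniceReducer;
import com.linkedin.venice.utils.VeniceProperties;

// Hypothetical subclass for illustration only.
public class AuditingVeniceReducer extends VeniceReducer {
  private String auditTag;

  @Override
  protected void configureTask(VeniceProperties props) {
    // Let VeniceReducer / AbstractPartitionWriter wire up the writer and task state first.
    super.configureTask(props);
    // Then pick up an extra, task-specific property (key and default value are made up).
    this.auditTag = props.getString("custom.audit.tag", "none");
  }
}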
getTotalIncomingDataSizeInBytes
protected long getTotalIncomingDataSizeInBytes()
Description copied from class: AbstractPartitionWriter
Return the size of serialized key and serialized value in bytes across the entire dataset. This is an optimization to skip writing the data to Kafka and reduce the load on Kafka and the Venice storage nodes. Not all engines can support fetching this information during the execution of the job (e.g. Spark), but we can live with it for now. The quota is checked again in the Driver after the completion of the DataWriter job, and it will kill the VenicePushJob soon after.
Overrides:
getTotalIncomingDataSizeInBytes in class AbstractPartitionWriter
Returns:
the size of serialized key and serialized value in bytes across the entire dataset
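As an illustration of how this value feeds the quota optimization, the sketch below shows a hypothetical subclass comparing it against a made-up storage quota and flipping the exceed-quota flag via setExceedQuota(boolean). This is not the actual VenicePushJob enforcement logic; the quota constant and the helper method are invented for the example.

import com.linkedin.venice.hadoop.mapreduce.datawriter.reduce.VeniceReducer;

// Hypothetical subclass for illustration only.
public class QuotaAwareVeniceReducer extends VeniceReducer {

  // Made-up quota: 100 GiB of serialized key + value bytes.
  private static final long HYPOTHETICAL_STORAGE_QUOTA_BYTES = 100L * 1024 * 1024 * 1024;

  protected void checkQuotaBeforeWriting() {
    long totalInputSizeInBytes = getTotalIncomingDataSizeInBytes();
    if (totalInputSizeInBytes > HYPOTHETICAL_STORAGE_QUOTA_BYTES) {
      // Stop producing to Kafka for this task; the Driver re-checks the quota after
      // the DataWriter job completes and kills the VenicePushJob if it is exceeded.
      setExceedQuota(true);
    }
  }
}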
setHadoopJobClientProvider
protected void setHadoopJobClientProvider(HadoopJobClientProvider hadoopJobClientProvider)