Class VeniceReducer

  • All Implemented Interfaces:
    java.io.Closeable, java.lang.AutoCloseable, org.apache.hadoop.io.Closeable, org.apache.hadoop.mapred.JobConfigurable, org.apache.hadoop.mapred.Reducer<org.apache.hadoop.io.BytesWritable, org.apache.hadoop.io.BytesWritable, org.apache.hadoop.io.BytesWritable, org.apache.hadoop.io.BytesWritable>

  • Direct Known Subclasses:
    VeniceKafkaInputReducer

    public class VeniceReducer
    extends AbstractPartitionWriter
    implements org.apache.hadoop.mapred.Reducer<org.apache.hadoop.io.BytesWritable, org.apache.hadoop.io.BytesWritable, org.apache.hadoop.io.BytesWritable, org.apache.hadoop.io.BytesWritable>
    VeniceReducer is in charge of producing the messages to the Kafka brokers. Since VeniceMRPartitioner uses the same logic as DefaultVenicePartitioner, all the messages in the same reducer belong to the same topic partition. The reason to introduce a reduce phase is that BDB-JE benefits from sorted input in the following ways: 1. BDB-JE won't generate as many BINDeltas, since it won't touch many BINs at a time; 2. The overall BDB-JE insert rate improves significantly, since disk usage is reduced (each BINDelta is much smaller than before).
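    As an illustration of why the shuffle delivers partition-aligned input, here is a minimal sketch. The hash below is purely hypothetical (Venice's DefaultVenicePartitioner actually uses an MD5-based hash); the point is only that a deterministic partition function routes every record for a given Kafka partition to a single reducer:

    ```java
    import java.util.Arrays;

    // Illustrative mod-hash partitioner -- NOT Venice's actual algorithm
    // (DefaultVenicePartitioner uses an MD5-based hash). Shown only to
    // demonstrate that a deterministic partition function groups all
    // records for one Kafka partition into one reducer.
    public class PartitionSketch {
        static int partitionFor(byte[] key, int numPartitions) {
            // Deterministic hash over the key bytes, folded into [0, numPartitions).
            return Math.abs(Arrays.hashCode(key) % numPartitions);
        }

        public static void main(String[] args) {
            byte[] key = "user-42".getBytes();
            int numPartitions = 16;
            // The same key always maps to the same partition, so when the
            // MapReduce shuffle uses the same function as the Kafka writer,
            // one reducer sees every record for its partition, already
            // sorted by key.
            int p1 = partitionFor(key, numPartitions);
            int p2 = partitionFor(key, numPartitions);
            System.out.println(p1 == p2); // deterministic: prints "true"
        }
    }
    ```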
    • Field Detail

      • MAP_REDUCE_JOB_ID_PROP

        public static final java.lang.String MAP_REDUCE_JOB_ID_PROP
        See Also:
        Constant Field Values
    • Constructor Detail

      • VeniceReducer

        public VeniceReducer()
    • Method Detail

      • getJobConf

        protected org.apache.hadoop.mapred.JobConf getJobConf()
      • configure

        public void configure(org.apache.hadoop.mapred.JobConf job)
        Specified by:
        configure in interface org.apache.hadoop.mapred.JobConfigurable
      • reduce

        public void reduce(org.apache.hadoop.io.BytesWritable key,
                           java.util.Iterator<org.apache.hadoop.io.BytesWritable> values,
                           org.apache.hadoop.mapred.OutputCollector<org.apache.hadoop.io.BytesWritable, org.apache.hadoop.io.BytesWritable> output,
                           org.apache.hadoop.mapred.Reporter reporter)
        Specified by:
        reduce in interface org.apache.hadoop.mapred.Reducer<org.apache.hadoop.io.BytesWritable, org.apache.hadoop.io.BytesWritable, org.apache.hadoop.io.BytesWritable, org.apache.hadoop.io.BytesWritable>
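        The reduce contract above can be sketched without the Hadoop runtime: for each key, the framework hands the reducer the key plus an iterator over all serialized values shuffled to it, and the reducer forwards each pair downstream (in VeniceReducer's case, as Kafka produce requests). A minimal stand-in in plain Java, where `produce` is a hypothetical placeholder for the actual Kafka writer:

        ```java
        import java.util.ArrayList;
        import java.util.Iterator;
        import java.util.List;

        public class ReduceSketch {
            // Hypothetical stand-in for the Kafka produce path of the real reducer.
            static final List<String> producedLog = new ArrayList<>();

            static void produce(byte[] key, byte[] value) {
                producedLog.add(new String(key) + "=" + new String(value));
            }

            // Mirrors the shape of Reducer.reduce(key, values, output, reporter):
            // drain every value for this key and emit one produce request each.
            static void reduce(byte[] key, Iterator<byte[]> values) {
                while (values.hasNext()) {
                    produce(key, values.next());
                }
            }

            public static void main(String[] args) {
                List<byte[]> values = List.of("v1".getBytes(), "v2".getBytes());
                reduce("k".getBytes(), values.iterator());
                System.out.println(producedLog); // [k=v1, k=v2]
            }
        }
        ```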
      • getTotalIncomingDataSizeInBytes

        protected long getTotalIncomingDataSizeInBytes()
        Description copied from class: AbstractPartitionWriter
        Return the size of the serialized keys and serialized values in bytes across the entire dataset. This is an optimization to skip writing the data to Kafka, reducing the load on Kafka and the Venice storage nodes when the push would exceed quota anyway. Not all engines can supply this information during job execution (e.g., Spark), but we can live with that for now. The quota is checked again in the Driver after the DataWriter job completes, and a violation will kill the VenicePushJob soon after.
        Overrides:
        getTotalIncomingDataSizeInBytes in class AbstractPartitionWriter
        Returns:
        the size of serialized key and serialized value in bytes across the entire dataset
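        The quota optimization described above amounts to a single comparison made before any data is produced. A hedged sketch (the helper name and signature are hypothetical, not the actual AbstractPartitionWriter logic):

        ```java
        public class QuotaCheckSketch {
            // Hypothetical helper: returns true when the total serialized
            // key+value size fits within the store's storage quota, so the
            // job can skip producing doomed data to Kafka. Per the docs,
            // the real check is repeated in the Driver after the DataWriter
            // job completes.
            static boolean withinQuota(long totalIncomingDataSizeInBytes, long storageQuotaInBytes) {
                return totalIncomingDataSizeInBytes <= storageQuotaInBytes;
            }

            public static void main(String[] args) {
                long incoming = 5L * 1024 * 1024 * 1024;  // 5 GiB of serialized records
                long quota = 4L * 1024 * 1024 * 1024;     // 4 GiB store quota
                System.out.println(withinQuota(incoming, quota)); // prints "false"
            }
        }
        ```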
      • setHadoopJobClientProvider

        protected void setHadoopJobClientProvider(HadoopJobClientProvider hadoopJobClientProvider)