Class KafkaInputFormatCombiner
java.lang.Object
com.linkedin.venice.hadoop.task.datawriter.AbstractDataWriterTask
com.linkedin.venice.hadoop.input.kafka.KafkaInputFormatCombiner
- All Implemented Interfaces:
Closeable, AutoCloseable, org.apache.hadoop.io.Closeable, org.apache.hadoop.mapred.JobConfigurable, org.apache.hadoop.mapred.Reducer<org.apache.hadoop.io.BytesWritable, org.apache.hadoop.io.BytesWritable, org.apache.hadoop.io.BytesWritable, org.apache.hadoop.io.BytesWritable>
public class KafkaInputFormatCombiner
extends AbstractDataWriterTask
implements org.apache.hadoop.mapred.Reducer<org.apache.hadoop.io.BytesWritable, org.apache.hadoop.io.BytesWritable, org.apache.hadoop.io.BytesWritable, org.apache.hadoop.io.BytesWritable>, org.apache.hadoop.mapred.JobConfigurable
This class is a Combiner, an MR framework feature that lets a Reducer implementation run within the Mapper task, on the Mapper's output. This leaves the Reducer with less work to do, since part of it was already done in the Mappers. In the case of the KafkaInputFormat, we cannot do all the work that the VeniceKafkaInputReducer does, since that includes producing to Kafka, but the part worth shifting to the Mappers is the compaction. We have observed that when the input partitions are very large, Reducers can run out of memory. By shifting this work to the Mappers, we should be able to remove that bottleneck and scale better, especially in combination with a low value for VenicePushJobConstants.KAFKA_INPUT_MAX_RECORDS_PER_MAPPER.
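For context on how a combiner gets attached to a job in the old mapred API, here is a minimal sketch. It is not the push job's actual configuration code; the class name and the example record cap are hypothetical.

import com.linkedin.venice.hadoop.input.kafka.KafkaInputFormatCombiner;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.mapred.JobConf;

public class CombinerWiringSketch {
  public static void main(String[] args) {
    JobConf conf = new JobConf();

    // The mapper output key/value types must match the combiner's (and reducer's) input types.
    conf.setMapOutputKeyClass(BytesWritable.class);
    conf.setMapOutputValueClass(BytesWritable.class);

    // The combiner runs inside each Mapper task, compacting its output before
    // it is spilled, shuffled, and handed to the Reducers.
    conf.setCombinerClass(KafkaInputFormatCombiner.class);

    // Hypothetical value: a low record cap per mapper spreads the same input
    // across more Mapper tasks, which combines well with the in-mapper compaction.
    // conf.set(VenicePushJobConstants.KAFKA_INPUT_MAX_RECORDS_PER_MAPPER, "5000000");
  }
}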
Field Summary
Fields inherited from class com.linkedin.venice.hadoop.task.datawriter.AbstractDataWriterTask
TASK_ID_NOT_SET
Constructor Summary
Constructors
KafkaInputFormatCombiner()
Method Summary
void close()
void configure(org.apache.hadoop.mapred.JobConf job)
protected void configureTask(VeniceProperties props)
Allow implementations of this class to configure task-specific stuff.
void reduce(org.apache.hadoop.io.BytesWritable key, Iterator<org.apache.hadoop.io.BytesWritable> values, org.apache.hadoop.mapred.OutputCollector<org.apache.hadoop.io.BytesWritable, org.apache.hadoop.io.BytesWritable> output, org.apache.hadoop.mapred.Reporter reporter)
Methods inherited from class com.linkedin.venice.hadoop.task.datawriter.AbstractDataWriterTask
configure, getEngineTaskConfigProvider, getPartitionCount, getTaskId, isChunkingEnabled, isRmdChunkingEnabled, setChunkingEnabled
-
Constructor Details
-
KafkaInputFormatCombiner
public KafkaInputFormatCombiner()
-
Method Details
-
reduce
public void reduce(org.apache.hadoop.io.BytesWritable key, Iterator<org.apache.hadoop.io.BytesWritable> values, org.apache.hadoop.mapred.OutputCollector<org.apache.hadoop.io.BytesWritable, org.apache.hadoop.io.BytesWritable> output, org.apache.hadoop.mapred.Reporter reporter) throws IOException
Specified by:
reduce in interface org.apache.hadoop.mapred.Reducer<org.apache.hadoop.io.BytesWritable, org.apache.hadoop.io.BytesWritable, org.apache.hadoop.io.BytesWritable, org.apache.hadoop.io.BytesWritable>
Throws:
IOException
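The actual compaction logic operates on opaque BytesWritable payloads and is specific to the Kafka input format. Purely as an illustration of the general contract a combiner's reduce follows (using Text/LongWritable rather than Venice's types, with a hypothetical class name), one might keep only the largest value per key so the Reducers receive a single record per key:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class KeepLargestCombinerSketch extends MapReduceBase
    implements Reducer<Text, LongWritable, Text, LongWritable> {

  @Override
  public void reduce(Text key, Iterator<LongWritable> values,
      OutputCollector<Text, LongWritable> output, Reporter reporter) throws IOException {
    // Collapse all values for this key down to one record, so the Reducer
    // reads far fewer records than the Mapper emitted.
    long max = Long.MIN_VALUE;
    while (values.hasNext()) {
      max = Math.max(max, values.next().get());
    }
    output.collect(key, new LongWritable(max));
  }
}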
-
close
public void close()
Specified by:
close in interface AutoCloseable
Specified by:
close in interface Closeable
-
configureTask
protected void configureTask(VeniceProperties props)
Description copied from class: AbstractDataWriterTask
Allow implementations of this class to configure task-specific stuff.
Specified by:
configureTask in class AbstractDataWriterTask
Parameters:
props - the job props that the task was configured with.
-
configure
public void configure(org.apache.hadoop.mapred.JobConf job)
Specified by:
configure in interface org.apache.hadoop.mapred.JobConfigurable
-