Class KafkaInputFormatCombiner

java.lang.Object
  com.linkedin.venice.hadoop.task.datawriter.AbstractDataWriterTask
    com.linkedin.venice.hadoop.input.kafka.KafkaInputFormatCombiner
All Implemented Interfaces:
  Closeable, AutoCloseable, org.apache.hadoop.io.Closeable, org.apache.hadoop.mapred.JobConfigurable, org.apache.hadoop.mapred.Reducer<org.apache.hadoop.io.BytesWritable, org.apache.hadoop.io.BytesWritable, org.apache.hadoop.io.BytesWritable, org.apache.hadoop.io.BytesWritable>
public class KafkaInputFormatCombiner
extends AbstractDataWriterTask
implements org.apache.hadoop.mapred.Reducer<org.apache.hadoop.io.BytesWritable,org.apache.hadoop.io.BytesWritable,org.apache.hadoop.io.BytesWritable,org.apache.hadoop.io.BytesWritable>, org.apache.hadoop.mapred.JobConfigurable
This class is a Combiner: a facility of the MapReduce framework whereby a Reducer implementation can be plugged in to run within the Mapper task, on its output. This leaves the Reducer with less work to do, since part of it has already been done in the Mappers. In the case of the KafkaInputFormat, we cannot do all of the work that the VeniceKafkaInputReducer does, since that includes producing to Kafka, but the part which is worth shifting to the Mappers is the compaction. We have observed that when the input partitions are very large, Reducers can run out of memory. By shifting this work to the Mappers, we should be able to remove this bottleneck and scale better, especially when used in combination with a low value for VenicePushJobConstants.KAFKA_INPUT_MAX_RECORDS_PER_MAPPER.
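To make the mechanics concrete, the following is a minimal sketch of how a combiner is wired into a job using the old (org.apache.hadoop.mapred) API. The job setup shown here is an assumption for illustration, not the actual VenicePushJob configuration code.

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.mapred.JobConf;

import com.linkedin.venice.hadoop.input.kafka.KafkaInputFormatCombiner;

// Hypothetical wiring sketch: shows where a combiner sits in the old
// MapReduce API, not Venice's real job setup.
public class CombinerWiringSketch {
  public static void wireCombiner(JobConf jobConf) {
    // The combiner runs inside each Mapper task, on the map output,
    // before the shuffle; the framework may invoke it zero or more times.
    jobConf.setCombinerClass(KafkaInputFormatCombiner.class);
    // A combiner's input and output types must match the map output types.
    jobConf.setMapOutputKeyClass(BytesWritable.class);
    jobConf.setMapOutputValueClass(BytesWritable.class);
  }
}

Because the framework treats a combiner as an optional optimization, it must only shrink the data volume; it cannot perform work the Reducer depends on unconditionally, which is why the Kafka produce step stays in VeniceKafkaInputReducer.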
Field Summary

Fields inherited from class com.linkedin.venice.hadoop.task.datawriter.AbstractDataWriterTask:
  TASK_ID_NOT_SET
Constructor Summary

  KafkaInputFormatCombiner()
Method Summary

void close()
void configure(org.apache.hadoop.mapred.JobConf job)
protected void configureTask(VeniceProperties props)
  Allow implementations of this class to configure task-specific settings.
void reduce(org.apache.hadoop.io.BytesWritable key, Iterator<org.apache.hadoop.io.BytesWritable> values, org.apache.hadoop.mapred.OutputCollector<org.apache.hadoop.io.BytesWritable, org.apache.hadoop.io.BytesWritable> output, org.apache.hadoop.mapred.Reporter reporter)

Methods inherited from class com.linkedin.venice.hadoop.task.datawriter.AbstractDataWriterTask:
  configure, getEngineTaskConfigProvider, getPartitionCount, getTaskId, isChunkingEnabled, isRmdChunkingEnabled, setChunkingEnabled
Constructor Details

KafkaInputFormatCombiner
public KafkaInputFormatCombiner()

Method Details

reduce
public void reduce(org.apache.hadoop.io.BytesWritable key, Iterator<org.apache.hadoop.io.BytesWritable> values, org.apache.hadoop.mapred.OutputCollector<org.apache.hadoop.io.BytesWritable, org.apache.hadoop.io.BytesWritable> output, org.apache.hadoop.mapred.Reporter reporter) throws IOException
Specified by:
  reduce in interface org.apache.hadoop.mapred.Reducer<org.apache.hadoop.io.BytesWritable, org.apache.hadoop.io.BytesWritable, org.apache.hadoop.io.BytesWritable, org.apache.hadoop.io.BytesWritable>
Throws:
  IOException
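The sketch below illustrates the compaction idea behind this method: emit a single surviving value per key so that less data crosses the shuffle and reaches the Reducer. Keeping the last value iterated is an assumption made here for brevity; the real implementation would choose the winner based on ordering information carried inside the serialized values, such as the Kafka offset.

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical compacting combiner, for illustration only.
public class CompactingCombinerSketch extends MapReduceBase
    implements Reducer<BytesWritable, BytesWritable, BytesWritable, BytesWritable> {
  @Override
  public void reduce(
      BytesWritable key,
      Iterator<BytesWritable> values,
      OutputCollector<BytesWritable, BytesWritable> output,
      Reporter reporter) throws IOException {
    BytesWritable winner = null;
    while (values.hasNext()) {
      // Copy the bytes: the framework reuses the same Writable instance
      // across iterations, so holding a bare reference is not safe.
      winner = new BytesWritable(values.next().copyBytes());
    }
    if (winner != null) {
      // One value per key survives, shrinking the Reducer's input.
      output.collect(key, winner);
    }
  }
}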

close
public void close()
Specified by:
  close in interface AutoCloseable
  close in interface Closeable

configureTask
protected void configureTask(VeniceProperties props)
Description copied from class: AbstractDataWriterTask
Allow implementations of this class to configure task-specific settings.
Specified by:
  configureTask in class AbstractDataWriterTask
Parameters:
  props - the job props that the task was configured with.

configure
public void configure(org.apache.hadoop.mapred.JobConf job)
Specified by:
  configure in interface org.apache.hadoop.mapred.JobConfigurable