Class KafkaInputFormat

java.lang.Object
  com.linkedin.venice.hadoop.input.kafka.KafkaInputFormat

All Implemented Interfaces:
org.apache.hadoop.mapred.InputFormat<KafkaInputMapperKey,KafkaInputMapperValue>

public class KafkaInputFormat
extends Object
implements org.apache.hadoop.mapred.InputFormat<KafkaInputMapperKey,KafkaInputMapperValue>
We borrowed some ideas from the open-sourced attic-crunch library:
https://github.com/apache/attic-crunch/blob/master/crunch-kafka/src/main/java/org/apache/crunch/kafka/record/KafkaInputFormat.java

This InputFormat implementation is used to read data off a Kafka topic.
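For orientation, a minimal sketch of wiring this InputFormat into a classic MapRed job. The Kafka topic/broker properties that KafkaInputFormat reads from the JobConf are assumed to be configured elsewhere; their exact property keys are not listed on this page.

import org.apache.hadoop.mapred.JobConf;

// Minimal sketch: register KafkaInputFormat on an old-API (mapred) job.
JobConf job = new JobConf();
job.setJobName("kafka-input-example");
job.setInputFormat(KafkaInputFormat.class);
// The Kafka topic/broker settings this format reads from the JobConf are
// assumed to be set elsewhere; their property keys are not documented here.
// ... set mapper class, output format, etc., then submit the job.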
Field Summary

Modifier and Type: static final long
Field: DEFAULT_KAFKA_INPUT_MAX_RECORDS_PER_MAPPER
Description: The default maximum number of records per mapper; if a topic partition contains more records than this, it will be consumed by multiple mappers in parallel.
Constructor Summary

Constructor: KafkaInputFormat()
Method Summary

getLatestOffsets(org.apache.hadoop.mapred.JobConf config)

org.apache.hadoop.mapred.RecordReader<KafkaInputMapperKey,KafkaInputMapperValue>
getRecordReader(org.apache.hadoop.mapred.InputSplit split, org.apache.hadoop.mapred.JobConf job, org.apache.hadoop.mapred.Reporter reporter)

org.apache.hadoop.mapred.RecordReader<KafkaInputMapperKey,KafkaInputMapperValue>
getRecordReader(org.apache.hadoop.mapred.InputSplit split, org.apache.hadoop.mapred.JobConf job, org.apache.hadoop.mapred.Reporter reporter, PubSubConsumerAdapter consumer)

org.apache.hadoop.mapred.InputSplit[]
getSplits(org.apache.hadoop.mapred.JobConf job, int numSplits)
Split the topic according to the topic partition size and the allowed max records per mapper.

org.apache.hadoop.mapred.InputSplit[]
getSplitsByRecordsPerSplit(org.apache.hadoop.mapred.JobConf job, long maxRecordsPerSplit)
Field Details

DEFAULT_KAFKA_INPUT_MAX_RECORDS_PER_MAPPER

public static final long DEFAULT_KAFKA_INPUT_MAX_RECORDS_PER_MAPPER

The default maximum number of records per mapper; if a topic partition contains more records than this, it will be consumed by multiple mappers in parallel. Note that this calculation is not exact, since it is purely based on offsets, and the topic being consumed could have log compaction enabled, in which case the offset range overstates the actual record count.

See Also:
Constant Field Values
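As a worked illustration of that caveat, a hypothetical helper (the method and its inputs are illustrative, not part of the actual implementation):

// Hypothetical illustration: mapper count derived from an offset range.
static long mapperCountFor(long earliestOffset, long latestOffset) {
  // The offset range is treated as the record count, which overstates
  // reality when log compaction has removed records in between.
  long offsetRange = latestOffset - earliestOffset;
  long max = KafkaInputFormat.DEFAULT_KAFKA_INPUT_MAX_RECORDS_PER_MAPPER;
  // Ceiling division: any partial remainder gets one more mapper.
  return (offsetRange + max - 1) / max;
}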
Constructor Details

KafkaInputFormat

public KafkaInputFormat()

Method Details
getLatestOffsets

getLatestOffsets(org.apache.hadoop.mapred.JobConf config)
getSplits

public org.apache.hadoop.mapred.InputSplit[] getSplits(org.apache.hadoop.mapred.JobConf job, int numSplits) throws IOException

Split the topic according to the topic partition size and the allowed max records per mapper. numSplits is not being used in this function.

Specified by:
getSplits in interface org.apache.hadoop.mapred.InputFormat<KafkaInputMapperKey,KafkaInputMapperValue>
Throws:
IOException
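A minimal usage sketch with a hypothetical helper; the JobConf is assumed to already carry the Kafka settings this format expects, and since numSplits is ignored, any value can be passed:

import java.io.IOException;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;

// Hypothetical helper: ask KafkaInputFormat to plan its splits.
// The numSplits hint is ignored by this implementation, so 0 is fine.
void printSplitPlan(JobConf job) throws IOException {
  InputSplit[] splits = new KafkaInputFormat().getSplits(job, 0);
  System.out.println("Planned " + splits.length + " splits");
}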
getSplitsByRecordsPerSplit

public org.apache.hadoop.mapred.InputSplit[] getSplitsByRecordsPerSplit(org.apache.hadoop.mapred.JobConf job, long maxRecordsPerSplit)
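Judging from the signature, this variant lets the caller pick a tighter cap than DEFAULT_KAFKA_INPUT_MAX_RECORDS_PER_MAPPER, producing more and smaller splits for the same topic; a sketch, reusing the JobConf from the earlier examples:

// Sketch: request finer-grained splits than the default cap would produce;
// "job" is a JobConf configured as in the earlier sketches.
InputSplit[] fineSplits = new KafkaInputFormat().getSplitsByRecordsPerSplit(job, 1_000_000L);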
getRecordReader

public org.apache.hadoop.mapred.RecordReader<KafkaInputMapperKey,KafkaInputMapperValue> getRecordReader(org.apache.hadoop.mapred.InputSplit split, org.apache.hadoop.mapred.JobConf job, org.apache.hadoop.mapred.Reporter reporter)

Specified by:
getRecordReader in interface org.apache.hadoop.mapred.InputFormat<KafkaInputMapperKey,KafkaInputMapperValue>
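For reference, the standard old-API read loop that drives any such RecordReader; a sketch assuming a split and JobConf obtained as above (Reporter.NULL is Hadoop's stock no-op reporter):

import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

// Standard org.apache.hadoop.mapred consumption pattern for a RecordReader;
// "split" and "job" are assumed to come from the earlier getSplits sketch.
RecordReader<KafkaInputMapperKey, KafkaInputMapperValue> reader =
    new KafkaInputFormat().getRecordReader(split, job, Reporter.NULL);
KafkaInputMapperKey key = reader.createKey();
KafkaInputMapperValue value = reader.createValue();
while (reader.next(key, value)) {
  // process one Kafka record's key/value pair here
}
reader.close();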
getRecordReader

public org.apache.hadoop.mapred.RecordReader<KafkaInputMapperKey,KafkaInputMapperValue> getRecordReader(org.apache.hadoop.mapred.InputSplit split, org.apache.hadoop.mapred.JobConf job, org.apache.hadoop.mapred.Reporter reporter, PubSubConsumerAdapter consumer)
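Judging from the signature, this overload appears to exist so the caller can inject a specific PubSubConsumerAdapter (for example, a shared or mock consumer in tests) instead of letting the format construct its own.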