Class KafkaInputFormat
java.lang.Object
  com.linkedin.venice.hadoop.input.kafka.KafkaInputFormat
All Implemented Interfaces:
  org.apache.hadoop.mapred.InputFormat<KafkaInputMapperKey,KafkaInputMapperValue>
public class KafkaInputFormat extends java.lang.Object implements org.apache.hadoop.mapred.InputFormat<KafkaInputMapperKey,KafkaInputMapperValue>
We borrowed some ideas from the open-sourced attic-crunch library: https://github.com/apache/attic-crunch/blob/master/crunch-kafka/src/main/java/org/apache/crunch/kafka/record/KafkaInputFormat.java

This InputFormat implementation is used to read data off a Kafka topic.
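A minimal usage sketch follows, assuming a classic MapReduce (mapred) driver. The property keys kafka.input.topic and kafka.input.brokers are placeholders for illustration, not the actual configuration keys this class reads.

    import org.apache.hadoop.mapred.InputSplit;
    import org.apache.hadoop.mapred.JobConf;

    import com.linkedin.venice.hadoop.input.kafka.KafkaInputFormat;

    public class KafkaInputFormatUsage {
      public static void main(String[] args) throws Exception {
        JobConf job = new JobConf();
        // Placeholder keys: the real topic/broker properties are defined elsewhere in venice-hadoop.
        job.set("kafka.input.topic", "test_store_v1");
        job.set("kafka.input.brokers", "localhost:9092");
        job.setInputFormat(KafkaInputFormat.class);

        KafkaInputFormat inputFormat = new KafkaInputFormat();
        // numSplits is ignored by this implementation (see getSplits below),
        // so any value can be passed.
        InputSplit[] splits = inputFormat.getSplits(job, 0);
        System.out.println("Generated " + splits.length + " splits");
      }
    }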
Field Summary

Fields:
  static long DEFAULT_KAFKA_INPUT_MAX_RECORDS_PER_MAPPER
    The default max records per mapper; if one topic partition contains more records than this, it will be consumed by multiple mappers in parallel.
Constructor Summary

Constructors:
  KafkaInputFormat()
Method Summary

  protected java.util.Map<org.apache.kafka.common.TopicPartition,java.lang.Long> getLatestOffsets(org.apache.hadoop.mapred.JobConf config)

  org.apache.hadoop.mapred.RecordReader<KafkaInputMapperKey,KafkaInputMapperValue> getRecordReader(org.apache.hadoop.mapred.InputSplit split, org.apache.hadoop.mapred.JobConf job, org.apache.hadoop.mapred.Reporter reporter)

  org.apache.hadoop.mapred.RecordReader<KafkaInputMapperKey,KafkaInputMapperValue> getRecordReader(org.apache.hadoop.mapred.InputSplit split, org.apache.hadoop.mapred.JobConf job, org.apache.hadoop.mapred.Reporter reporter, PubSubConsumerAdapter consumer)

  org.apache.hadoop.mapred.InputSplit[] getSplits(org.apache.hadoop.mapred.JobConf job, int numSplits)
    Split the topic according to the topic partition size and the allowed max records per mapper.

  org.apache.hadoop.mapred.InputSplit[] getSplitsByRecordsPerSplit(org.apache.hadoop.mapred.JobConf job, long maxRecordsPerSplit)
Field Detail
DEFAULT_KAFKA_INPUT_MAX_RECORDS_PER_MAPPER
public static final long DEFAULT_KAFKA_INPUT_MAX_RECORDS_PER_MAPPER
The default max records per mapper; if one topic partition contains more records than this, it will be consumed by multiple mappers in parallel. Note that this count is approximate: it is computed purely from offsets, and if the topic has log compaction enabled, the offset range can exceed the actual number of records.

See Also:
  Constant Field Values
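To illustrate the offset-based math described above, here is a sketch of how a mapper count could be derived for one partition; it mirrors the description, not the class's actual code, and the offsets are made up.

    import com.linkedin.venice.hadoop.input.kafka.KafkaInputFormat;

    public class MapperCountSketch {
      public static void main(String[] args) {
        long startOffset = 0L;            // assumed earliest offset of the partition
        long latestOffset = 12_500_000L;  // made-up example latest offset
        long maxRecordsPerMapper = KafkaInputFormat.DEFAULT_KAFKA_INPUT_MAX_RECORDS_PER_MAPPER;

        long offsetRange = latestOffset - startOffset;
        // Ceiling division: one extra mapper for any remainder.
        long mapperCount = (offsetRange + maxRecordsPerMapper - 1) / maxRecordsPerMapper;
        // With log compaction, offsetRange can exceed the real record count,
        // so mapperCount may overestimate what is actually needed.
        System.out.println(mapperCount + " mappers for this partition");
      }
    }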
Method Detail
getLatestOffsets
protected java.util.Map<org.apache.kafka.common.TopicPartition,java.lang.Long> getLatestOffsets(org.apache.hadoop.mapred.JobConf config)
getSplits
public org.apache.hadoop.mapred.InputSplit[] getSplits(org.apache.hadoop.mapred.JobConf job, int numSplits) throws java.io.IOException
Split the topic according to the topic partition size and the allowed max records per mapper. The numSplits parameter is not used by this implementation.

Specified by:
  getSplits in interface org.apache.hadoop.mapred.InputFormat<KafkaInputMapperKey,KafkaInputMapperValue>

Throws:
  java.io.IOException
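A hedged sketch of the chunking strategy this description implies: each partition's offset range is cut into consecutive slices of at most maxRecordsPerSplit records. OffsetRange is a hypothetical holder type introduced for illustration, not the InputSplit implementation Venice actually returns.

    import java.util.ArrayList;
    import java.util.List;

    public final class SplitSketch {
      // Hypothetical holder for a [startOffset, endOffset) slice of one partition.
      static final class OffsetRange {
        final int partition;
        final long startOffset; // inclusive
        final long endOffset;   // exclusive

        OffsetRange(int partition, long startOffset, long endOffset) {
          this.partition = partition;
          this.startOffset = startOffset;
          this.endOffset = endOffset;
        }
      }

      // Cut one partition's offset range into chunks of at most maxRecordsPerSplit.
      static List<OffsetRange> chunkPartition(int partition, long latestOffset, long maxRecordsPerSplit) {
        List<OffsetRange> ranges = new ArrayList<>();
        for (long start = 0; start < latestOffset; start += maxRecordsPerSplit) {
          long end = Math.min(start + maxRecordsPerSplit, latestOffset);
          ranges.add(new OffsetRange(partition, start, end));
        }
        return ranges;
      }
    }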
getSplitsByRecordsPerSplit
public org.apache.hadoop.mapred.InputSplit[] getSplitsByRecordsPerSplit(org.apache.hadoop.mapred.JobConf job, long maxRecordsPerSplit)
getRecordReader
public org.apache.hadoop.mapred.RecordReader<KafkaInputMapperKey,KafkaInputMapperValue> getRecordReader(org.apache.hadoop.mapred.InputSplit split, org.apache.hadoop.mapred.JobConf job, org.apache.hadoop.mapred.Reporter reporter)
Specified by:
  getRecordReader in interface org.apache.hadoop.mapred.InputFormat<KafkaInputMapperKey,KafkaInputMapperValue>
getRecordReader
public org.apache.hadoop.mapred.RecordReader<KafkaInputMapperKey,KafkaInputMapperValue> getRecordReader(org.apache.hadoop.mapred.InputSplit split, org.apache.hadoop.mapred.JobConf job, org.apache.hadoop.mapred.Reporter reporter, PubSubConsumerAdapter consumer)