Class KafkaInputFormat

java.lang.Object
com.linkedin.venice.hadoop.input.kafka.KafkaInputFormat
All Implemented Interfaces:
org.apache.hadoop.mapred.InputFormat<KafkaInputMapperKey,KafkaInputMapperValue>

public class KafkaInputFormat extends Object implements org.apache.hadoop.mapred.InputFormat<KafkaInputMapperKey,KafkaInputMapperValue>
We borrowed some idea from the open-sourced attic-crunch lib: https://github.com/apache/attic-crunch/blob/master/crunch-kafka/src/main/java/org/apache/crunch/kafka/record/KafkaInputFormat.java This InputFormat implementation is used to read data off a Kafka topic.
  • Field Details

    • DEFAULT_KAFKA_INPUT_MAX_RECORDS_PER_MAPPER

      public static final long DEFAULT_KAFKA_INPUT_MAX_RECORDS_PER_MAPPER
      The default max records per mapper, and if there are more records in one topic partition, it will be consumed by multiple mappers in parallel. BTW, this calculation is not accurate since it is purely based on offset, and the topic being consumed could have log compaction enabled.
      See Also:
  • Constructor Details

    • KafkaInputFormat

      public KafkaInputFormat()
  • Method Details

    • getLatestOffsets

      protected Map<org.apache.kafka.common.TopicPartition,Long> getLatestOffsets(org.apache.hadoop.mapred.JobConf config)
    • getSplits

      public org.apache.hadoop.mapred.InputSplit[] getSplits(org.apache.hadoop.mapred.JobConf job, int numSplits) throws IOException
      Split the topic according to the topic partition size and the allowed max record per mapper. is not being used in this function.
      Specified by:
      getSplits in interface org.apache.hadoop.mapred.InputFormat<KafkaInputMapperKey,KafkaInputMapperValue>
      Throws:
      IOException
    • getSplitsByRecordsPerSplit

      public org.apache.hadoop.mapred.InputSplit[] getSplitsByRecordsPerSplit(org.apache.hadoop.mapred.JobConf job, long maxRecordsPerSplit)
    • getRecordReader

      public org.apache.hadoop.mapred.RecordReader<KafkaInputMapperKey,KafkaInputMapperValue> getRecordReader(org.apache.hadoop.mapred.InputSplit split, org.apache.hadoop.mapred.JobConf job, org.apache.hadoop.mapred.Reporter reporter)
      Specified by:
      getRecordReader in interface org.apache.hadoop.mapred.InputFormat<KafkaInputMapperKey,KafkaInputMapperValue>
    • getRecordReader

      public org.apache.hadoop.mapred.RecordReader<KafkaInputMapperKey,KafkaInputMapperValue> getRecordReader(org.apache.hadoop.mapred.InputSplit split, org.apache.hadoop.mapred.JobConf job, org.apache.hadoop.mapred.Reporter reporter, PubSubConsumerAdapter consumer)