Class KafkaInputFormat

  • All Implemented Interfaces:
    org.apache.hadoop.mapred.InputFormat<KafkaInputMapperKey,​KafkaInputMapperValue>

    public class KafkaInputFormat
    extends java.lang.Object
    implements org.apache.hadoop.mapred.InputFormat<KafkaInputMapperKey,​KafkaInputMapperValue>
    This InputFormat implementation is used to read data off a Kafka topic. It borrows ideas from the open-source attic-crunch library: https://github.com/apache/attic-crunch/blob/master/crunch-kafka/src/main/java/org/apache/crunch/kafka/record/KafkaInputFormat.java
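    Below is a minimal driver-side sketch of how this InputFormat could be wired into a Hadoop mapred job. It is illustrative only: the Kafka topic/broker configuration keys are assumed to be set on the JobConf elsewhere, and the Venice classes (KafkaInputFormat, KafkaInputMapperKey, KafkaInputMapperValue) are assumed to be in scope.

        import org.apache.hadoop.mapred.InputSplit;
        import org.apache.hadoop.mapred.JobConf;
        import org.apache.hadoop.mapred.RecordReader;
        import org.apache.hadoop.mapred.Reporter;

        public class KafkaInputFormatUsageSketch {
          public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf();
            // Topic, broker and offset-related properties are assumed to be set on the
            // JobConf by the surrounding job; the exact keys are not covered on this page.
            conf.setInputFormat(KafkaInputFormat.class);

            KafkaInputFormat inputFormat = new KafkaInputFormat();
            // The numSplits hint is ignored by this implementation (see getSplits below).
            InputSplit[] splits = inputFormat.getSplits(conf, 1);

            for (InputSplit split : splits) {
              RecordReader<KafkaInputMapperKey, KafkaInputMapperValue> reader =
                  inputFormat.getRecordReader(split, conf, Reporter.NULL);
              KafkaInputMapperKey key = reader.createKey();
              KafkaInputMapperValue value = reader.createValue();
              while (reader.next(key, value)) {
                // Process one Kafka record's key/value pair here.
              }
              reader.close();
            }
          }
        }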
    • Field Summary

      Fields 
      Modifier and Type Field Description
      static long DEFAULT_KAFKA_INPUT_MAX_RECORDS_PER_MAPPER
      The default maximum number of records per mapper; if a topic partition contains more records than this, it will be consumed by multiple mappers in parallel.
    • Method Summary

      Modifier and Type Method Description
      protected java.util.Map<org.apache.kafka.common.TopicPartition,​java.lang.Long> getLatestOffsets​(org.apache.hadoop.mapred.JobConf config)  
      org.apache.hadoop.mapred.RecordReader<KafkaInputMapperKey,​KafkaInputMapperValue> getRecordReader​(org.apache.hadoop.mapred.InputSplit split, org.apache.hadoop.mapred.JobConf job, org.apache.hadoop.mapred.Reporter reporter)  
      org.apache.hadoop.mapred.RecordReader<KafkaInputMapperKey,​KafkaInputMapperValue> getRecordReader​(org.apache.hadoop.mapred.InputSplit split, org.apache.hadoop.mapred.JobConf job, org.apache.hadoop.mapred.Reporter reporter, PubSubConsumerAdapter consumer)  
      org.apache.hadoop.mapred.InputSplit[] getSplits​(org.apache.hadoop.mapred.JobConf job, int numSplits)
      Splits the topic according to the topic partition sizes and the allowed maximum number of records per mapper.
      org.apache.hadoop.mapred.InputSplit[] getSplitsByRecordsPerSplit​(org.apache.hadoop.mapred.JobConf job, long maxRecordsPerSplit)  
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • DEFAULT_KAFKA_INPUT_MAX_RECORDS_PER_MAPPER

        public static final long DEFAULT_KAFKA_INPUT_MAX_RECORDS_PER_MAPPER
        The default maximum number of records per mapper; if a topic partition contains more records than this, it will be consumed by multiple mappers in parallel. Note that this calculation is only an estimate, since it is based purely on offsets and the consumed topic may have log compaction enabled.
        See Also:
        Constant Field Values
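        To make the offset-based estimate concrete, here is a small illustrative sketch (not the class's actual split logic) of how many mappers a single partition would fan out to; estimateMappersForPartition is a hypothetical helper.

          // Illustrative only: an offset-based estimate of mappers per partition.
          // With log compaction, (endOffset - startOffset) can overcount the actual
          // record count, so this is an upper bound rather than an exact figure.
          static long estimateMappersForPartition(long startOffset, long endOffset, long maxRecordsPerMapper) {
            long offsetSpan = Math.max(0L, endOffset - startOffset);
            return (offsetSpan + maxRecordsPerMapper - 1) / maxRecordsPerMapper; // ceiling division
          }

          // Example: a partition spanning 25,000,000 offsets with a 5,000,000-record
          // cap per mapper would be consumed by 5 mappers in parallel.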
    • Constructor Detail

      • KafkaInputFormat

        public KafkaInputFormat()
    • Method Detail

      • getLatestOffsets

        protected java.util.Map<org.apache.kafka.common.TopicPartition,​java.lang.Long> getLatestOffsets​(org.apache.hadoop.mapred.JobConf config)
      • getSplits

        public org.apache.hadoop.mapred.InputSplit[] getSplits​(org.apache.hadoop.mapred.JobConf job,
                                                               int numSplits)
                                                        throws java.io.IOException
        Splits the topic according to the topic partition sizes and the allowed maximum number of records per mapper. The numSplits parameter is not used by this implementation.
        Specified by:
        getSplits in interface org.apache.hadoop.mapred.InputFormat<KafkaInputMapperKey,​KafkaInputMapperValue>
        Throws:
        java.io.IOException
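        Because the numSplits hint is ignored, the value passed in should not affect the result; a hedged sketch of that behavior (conf assumed to be a fully configured JobConf, as in the usage sketch above):

          // Sketch: both calls are expected to yield the same splits for the same JobConf,
          // since the split count is driven by partition offsets and the
          // max-records-per-mapper setting rather than by the numSplits hint.
          InputSplit[] splitsHintOne = new KafkaInputFormat().getSplits(conf, 1);
          InputSplit[] splitsHintMany = new KafkaInputFormat().getSplits(conf, 100);
          assert splitsHintOne.length == splitsHintMany.length;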
      • getSplitsByRecordsPerSplit

        public org.apache.hadoop.mapred.InputSplit[] getSplitsByRecordsPerSplit​(org.apache.hadoop.mapred.JobConf job,
                                                                                long maxRecordsPerSplit)