Class KafkaInputFormat

  • All Implemented Interfaces:
    org.apache.hadoop.mapred.InputFormat<KafkaInputMapperKey,​KafkaInputMapperValue>

    public class KafkaInputFormat
    extends java.lang.Object
    implements org.apache.hadoop.mapred.InputFormat<KafkaInputMapperKey,​KafkaInputMapperValue>
    This InputFormat implementation is used to read data off a Kafka topic. It borrows ideas from the open-source attic-crunch library: https://github.com/apache/attic-crunch/blob/master/crunch-kafka/src/main/java/org/apache/crunch/kafka/record/KafkaInputFormat.java
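    Below is a minimal driver-side sketch of how this InputFormat could be wired into a Hadoop mapred job. It is illustrative only: the Kafka topic/broker configuration keys are assumed to be set on the JobConf elsewhere, and the Venice classes (KafkaInputFormat, KafkaInputMapperKey, KafkaInputMapperValue) are assumed to be in scope.

        import org.apache.hadoop.mapred.InputSplit;
        import org.apache.hadoop.mapred.JobConf;
        import org.apache.hadoop.mapred.RecordReader;
        import org.apache.hadoop.mapred.Reporter;

        public class KafkaInputFormatUsageSketch {
          public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf();
            // Topic, broker and offset-related properties are assumed to be set on the
            // JobConf by the surrounding job; the exact keys are not covered on this page.
            conf.setInputFormat(KafkaInputFormat.class);

            KafkaInputFormat inputFormat = new KafkaInputFormat();
            // The numSplits hint is ignored by this implementation (see getSplits below).
            InputSplit[] splits = inputFormat.getSplits(conf, 1);

            for (InputSplit split : splits) {
              RecordReader<KafkaInputMapperKey, KafkaInputMapperValue> reader =
                  inputFormat.getRecordReader(split, conf, Reporter.NULL);
              KafkaInputMapperKey key = reader.createKey();
              KafkaInputMapperValue value = reader.createValue();
              while (reader.next(key, value)) {
                // Process one Kafka record's key/value pair here.
              }
              reader.close();
            }
          }
        }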
    • Field Summary

      Fields 
      Modifier and Type Field Description
      static long DEFAULT_KAFKA_INPUT_MAX_RECORDS_PER_MAPPER
      The default maximum number of records per mapper; if a topic partition contains more records than this, it will be consumed by multiple mappers in parallel.
    • Method Summary

      Modifier and Type Method Description
      protected java.util.Map<org.apache.kafka.common.TopicPartition,​java.lang.Long> getLatestOffsets​(org.apache.hadoop.mapred.JobConf config)  
      org.apache.hadoop.mapred.RecordReader<KafkaInputMapperKey,​KafkaInputMapperValue> getRecordReader​(org.apache.hadoop.mapred.InputSplit split, org.apache.hadoop.mapred.JobConf job, org.apache.hadoop.mapred.Reporter reporter)  
      org.apache.hadoop.mapred.RecordReader<KafkaInputMapperKey,​KafkaInputMapperValue> getRecordReader​(org.apache.hadoop.mapred.InputSplit split, org.apache.hadoop.mapred.JobConf job, org.apache.hadoop.mapred.Reporter reporter, PubSubConsumerAdapter consumer)  
      org.apache.hadoop.mapred.InputSplit[] getSplits​(org.apache.hadoop.mapred.JobConf job, int numSplits)
      Splits the topic according to the topic partition sizes and the allowed maximum number of records per mapper.
      org.apache.hadoop.mapred.InputSplit[] getSplitsByRecordsPerSplit​(org.apache.hadoop.mapred.JobConf job, long maxRecordsPerSplit)  
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • DEFAULT_KAFKA_INPUT_MAX_RECORDS_PER_MAPPER

        public static final long DEFAULT_KAFKA_INPUT_MAX_RECORDS_PER_MAPPER
        The default maximum number of records per mapper; if a topic partition contains more records than this, it will be consumed by multiple mappers in parallel. Note that this calculation is only an estimate, since it is based purely on offsets and the consumed topic may have log compaction enabled.
        See Also:
        Constant Field Values
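        To make the offset-based estimate concrete, here is a small illustrative sketch (not the class's actual split logic) of how many mappers a single partition would fan out to; estimateMappersForPartition is a hypothetical helper.

          // Illustrative only: an offset-based estimate of mappers per partition.
          // With log compaction, (endOffset - startOffset) can overcount the actual
          // record count, so this is an upper bound rather than an exact figure.
          static long estimateMappersForPartition(long startOffset, long endOffset, long maxRecordsPerMapper) {
            long offsetSpan = Math.max(0L, endOffset - startOffset);
            return (offsetSpan + maxRecordsPerMapper - 1) / maxRecordsPerMapper; // ceiling division
          }

          // Example: a partition spanning 25,000,000 offsets with a 5,000,000-record
          // cap per mapper would be consumed by 5 mappers in parallel.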
    • Constructor Detail

      • KafkaInputFormat

        public KafkaInputFormat()
    • Method Detail

      • getLatestOffsets

        protected java.util.Map<org.apache.kafka.common.TopicPartition,​java.lang.Long> getLatestOffsets​(org.apache.hadoop.mapred.JobConf config)
      • getSplits

        public org.apache.hadoop.mapred.InputSplit[] getSplits​(org.apache.hadoop.mapred.JobConf job,
                                                               int numSplits)
                                                        throws java.io.IOException
        Splits the topic according to the topic partition sizes and the allowed maximum number of records per mapper. The numSplits parameter is not used by this implementation.
        Specified by:
        getSplits in interface org.apache.hadoop.mapred.InputFormat<KafkaInputMapperKey,​KafkaInputMapperValue>
        Throws:
        java.io.IOException
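        Because the numSplits hint is ignored, the value passed in should not affect the result; a hedged sketch of that behavior (conf assumed to be a fully configured JobConf, as in the usage sketch above):

          // Sketch: both calls are expected to yield the same splits for the same JobConf,
          // since the split count is driven by partition offsets and the
          // max-records-per-mapper setting rather than by the numSplits hint.
          InputSplit[] splitsHintOne = new KafkaInputFormat().getSplits(conf, 1);
          InputSplit[] splitsHintMany = new KafkaInputFormat().getSplits(conf, 100);
          assert splitsHintOne.length == splitsHintMany.length;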
      • getSplitsByRecordsPerSplit

        public org.apache.hadoop.mapred.InputSplit[] getSplitsByRecordsPerSplit​(org.apache.hadoop.mapred.JobConf job,
                                                                                long maxRecordsPerSplit)