com.linkedin.venice.hadoop.task.datawriter.AbstractDataWriterTask

com.linkedin.venice.hadoop.ValidateSchemaAndBuildDictMapper

All Implemented Interfaces: Closeable, AutoCloseable, org.apache.hadoop.io.Closeable, org.apache.hadoop.mapred.JobConfigurable, org.apache.hadoop.mapred.Mapper<org.apache.hadoop.io.IntWritable,org.apache.hadoop.io.NullWritable,org.apache.avro.mapred.AvroWrapper<org.apache.avro.specific.SpecificRecord>,org.apache.hadoop.io.NullWritable>

public class ValidateSchemaAndBuildDictMapper extends AbstractDataWriterTask implements org.apache.hadoop.mapred.Mapper<org.apache.hadoop.io.IntWritable,org.apache.hadoop.io.NullWritable,org.apache.avro.mapred.AvroWrapper<org.apache.avro.specific.SpecificRecord>,org.apache.hadoop.io.NullWritable>, org.apache.hadoop.mapred.JobConfigurable

A mapper-only MR job that validates the schema, builds a compression dictionary if needed, and persists some data (total file size and compression dictionary) in HDFS for use by the VPJ driver.

Note: all files in this split are processed sequentially. If this significantly increases the mapper time or causes timeouts, it needs to be revisited and done via a thread pool.

TODO: Ideally, this class should not extend AbstractDataWriterTask. This MR job is going to be removed soon, and the handling will be moved to the VPJ driver.
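The trade-off in the note above can be illustrated with a small, self-contained sketch. It is not Venice code: `SplitProcessingSketch` and `FileProcessor` are hypothetical names, and the per-file check is reduced to a callback that returns `true` on success. The first method mirrors the sequential behavior described in the note; the second shows one way the same work might be fanned out over a fixed-size thread pool.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Illustrative only: hypothetical names, not part of the Venice API. */
final class SplitProcessingSketch {
  /** Hypothetical stand-in for a per-file check; returns true on success. */
  interface FileProcessor {
    boolean process(int fileIdx) throws Exception;
  }

  /** Sequential processing: stops at the first file that fails. */
  static boolean processSequentially(int fileCount, FileProcessor processor) throws Exception {
    for (int fileIdx = 0; fileIdx < fileCount; fileIdx++) {
      if (!processor.process(fileIdx)) {
        return false;
      }
    }
    return true;
  }

  /** Thread-pool variant: submits one task per file and waits for all results. */
  static boolean processWithThreadPool(int fileCount, int threads, FileProcessor processor)
      throws InterruptedException, ExecutionException {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<Future<Boolean>> results = new ArrayList<>();
      for (int i = 0; i < fileCount; i++) {
        final int fileIdx = i;
        // The lambda returns a value, so it is submitted as a Callable<Boolean>.
        results.add(pool.submit(() -> processor.process(fileIdx)));
      }
      boolean allSucceeded = true;
      for (Future<Boolean> result : results) {
        allSucceeded &= result.get();
      }
      return allSucceeded;
    } finally {
      pool.shutdownNow();
    }
  }
}
```

In a parallel variant, any state shared across files (for example a running total of file sizes) would need to be updated in a thread-safe way; the sketch sidesteps this by having each task return only a boolean.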

Field Summary

Fields

Modifier and Type

Field

Description

protected boolean

hasReportedFailure

protected InputDataInfoProvider.InputDataInfo

inputDataInfo

protected DefaultInputDataInfoProvider

inputDataInfoProvider

protected String

inputDirectory

protected Long

inputFileDataSize

protected boolean

isZstdDictCreationRequired

protected PushJobSetting

pushJobSetting

Fields inherited from class com.linkedin.venice.hadoop.task.datawriter.AbstractDataWriterTask
TASK_ID_NOT_SET
Constructor Summary

Constructors

Constructor

Description

ValidateSchemaAndBuildDictMapper()
Method Summary

Modifier and Type

Method

Description

protected void

checkLastModificationTimeAndLogError(Exception e, String errorString)

protected void

checkLastModificationTimeAndLogError(Exception exception, String errorString, org.apache.hadoop.mapred.Reporter reporter)

protected void

checkLastModificationTimeAndLogError(String errorString, org.apache.hadoop.mapred.Reporter reporter)

void

close()

void

configure(org.apache.hadoop.mapred.JobConf job)

protected void

configureTask(VeniceProperties props)

Allow implementations of this class to configure task-specific stuff.

protected void

initInputData(VeniceProperties props)

void

map(org.apache.hadoop.io.IntWritable inputKey, org.apache.hadoop.io.NullWritable inputValue, org.apache.hadoop.mapred.OutputCollector<org.apache.avro.mapred.AvroWrapper<org.apache.avro.specific.SpecificRecord>,org.apache.hadoop.io.NullWritable> output, org.apache.hadoop.mapred.Reporter reporter)

protected boolean

processInput(int fileIdx, org.apache.hadoop.mapred.Reporter reporter)

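Validates this file's schema against the first file's schema, calculates the total file size, and collects a sample for the dictionary from this file if enabled.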

Methods inherited from class com.linkedin.venice.hadoop.task.datawriter.AbstractDataWriterTask
configure, getEngineTaskConfigProvider, getPartitionCount, getTaskId, isChunkingEnabled, isRmdChunkingEnabled, setChunkingEnabled

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Details
- inputDataInfoProvider
  
  protected DefaultInputDataInfoProvider inputDataInfoProvider
- inputDataInfo
  
  protected InputDataInfoProvider.InputDataInfo inputDataInfo
- pushJobSetting
  
  protected PushJobSetting pushJobSetting
- isZstdDictCreationRequired
  
  protected boolean isZstdDictCreationRequired
- hasReportedFailure
  
  protected boolean hasReportedFailure
- inputDirectory
  
  protected String inputDirectory
- inputFileDataSize
  
  protected Long inputFileDataSize
Constructor Details
- ValidateSchemaAndBuildDictMapper
  
  public ValidateSchemaAndBuildDictMapper()
Method Details
- map
  
  public void map(org.apache.hadoop.io.IntWritable inputKey, org.apache.hadoop.io.NullWritable inputValue, org.apache.hadoop.mapred.OutputCollector<org.apache.avro.mapred.AvroWrapper<org.apache.avro.specific.SpecificRecord>,org.apache.hadoop.io.NullWritable> output, org.apache.hadoop.mapred.Reporter reporter) throws IOException
  
  Specified by:
  
  map in interface org.apache.hadoop.mapred.Mapper<org.apache.hadoop.io.IntWritable,org.apache.hadoop.io.NullWritable,org.apache.avro.mapred.AvroWrapper<org.apache.avro.specific.SpecificRecord>,org.apache.hadoop.io.NullWritable>
  
  Throws:
  
  IOException
- processInput
  
  protected boolean processInput(int fileIdx, org.apache.hadoop.mapred.Reporter reporter) throws IOException
  
  1. Validates this file's schema against the first file's schema.
  2. Calculates the total file size.
  3. Collects a sample for the dictionary from this file, if enabled.
  
  Throws:
  
  IOException
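The three steps listed for processInput can be modeled roughly as follows. This is a minimal sketch under stated assumptions, not the Venice implementation: `InputFileSketch` and `ProcessInputSketch` are hypothetical types, schemas are compared as plain strings, the sample cap `maxSamples` is invented for illustration, and the boolean return value is assumed to signal success.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for one input file; not a Venice type. */
final class InputFileSketch {
  final String schema;        // schema text, assumed to be read from the file
  final long sizeInBytes;     // file size, assumed to come from the file system
  final List<byte[]> records; // raw records that could serve as dictionary samples

  InputFileSketch(String schema, long sizeInBytes, List<byte[]> records) {
    this.schema = schema;
    this.sizeInBytes = sizeInBytes;
    this.records = records;
  }
}

/** Illustrative model of the three per-file steps. */
final class ProcessInputSketch {
  private final String firstFileSchema;
  private final boolean dictCreationRequired;
  private final int maxSamples; // hypothetical cap on collected samples
  private final List<byte[]> dictSamples = new ArrayList<>();
  private long totalFileSize = 0L;

  ProcessInputSketch(String firstFileSchema, boolean dictCreationRequired, int maxSamples) {
    this.firstFileSchema = firstFileSchema;
    this.dictCreationRequired = dictCreationRequired;
    this.maxSamples = maxSamples;
  }

  /** Returns true if the file was processed, false if its schema does not match. */
  boolean processInput(InputFileSketch file) {
    // Step 1: validate this file's schema against the first file's schema.
    if (!firstFileSchema.equals(file.schema)) {
      return false;
    }
    // Step 2: add this file's size to the running total.
    totalFileSize += file.sizeInBytes;
    // Step 3: collect dictionary samples from this file, only if enabled.
    if (dictCreationRequired) {
      for (byte[] record : file.records) {
        if (dictSamples.size() >= maxSamples) {
          break;
        }
        dictSamples.add(record);
      }
    }
    return true;
  }

  long getTotalFileSize() {
    return totalFileSize;
  }

  int getSampleCount() {
    return dictSamples.size();
  }
}
```

In this sketch a file with a mismatched schema contributes neither to the total size nor to the samples, because validation runs first.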
- checkLastModificationTimeAndLogError
  
  protected void checkLastModificationTimeAndLogError(Exception e, String errorString) throws IOException
  
  Throws:
  
  IOException
- checkLastModificationTimeAndLogError
  
  protected void checkLastModificationTimeAndLogError(String errorString, org.apache.hadoop.mapred.Reporter reporter) throws IOException
  
  Throws:
  
  IOException
- checkLastModificationTimeAndLogError
  
  protected void checkLastModificationTimeAndLogError(Exception exception, String errorString, org.apache.hadoop.mapred.Reporter reporter) throws IOException
  
  Throws:
  
  IOException
- initInputData
  
  protected void initInputData(VeniceProperties props) throws Exception
  
  Throws:
  
  Exception
- configureTask
  
  protected void configureTask(VeniceProperties props)
  
  Description copied from class: AbstractDataWriterTask
  
  Allow implementations of this class to configure task-specific stuff.
  
  Specified by:
  
  configureTask in class AbstractDataWriterTask
  
  Parameters:
  
  props - the job props that the task was configured with.
- configure
  
  public void configure(org.apache.hadoop.mapred.JobConf job)
  
  Specified by:
  
  configure in interface org.apache.hadoop.mapred.JobConfigurable
- close
  
  public void close()
  
  Specified by:
  
  close in interface AutoCloseable
  
  Specified by:
  
  close in interface Closeable
