Class ValidateSchemaAndBuildDictMapper

java.lang.Object
com.linkedin.venice.hadoop.task.datawriter.AbstractDataWriterTask
com.linkedin.venice.hadoop.ValidateSchemaAndBuildDictMapper
All Implemented Interfaces:
Closeable, AutoCloseable, org.apache.hadoop.io.Closeable, org.apache.hadoop.mapred.JobConfigurable, org.apache.hadoop.mapred.Mapper<org.apache.hadoop.io.IntWritable,org.apache.hadoop.io.NullWritable,org.apache.avro.mapred.AvroWrapper<org.apache.avro.specific.SpecificRecord>,org.apache.hadoop.io.NullWritable>

public class ValidateSchemaAndBuildDictMapper extends AbstractDataWriterTask implements org.apache.hadoop.mapred.Mapper<org.apache.hadoop.io.IntWritable,org.apache.hadoop.io.NullWritable,org.apache.avro.mapred.AvroWrapper<org.apache.avro.specific.SpecificRecord>,org.apache.hadoop.io.NullWritable>, org.apache.hadoop.mapred.JobConfigurable
Mapper only MR to Validate Schema, Build compression dictionary if needed and persist some data (total file size and compression dictionary) in HDFS to be used by the VPJ Driver Note: processing all the files in this split are done sequentially and if it results in significant increase in the mapper time or resulting in timeouts, this needs to be revisited to be done via a thread pool. TODO: Ideally, this class should not extend AbstractDataWriterTask. Soon, this MR job is going to be removed and the handling will be moved to the VPJ driver.
  • Field Details

    • inputDataInfoProvider

      protected DefaultInputDataInfoProvider inputDataInfoProvider
    • inputDataInfo

      protected InputDataInfoProvider.InputDataInfo inputDataInfo
    • pushJobSetting

      protected PushJobSetting pushJobSetting
    • isZstdDictCreationRequired

      protected boolean isZstdDictCreationRequired
    • hasReportedFailure

      protected boolean hasReportedFailure
    • inputDirectory

      protected String inputDirectory
    • inputFileDataSize

      protected Long inputFileDataSize
  • Constructor Details

    • ValidateSchemaAndBuildDictMapper

      public ValidateSchemaAndBuildDictMapper()
  • Method Details

    • map

      public void map(org.apache.hadoop.io.IntWritable inputKey, org.apache.hadoop.io.NullWritable inputValue, org.apache.hadoop.mapred.OutputCollector<org.apache.avro.mapred.AvroWrapper<org.apache.avro.specific.SpecificRecord>,org.apache.hadoop.io.NullWritable> output, org.apache.hadoop.mapred.Reporter reporter) throws IOException
      Specified by:
      map in interface org.apache.hadoop.mapred.Mapper<org.apache.hadoop.io.IntWritable,org.apache.hadoop.io.NullWritable,org.apache.avro.mapred.AvroWrapper<org.apache.avro.specific.SpecificRecord>,org.apache.hadoop.io.NullWritable>
      Throws:
      IOException
    • processInput

      protected boolean processInput(int fileIdx, org.apache.hadoop.mapred.Reporter reporter) throws IOException
      1. validate this file's schema against the first file's schema 2. Calculates total file size 3. Collect sample for dictionary from this file if enabled
      Throws:
      IOException
    • checkLastModificationTimeAndLogError

      protected void checkLastModificationTimeAndLogError(Exception e, String errorString) throws IOException
      Throws:
      IOException
    • checkLastModificationTimeAndLogError

      protected void checkLastModificationTimeAndLogError(String errorString, org.apache.hadoop.mapred.Reporter reporter) throws IOException
      Throws:
      IOException
    • checkLastModificationTimeAndLogError

      protected void checkLastModificationTimeAndLogError(Exception exception, String errorString, org.apache.hadoop.mapred.Reporter reporter) throws IOException
      Throws:
      IOException
    • initInputData

      protected void initInputData(VeniceProperties props) throws Exception
      Throws:
      Exception
    • configureTask

      protected void configureTask(VeniceProperties props)
      Description copied from class: AbstractDataWriterTask
      Allow implementations of this class to configure task-specific stuff.
      Specified by:
      configureTask in class AbstractDataWriterTask
      Parameters:
      props - the job props that the task was configured with.
    • configure

      public void configure(org.apache.hadoop.mapred.JobConf job)
      Specified by:
      configure in interface org.apache.hadoop.mapred.JobConfigurable
    • close

      public void close()
      Specified by:
      close in interface AutoCloseable
      Specified by:
      close in interface Closeable