Class ValidateSchemaAndBuildDictMapper

  • All Implemented Interfaces:
    java.io.Closeable, java.lang.AutoCloseable, org.apache.hadoop.io.Closeable, org.apache.hadoop.mapred.JobConfigurable, org.apache.hadoop.mapred.Mapper<org.apache.hadoop.io.IntWritable,​org.apache.hadoop.io.NullWritable,​org.apache.avro.mapred.AvroWrapper<org.apache.avro.specific.SpecificRecord>,​org.apache.hadoop.io.NullWritable>

    public class ValidateSchemaAndBuildDictMapper
    extends AbstractDataWriterTask
    implements org.apache.hadoop.mapred.Mapper<org.apache.hadoop.io.IntWritable,​org.apache.hadoop.io.NullWritable,​org.apache.avro.mapred.AvroWrapper<org.apache.avro.specific.SpecificRecord>,​org.apache.hadoop.io.NullWritable>, org.apache.hadoop.mapred.JobConfigurable
    A mapper-only MapReduce (MR) job that validates the input schema, builds a compression dictionary if needed, and persists some data (the total file size and the compression dictionary) in HDFS for use by the VPJ driver.
    Note: all files in this split are processed sequentially. If this significantly increases mapper time or causes timeouts, the processing should be revisited and moved to a thread pool.
    TODO: Ideally, this class should not extend AbstractDataWriterTask. This MR job is expected to be removed soon, with its handling moved to the VPJ driver.
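    The note about sequential versus thread-pool processing can be illustrated with a minimal sketch. It uses only the JDK and is not the class's actual code: SplitProcessingSketch, processSequentially, processWithThreadPool, and the processFile callback are hypothetical names, and stopping at the first failed file in the sequential variant is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.IntPredicate;

// Hypothetical illustration only; these names do not exist in the documented class.
class SplitProcessingSketch {

    // Sequential variant: handle one file index after another.
    // Assumption: processing stops at the first file that reports a failure.
    static boolean processSequentially(int fileCount, IntPredicate processFile) {
        for (int fileIdx = 0; fileIdx < fileCount; fileIdx++) {
            if (!processFile.test(fileIdx)) {
                return false;
            }
        }
        return true;
    }

    // Thread-pool variant: submit one task per file and wait for all of them.
    // processFile must be thread-safe, since tasks may run concurrently.
    static boolean processWithThreadPool(int fileCount, int threads, IntPredicate processFile) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<Boolean>> tasks = new ArrayList<>();
            for (int i = 0; i < fileCount; i++) {
                final int fileIdx = i;
                tasks.add(() -> processFile.test(fileIdx));
            }
            boolean allOk = true;
            // invokeAll blocks until every task has completed.
            for (Future<Boolean> result : pool.invokeAll(tasks)) {
                if (!result.get()) {
                    allOk = false;
                }
            }
            return allOk;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while processing files", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("A file-processing task failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

    In the pooled variant, any state shared between files (for example a running size total or a list of dictionary samples) would need synchronization, which the sequential version avoids.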
    • Field Detail

      • isZstdDictCreationRequired

        protected boolean isZstdDictCreationRequired
      • hasReportedFailure

        protected boolean hasReportedFailure
      • inputDirectory

        protected java.lang.String inputDirectory
      • inputFileDataSize

        protected java.lang.Long inputFileDataSize
    • Constructor Detail

      • ValidateSchemaAndBuildDictMapper

        public ValidateSchemaAndBuildDictMapper()
    • Method Detail

      • map

        public void map​(org.apache.hadoop.io.IntWritable inputKey,
                        org.apache.hadoop.io.NullWritable inputValue,
                        org.apache.hadoop.mapred.OutputCollector<org.apache.avro.mapred.AvroWrapper<org.apache.avro.specific.SpecificRecord>,​org.apache.hadoop.io.NullWritable> output,
                        org.apache.hadoop.mapred.Reporter reporter)
                 throws java.io.IOException
        Specified by:
        map in interface org.apache.hadoop.mapred.Mapper<org.apache.hadoop.io.IntWritable,​org.apache.hadoop.io.NullWritable,​org.apache.avro.mapred.AvroWrapper<org.apache.avro.specific.SpecificRecord>,​org.apache.hadoop.io.NullWritable>
        Throws:
        java.io.IOException
      • processInput

        protected boolean processInput​(int fileIdx,
                                       org.apache.hadoop.mapred.Reporter reporter)
                                throws java.io.IOException
        1. Validates this file's schema against the first file's schema.
        2. Calculates the total file size.
        3. Collects a sample for the dictionary from this file, if enabled.
        Throws:
        java.io.IOException
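        The three steps listed for processInput can be sketched in plain Java, without Hadoop or Avro dependencies. This is an illustration under stated assumptions, not the actual implementation: ProcessInputSketch and InputFile are hypothetical, schemas are compared as plain strings, and returning false on a schema mismatch is an assumption, since the page does not document what the boolean return value means.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the mapper; none of these names come from the documented class.
class ProcessInputSketch {

    // Simplified view of one input file: schema text, size in bytes, and one sample record.
    static final class InputFile {
        final String schema;
        final long sizeInBytes;
        final byte[] sample;

        InputFile(String schema, long sizeInBytes, byte[] sample) {
            this.schema = schema;
            this.sizeInBytes = sizeInBytes;
            this.sample = sample;
        }
    }

    private final List<InputFile> files;
    private final boolean dictCreationRequired;
    private final List<byte[]> dictSamples = new ArrayList<>();
    private long totalSize = 0L;

    ProcessInputSketch(List<InputFile> files, boolean dictCreationRequired) {
        this.files = files;
        this.dictCreationRequired = dictCreationRequired;
    }

    // Assumption: false signals a validation failure, true signals success.
    boolean processInput(int fileIdx) {
        InputFile current = files.get(fileIdx);

        // Step 1: validate this file's schema against the first file's schema.
        if (!files.get(0).schema.equals(current.schema)) {
            return false;
        }

        // Step 2: add this file's size to the running total.
        totalSize += current.sizeInBytes;

        // Step 3: collect a dictionary sample from this file, if enabled.
        if (dictCreationRequired) {
            dictSamples.add(current.sample);
        }
        return true;
    }

    long getTotalSize() {
        return totalSize;
    }

    int getSampleCount() {
        return dictSamples.size();
    }
}
```

        In this sketch a file with a mismatched schema contributes neither to the size total nor to the dictionary samples; whether the real mapper behaves the same way is not stated on this page.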
      • checkLastModificationTimeAndLogError

        protected void checkLastModificationTimeAndLogError​(java.lang.Exception e,
                                                            java.lang.String errorString)
                                                     throws java.io.IOException
        Throws:
        java.io.IOException
      • checkLastModificationTimeAndLogError

        protected void checkLastModificationTimeAndLogError​(java.lang.String errorString,
                                                            org.apache.hadoop.mapred.Reporter reporter)
                                                     throws java.io.IOException
        Throws:
        java.io.IOException
      • checkLastModificationTimeAndLogError

        protected void checkLastModificationTimeAndLogError​(java.lang.Exception exception,
                                                            java.lang.String errorString,
                                                            org.apache.hadoop.mapred.Reporter reporter)
                                                     throws java.io.IOException
        Throws:
        java.io.IOException
      • initInputData

        protected void initInputData​(VeniceProperties props)
                              throws java.lang.Exception
        Throws:
        java.lang.Exception
      • configure

        public void configure​(org.apache.hadoop.mapred.JobConf job)
        Specified by:
        configure in interface org.apache.hadoop.mapred.JobConfigurable
      • close

        public void close()
        Specified by:
        close in interface java.lang.AutoCloseable
        Specified by:
        close in interface java.io.Closeable