Class ValidateSchemaAndBuildDictOutputFormat

  • All Implemented Interfaces:
    org.apache.hadoop.mapred.OutputFormat

    public class ValidateSchemaAndBuildDictOutputFormat
    extends org.apache.avro.mapred.AvroOutputFormat
    This output format extends AvroOutputFormat in order to:
    1. Reuse an existing output directory and overwrite existing files, which the parent class rejects with an exception. This keeps the output file path and name deterministic.
    2. Set custom permissions on the output directory and files so that only the owner of the push job can access personally identifiable information (e.g. the compression dictionary).
    3. Set the job output path by calling FileOutputFormat.setOutputPath(org.apache.hadoop.mapred.JobConf, org.apache.hadoop.fs.Path).
    • Nested Class Summary

      • Nested classes/interfaces inherited from class org.apache.hadoop.mapred.FileOutputFormat

        org.apache.hadoop.mapred.FileOutputFormat.Counter
    • Field Summary

      • Fields inherited from class org.apache.avro.mapred.AvroOutputFormat

        DEFLATE_LEVEL_KEY, EXT, SYNC_INTERVAL_KEY, XZ_LEVEL_KEY, ZSTD_BUFFERPOOL_KEY, ZSTD_LEVEL_KEY
    • Method Summary

      Modifier and Type Method Description
      void checkOutputSpecs(org.apache.hadoop.fs.FileSystem ignored, org.apache.hadoop.mapred.JobConf job)
      org.apache.hadoop.mapred.RecordWriter getRecordWriter(org.apache.hadoop.fs.FileSystem ignore, org.apache.hadoop.mapred.JobConf job, java.lang.String name, org.apache.hadoop.util.Progressable prog)
      Modify the output file name to be the MR job id to keep it unique.
      protected static void setValidateSchemaAndBuildDictionaryOutputDirPath(org.apache.hadoop.mapred.JobConf job)
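      Sets up the output directories: a parent directory accessible by every user/group (777) and a unique sub-directory for this VPJ accessible only by the user who triggers it (700).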
      • Methods inherited from class org.apache.avro.mapred.AvroOutputFormat

        setDeflateLevel, setSyncInterval
      • Methods inherited from class org.apache.hadoop.mapred.FileOutputFormat

        getCompressOutput, getOutputCompressorClass, getOutputPath, getPathForCustomFile, getTaskOutputPath, getUniqueName, getWorkOutputPath, setCompressOutput, setOutputCompressorClass, setOutputPath, setWorkOutputPath
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Constructor Detail

      • ValidateSchemaAndBuildDictOutputFormat

        public ValidateSchemaAndBuildDictOutputFormat()
    • Method Detail

      • setValidateSchemaAndBuildDictionaryOutputDirPath

        protected static void setValidateSchemaAndBuildDictionaryOutputDirPath(org.apache.hadoop.mapred.JobConf job)
                                                                        throws java.io.IOException
        Sets up the output directories with the following permissions:
        1. The parent directory should be accessible by every user/group (777).
        2. The unique sub-directory for this VPJ should be accessible only by the user who triggers it (700), to prevent unauthorized access to PII (e.g. the Zstd compression dictionary).
        Parameters:
        job - the mapred job configuration (JobConf)
        Throws:
        java.io.IOException
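
The 777/700 layout described above can be illustrated with a minimal sketch. It uses java.nio.file on a local POSIX file system as a stand-in for the Hadoop FileSystem API, so it is not this method's implementation. The class name OutputDirPermissionsSketch, the method prepare, and the directory names are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.PosixFilePermissions;

// Hypothetical illustration only: local-file-system analogue of the
// "777 parent, 700 per-job sub-directory" layout. Assumes a POSIX file system.
final class OutputDirPermissionsSketch {

  // Creates parent/jobDirName (if missing) and applies the two permission sets.
  static Path prepare(Path parent, String jobDirName) throws IOException {
    Files.createDirectories(parent);
    // Permissions are set explicitly after creation so the process umask
    // cannot narrow them. The caller must own the directory for this to work.
    Files.setPosixFilePermissions(parent, PosixFilePermissions.fromString("rwxrwxrwx"));

    Path jobDir = parent.resolve(jobDirName);
    Files.createDirectories(jobDir);
    // Only the owner may read, write, or list the per-job directory.
    Files.setPosixFilePermissions(jobDir, PosixFilePermissions.fromString("rwx------"));
    return jobDir;
  }

  public static void main(String[] args) throws IOException {
    Path parent = Files.createTempDirectory("vpj-parent");
    Path jobDir = prepare(parent, "push-job-example");
    System.out.println(PosixFilePermissions.toString(Files.getPosixFilePermissions(parent)));
    System.out.println(PosixFilePermissions.toString(Files.getPosixFilePermissions(jobDir)));
  }
}
```

Because files inside the 700 directory cannot be reached by other users, the files themselves do not need separate permission handling in this sketch.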
      • checkOutputSpecs

        public void checkOutputSpecs(org.apache.hadoop.fs.FileSystem ignored,
                                     org.apache.hadoop.mapred.JobConf job)
                              throws java.io.IOException
        Specified by:
        checkOutputSpecs in interface org.apache.hadoop.mapred.OutputFormat
        Overrides:
        checkOutputSpecs in class org.apache.hadoop.mapred.FileOutputFormat
        Throws:
        java.io.IOException
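
The class description states that reusing an existing output directory throws an exception in the parent class; in Hadoop's mapred FileOutputFormat, checkOutputSpecs rejects an output directory that already exists. The sketch below contrasts that strict check with a reuse-friendly one. It uses java.nio.file instead of Hadoop, and the names OutputSpecCheckSketch, strictCheck, and reuseCheck are hypothetical; what this override actually validates is not documented here.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical illustration only: strict versus reuse-friendly output checks.
final class OutputSpecCheckSketch {

  // Mirrors the parent-style behavior: an existing output directory is an error.
  static void strictCheck(Path outputDir) throws IOException {
    if (outputDir == null) {
      throw new IOException("Output directory not set");
    }
    if (Files.exists(outputDir)) {
      throw new FileAlreadyExistsException(outputDir.toString());
    }
  }

  // Reuse-friendly variant: the path must be set, but it may already exist.
  static void reuseCheck(Path outputDir) throws IOException {
    if (outputDir == null) {
      throw new IOException("Output directory not set");
    }
    // An existing directory is accepted, so the same path can be reused.
  }

  public static void main(String[] args) throws IOException {
    Path existing = Files.createTempDirectory("existing-output");
    reuseCheck(existing); // passes
    try {
      strictCheck(existing);
    } catch (FileAlreadyExistsException e) {
      System.out.println("strict check rejected the existing directory");
    }
  }
}
```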
      • getRecordWriter

        public org.apache.hadoop.mapred.RecordWriter getRecordWriter(org.apache.hadoop.fs.FileSystem ignore,
                                                                     org.apache.hadoop.mapred.JobConf job,
                                                                     java.lang.String name,
                                                                     org.apache.hadoop.util.Progressable prog)
                                                              throws java.io.IOException
        Sets the output file name to the MR job ID to keep it unique. The output file's permissions do not need to be controlled explicitly, because its parent directory is already restricted.
        Specified by:
        getRecordWriter in interface org.apache.hadoop.mapred.OutputFormat
        Overrides:
        getRecordWriter in class org.apache.avro.mapred.AvroOutputFormat
        Throws:
        java.io.IOException
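
As a rough illustration of naming the output file after the job ID and overwriting a previous file, here is a sketch using java.nio.file rather than the Hadoop and Avro writers. The class JobIdOutputFileSketch, its methods, and the sample job ID are hypothetical. Appending the ".avro" extension (the value of AvroOutputFormat.EXT) is an assumption about the final file name, not something this page confirms.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical illustration only: job-ID-based file naming with overwrite.
final class JobIdOutputFileSketch {

  // Same value as AvroOutputFormat.EXT; using it here is an assumption.
  static final String EXT = ".avro";

  // The file name depends only on the job ID, so it is predictable per job.
  static Path outputFileFor(Path jobDir, String jobId) {
    return jobDir.resolve(jobId + EXT);
  }

  // Writes the payload, replacing any file left by an earlier attempt.
  static Path writeOverwriting(Path jobDir, String jobId, byte[] payload) throws IOException {
    Path file = outputFileFor(jobDir, jobId);
    Files.write(file, payload,
        StandardOpenOption.CREATE,
        StandardOpenOption.TRUNCATE_EXISTING,
        StandardOpenOption.WRITE);
    return file;
  }

  public static void main(String[] args) throws IOException {
    Path jobDir = Files.createTempDirectory("vpj-job-dir");
    String jobId = "job_0000000000000_0001"; // made-up job ID for the example
    writeOverwriting(jobDir, jobId, "first attempt".getBytes(StandardCharsets.UTF_8));
    Path file = writeOverwriting(jobDir, jobId, "retry".getBytes(StandardCharsets.UTF_8));
    System.out.println(file.getFileName() + " holds " + Files.size(file) + " bytes");
  }
}
```

Writing with TRUNCATE_EXISTING means a retried job replaces its earlier output instead of failing, which is the overwrite behavior the class description refers to.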