org.apache.hadoop.mapred.FileOutputFormat<org.apache.avro.mapred.AvroWrapper<T>,org.apache.hadoop.io.NullWritable>

org.apache.avro.mapred.AvroOutputFormat

com.linkedin.venice.hadoop.ValidateSchemaAndBuildDictOutputFormat

All Implemented Interfaces: org.apache.hadoop.mapred.OutputFormat

public class ValidateSchemaAndBuildDictOutputFormat extends org.apache.avro.mapred.AvroOutputFormat

This class extends the parent output format in three ways:
1. It reuses an existing output directory and overwrites existing files, a case in which the parent class throws an exception, so that the output file path and name stay deterministic.
2. It sets custom permissions on the output directory and files so that only the push job owner can access personally identifiable information (e.g. the compression dictionary).
3. It sets the output path via FileOutputFormat.setOutputPath(org.apache.hadoop.mapred.JobConf, org.apache.hadoop.fs.Path).

Nested Class Summary

Nested classes/interfaces inherited from class org.apache.hadoop.mapred.FileOutputFormat
org.apache.hadoop.mapred.FileOutputFormat.Counter
Field Summary

Fields inherited from class org.apache.avro.mapred.AvroOutputFormat
DEFLATE_LEVEL_KEY, EXT, SYNC_INTERVAL_KEY, XZ_LEVEL_KEY, ZSTD_BUFFERPOOL_KEY, ZSTD_LEVEL_KEY
Constructor Summary

Constructors

Constructor

Description

ValidateSchemaAndBuildDictOutputFormat()
Method Summary

Modifier and Type

Method

Description

void

checkOutputSpecs(org.apache.hadoop.fs.FileSystem ignored, org.apache.hadoop.mapred.JobConf job)

org.apache.hadoop.mapred.RecordWriter

getRecordWriter(org.apache.hadoop.fs.FileSystem ignore, org.apache.hadoop.mapred.JobConf job, String name, org.apache.hadoop.util.Progressable prog)

Modify the output file name to be the MR job id to keep it unique.

protected static void

setValidateSchemaAndBuildDictionaryOutputDirPath(org.apache.hadoop.mapred.JobConf job)

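Sets up the output directory: a parent directory accessible by every user/group (777) and a unique sub-directory for this VPJ accessible only by the user who triggers it (700).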

Methods inherited from class org.apache.avro.mapred.AvroOutputFormat
setDeflateLevel, setSyncInterval

Methods inherited from class org.apache.hadoop.mapred.FileOutputFormat
getCompressOutput, getOutputCompressorClass, getOutputPath, getPathForCustomFile, getTaskOutputPath, getUniqueName, getWorkOutputPath, setCompressOutput, setOutputCompressorClass, setOutputPath, setWorkOutputPath

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Details
- ValidateSchemaAndBuildDictOutputFormat
  
  public ValidateSchemaAndBuildDictOutputFormat()
Method Details
- setValidateSchemaAndBuildDictionaryOutputDirPath
  
  protected static void setValidateSchemaAndBuildDictionaryOutputDirPath(org.apache.hadoop.mapred.JobConf job)
  
  1. The parent directory should be accessible by every user/group (777).
  2. The unique sub-directory for this VPJ should be accessible only by the user who triggers it (700), to prevent unauthorized access to PII (e.g. the Zstd compression dictionary).
  
  Parameters:
  
  job - mapred config
  
  Throws:
  
  IOException
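
The two-level permission scheme described above can be sketched on a local POSIX filesystem with java.nio.file. This is an illustration only, not the Venice implementation: the class name OutputDirPermissionsSketch, the method prepareOutputDir and the directory names are hypothetical, and the real method operates on a Hadoop JobConf and file system rather than on local paths.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.PosixFilePermissions;

public class OutputDirPermissionsSketch {
    /**
     * Hypothetical helper: ensures the shared parent directory is open to everyone (777)
     * and the per-job sub-directory is restricted to its owner (700).
     * Returns the per-job sub-directory.
     */
    public static Path prepareOutputDir(Path parent, String jobDirName) throws IOException {
        Files.createDirectories(parent);
        // Permissions are set after creation so that the process umask cannot narrow them.
        Files.setPosixFilePermissions(parent, PosixFilePermissions.fromString("rwxrwxrwx"));

        Path jobDir = parent.resolve(jobDirName);
        Files.createDirectories(jobDir);
        Files.setPosixFilePermissions(jobDir, PosixFilePermissions.fromString("rwx------"));
        return jobDir;
    }

    public static void main(String[] args) throws IOException {
        // A temporary directory stands in for the shared parent directory (assumption).
        Path parent = Files.createTempDirectory("vpj-parent");
        Path jobDir = prepareOutputDir(parent, "example-job-dir");
        System.out.println(PosixFilePermissions.toString(Files.getPosixFilePermissions(parent)));
        System.out.println(PosixFilePermissions.toString(Files.getPosixFilePermissions(jobDir)));
    }
}
```

Because the sub-directory is owner-only, files written inside it are unreachable by other users even when the files themselves carry default permissions.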
- checkOutputSpecs
  
  public void checkOutputSpecs(org.apache.hadoop.fs.FileSystem ignored, org.apache.hadoop.mapred.JobConf job) throws IOException
  
  Specified by:
  
  checkOutputSpecs in interface org.apache.hadoop.mapred.OutputFormat
  
  Overrides:
  
  checkOutputSpecs in class org.apache.hadoop.mapred.FileOutputFormat
  
  Throws:
  
  IOException
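
This page does not show the body of the override, but the class description says the parent class throws when the output directory already exists while this class reuses it. The contrast can be sketched with local paths as follows. The names OutputSpecCheckSketch, strictCheck and lenientCheck are hypothetical, and the null check in the lenient variant is an assumption about what a relaxed check might still validate.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

public class OutputSpecCheckSketch {
    /** Strict variant: refuses to proceed when the output directory already exists. */
    public static void strictCheck(Path outputDir) throws IOException {
        if (outputDir == null) {
            throw new IOException("Output directory not set");
        }
        if (Files.exists(outputDir)) {
            throw new FileAlreadyExistsException(outputDir.toString());
        }
    }

    /** Lenient variant: only requires that an output directory is configured. */
    public static void lenientCheck(Path outputDir) throws IOException {
        if (outputDir == null) {
            throw new IOException("Output directory not set");
        }
        // An existing directory is accepted so that a rerun can reuse it.
    }

    public static void main(String[] args) throws IOException {
        Path existing = Files.createTempDirectory("vpj-output");
        lenientCheck(existing); // passes: the directory is reused
        try {
            strictCheck(existing);
        } catch (FileAlreadyExistsException e) {
            System.out.println("strict check rejected existing directory");
        }
    }
}
```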
- getRecordWriter
  
  public org.apache.hadoop.mapred.RecordWriter getRecordWriter(org.apache.hadoop.fs.FileSystem ignore, org.apache.hadoop.mapred.JobConf job, String name, org.apache.hadoop.util.Progressable prog) throws IOException
  
  Modify the output file name to be the MR job id to keep it unique. The output file's permissions need no explicit control because its parent folder is already restricted.
  
  Specified by:
  
  getRecordWriter in interface org.apache.hadoop.mapred.OutputFormat
  
  Overrides:
  
  getRecordWriter in class org.apache.avro.mapred.AvroOutputFormat
  
  Throws:
  
  IOException
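
Naming the output file after the job id makes its path predictable, and overwriting on write lets a rerun reuse the same path. A minimal local-filesystem sketch of that idea follows. The class and method names are hypothetical, the job id value is made up, and appending the ".avro" extension is an assumption; the real method returns a Hadoop RecordWriter rather than writing bytes directly.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class DeterministicOutputFileSketch {
    // Assumed extension for illustration; not confirmed by this page.
    static final String EXT = ".avro";

    /** Derives the output file path from the job directory and the job id. */
    public static Path outputFile(Path jobDir, String jobId) {
        return jobDir.resolve(jobId + EXT);
    }

    /** Writes the payload, replacing any file left behind by an earlier attempt. */
    public static Path write(Path jobDir, String jobId, byte[] payload) throws IOException {
        Path file = outputFile(jobDir, jobId);
        Files.write(file, payload,
            StandardOpenOption.CREATE,
            StandardOpenOption.TRUNCATE_EXISTING,
            StandardOpenOption.WRITE);
        return file;
    }

    public static void main(String[] args) throws IOException {
        Path jobDir = Files.createTempDirectory("vpj-job-dir");
        String jobId = "job_0001"; // hypothetical job id
        write(jobDir, jobId, "first attempt".getBytes(StandardCharsets.UTF_8));
        Path file = write(jobDir, jobId, "second attempt".getBytes(StandardCharsets.UTF_8));
        System.out.println(file.getFileName());
    }
}
```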
