Package com.linkedin.venice.hadoop
Class ValidateSchemaAndBuildDictOutputFormat
java.lang.Object
org.apache.hadoop.mapred.FileOutputFormat<org.apache.avro.mapred.AvroWrapper<T>,org.apache.hadoop.io.NullWritable>
org.apache.avro.mapred.AvroOutputFormat
com.linkedin.venice.hadoop.ValidateSchemaAndBuildDictOutputFormat
- All Implemented Interfaces:
org.apache.hadoop.mapred.OutputFormat
public class ValidateSchemaAndBuildDictOutputFormat
extends org.apache.avro.mapred.AvroOutputFormat
This class provides a way to:
1. Reuse the existing output directory and overwrite existing files, which throws an exception in
the parent class. This keeps the output file path and name deterministic.
2. Set custom permissions on the output directory and files so that only the push job owner can
access personally identifiable information (e.g. the compression dictionary).
3. Set the output path via
FileOutputFormat.setOutputPath(org.apache.hadoop.mapred.JobConf, org.apache.hadoop.fs.Path)
-
Nested Class Summary
Nested classes/interfaces inherited from class org.apache.hadoop.mapred.FileOutputFormat
org.apache.hadoop.mapred.FileOutputFormat.Counter
-
Field Summary
Fields inherited from class org.apache.avro.mapred.AvroOutputFormat
DEFLATE_LEVEL_KEY, EXT, SYNC_INTERVAL_KEY, XZ_LEVEL_KEY, ZSTD_BUFFERPOOL_KEY, ZSTD_LEVEL_KEY
-
Constructor Summary
-
Method Summary
Modifier and Type | Method | Description
void | checkOutputSpecs(org.apache.hadoop.fs.FileSystem ignored, org.apache.hadoop.mapred.JobConf job) |
org.apache.hadoop.mapred.RecordWriter | getRecordWriter(org.apache.hadoop.fs.FileSystem ignore, org.apache.hadoop.mapred.JobConf job, String name, org.apache.hadoop.util.Progressable prog) | Modify the output file name to be the MR job id to keep it unique.
protected static void | setValidateSchemaAndBuildDictionaryOutputDirPath(org.apache.hadoop.mapred.JobConf job) | The parent directory should be accessible by every user/group (777); the unique sub-directory for this VPJ only by the user who triggers it (700).
Methods inherited from class org.apache.avro.mapred.AvroOutputFormat
setDeflateLevel, setSyncInterval
Methods inherited from class org.apache.hadoop.mapred.FileOutputFormat
getCompressOutput, getOutputCompressorClass, getOutputPath, getPathForCustomFile, getTaskOutputPath, getUniqueName, getWorkOutputPath, setCompressOutput, setOutputCompressorClass, setOutputPath, setWorkOutputPath
-
Constructor Details
-
ValidateSchemaAndBuildDictOutputFormat
public ValidateSchemaAndBuildDictOutputFormat()
-
-
Method Details
-
setValidateSchemaAndBuildDictionaryOutputDirPath
protected static void setValidateSchemaAndBuildDictionaryOutputDirPath(org.apache.hadoop.mapred.JobConf job)
1. The parent directory should be accessible by every user/group (777).
2. The unique sub-directory for this VPJ should be accessible only by the user who triggers it (700), to prevent unauthorized access to PII (e.g. the Zstd compression dictionary).
- Parameters:
job - mapred config
- Throws:
IOException
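The two-level permission scheme described for this method can be illustrated on a local POSIX file system. The following is a minimal sketch using only java.nio.file, not the Venice implementation (which operates on a Hadoop file system through the JobConf). The class name DictOutputDirSketch, the method prepareOutputDir, and the sub-directory name are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.PosixFilePermissions;

// Hypothetical illustration only: a local-file-system analogue of the 777/700 layout.
public class DictOutputDirSketch {

  /**
   * Creates (or reuses) a world-accessible parent directory and an owner-only
   * sub-directory inside it, then returns the sub-directory.
   */
  public static Path prepareOutputDir(Path parent, String jobSubDir) throws IOException {
    Files.createDirectories(parent);
    // Permissions are set explicitly after creation because the process umask
    // can strip bits that are requested at creation time.
    Files.setPosixFilePermissions(parent, PosixFilePermissions.fromString("rwxrwxrwx"));

    Path child = parent.resolve(jobSubDir);
    Files.createDirectories(child);
    // Only the owner may list, read, or write inside the per-job directory.
    Files.setPosixFilePermissions(child, PosixFilePermissions.fromString("rwx------"));
    return child;
  }

  public static void main(String[] args) throws IOException {
    Path parent = Files.createTempDirectory("vpj-parent");
    Path child = prepareOutputDir(parent, "example-job");
    System.out.println(parent + " " + PosixFilePermissions.toString(Files.getPosixFilePermissions(parent)));
    System.out.println(child + " " + PosixFilePermissions.toString(Files.getPosixFilePermissions(child)));
  }
}
```

Because the sub-directory is owner-only, files written inside it are unreachable by other users regardless of their own mode bits, which is the reasoning getRecordWriter relies on.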
-
checkOutputSpecs
public void checkOutputSpecs(org.apache.hadoop.fs.FileSystem ignored, org.apache.hadoop.mapred.JobConf job) throws IOException
- Specified by:
checkOutputSpecs in interface org.apache.hadoop.mapred.OutputFormat
- Overrides:
checkOutputSpecs in class org.apache.hadoop.mapred.FileOutputFormat
- Throws:
IOException
-
getRecordWriter
public org.apache.hadoop.mapred.RecordWriter getRecordWriter(org.apache.hadoop.fs.FileSystem ignore, org.apache.hadoop.mapred.JobConf job, String name, org.apache.hadoop.util.Progressable prog) throws IOException
Modify the output file name to be the MR job id to keep it unique. No need to explicitly control the permissions for the output file, as its parent folder is restricted anyway.
- Specified by:
getRecordWriter in interface org.apache.hadoop.mapred.OutputFormat
- Overrides:
getRecordWriter in class org.apache.avro.mapred.AvroOutputFormat
- Throws:
IOException
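The naming and overwrite behaviour can be sketched without Hadoop as follows. This is an assumption-laden illustration, not the class's actual code: the class JobIdOutputFileSketch, its methods, and the way the job id is passed in as a plain string are all hypothetical, and the source does not state how the real method obtains the MR job id. The ".avro" extension mirrors the inherited EXT field.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical illustration only: job-id-based file naming with overwrite semantics.
public class JobIdOutputFileSketch {
  // Same value as the inherited AvroOutputFormat.EXT field.
  static final String EXT = ".avro";

  /** The output file name depends only on the job id, so it is predictable for a given job. */
  public static Path outputFile(Path outputDir, String jobId) {
    return outputDir.resolve(jobId + EXT);
  }

  /** Reuses the directory if present and replaces any existing file instead of failing. */
  public static Path writeOutput(Path outputDir, String jobId, byte[] payload) throws IOException {
    Files.createDirectories(outputDir); // no error when the directory already exists
    Path file = outputFile(outputDir, jobId);
    Files.write(file, payload,
        StandardOpenOption.CREATE,
        StandardOpenOption.TRUNCATE_EXISTING,
        StandardOpenOption.WRITE);
    return file;
  }

  public static void main(String[] args) throws IOException {
    Path dir = Files.createTempDirectory("vpj-output");
    writeOutput(dir, "example_job_id", new byte[] {1, 2, 3});
    // A second write for the same job id lands on the same path and replaces the content.
    Path file = writeOutput(dir, "example_job_id", new byte[] {4, 5});
    System.out.println(file + " size=" + Files.size(file));
  }
}
```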
-