Package com.linkedin.venice.hadoop
Class ValidateSchemaAndBuildDictMapper
java.lang.Object
com.linkedin.venice.hadoop.task.datawriter.AbstractDataWriterTask
com.linkedin.venice.hadoop.ValidateSchemaAndBuildDictMapper
- All Implemented Interfaces:
Closeable, AutoCloseable, org.apache.hadoop.io.Closeable, org.apache.hadoop.mapred.JobConfigurable, org.apache.hadoop.mapred.Mapper<org.apache.hadoop.io.IntWritable, org.apache.hadoop.io.NullWritable, org.apache.avro.mapred.AvroWrapper<org.apache.avro.specific.SpecificRecord>, org.apache.hadoop.io.NullWritable>
public class ValidateSchemaAndBuildDictMapper
extends AbstractDataWriterTask
implements org.apache.hadoop.mapred.Mapper<org.apache.hadoop.io.IntWritable,org.apache.hadoop.io.NullWritable,org.apache.avro.mapred.AvroWrapper<org.apache.avro.specific.SpecificRecord>,org.apache.hadoop.io.NullWritable>, org.apache.hadoop.mapred.JobConfigurable
A mapper-only MapReduce (MR) job that validates the input schema, builds a compression dictionary if needed, and persists
some data (total file size and compression dictionary) in HDFS for use by the VPJ driver.
Note: all the files in this split are processed sequentially. If this
significantly increases the mapper time or results in timeouts,
the processing needs to be revisited and done via a thread pool.
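As an illustration of the thread-pool alternative mentioned in the note, here is a minimal sketch using the standard java.util.concurrent API. It is not the Venice implementation: the class name ParallelSplitSketch, the method processAll, and the processFile callback are all hypothetical stand-ins for the per-file work.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.IntPredicate;

// Hypothetical sketch: none of these names come from the Venice code base.
class ParallelSplitSketch {
  /**
   * Runs a per-file check for file indexes [0, fileCount) on a fixed-size pool.
   * Returns true only if every check returned true.
   */
  static boolean processAll(int fileCount, int threads, IntPredicate processFile) {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<Callable<Boolean>> tasks = new ArrayList<>();
      for (int i = 0; i < fileCount; i++) {
        final int fileIdx = i;
        tasks.add(() -> processFile.test(fileIdx));
      }
      boolean allOk = true;
      // invokeAll blocks until every submitted task has completed.
      for (Future<Boolean> result : pool.invokeAll(tasks)) {
        allOk &= result.get();
      }
      return allOk;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve the interrupt flag
      throw new IllegalStateException("Interrupted while processing files", e);
    } catch (ExecutionException e) {
      throw new IllegalStateException("A file task failed", e.getCause());
    } finally {
      pool.shutdown();
    }
  }
}
```

A bounded pool keeps the number of concurrently open files under control. Any shared state, such as a running total file size, would additionally need to be made thread-safe.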
TODO: Ideally, this class should not extend AbstractDataWriterTask. Soon, this MR job is going to be removed
and the handling will be moved to the VPJ driver.
-
Field Summary
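Modifier and Type  Field  Description
protected boolean  hasReportedFailure
protected InputDataInfoProvider.InputDataInfo  inputDataInfo
protected DefaultInputDataInfoProvider  inputDataInfoProvider
protected String  inputDirectory
protected Long  inputFileDataSize
protected boolean  isZstdDictCreationRequired
protected PushJobSetting  pushJobSetting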
Fields inherited from class com.linkedin.venice.hadoop.task.datawriter.AbstractDataWriterTask
TASK_ID_NOT_SET
-
Constructor Summary
-
Method Summary
Modifier and Type  Method  Description
protected void  checkLastModificationTimeAndLogError(Exception e, String errorString)
protected void  checkLastModificationTimeAndLogError(Exception exception, String errorString, org.apache.hadoop.mapred.Reporter reporter)
protected void  checkLastModificationTimeAndLogError(String errorString, org.apache.hadoop.mapred.Reporter reporter)
void  close()
void  configure(org.apache.hadoop.mapred.JobConf job)
protected void  configureTask(VeniceProperties props)  Allow implementations of this class to configure task-specific stuff.
protected void  initInputData(VeniceProperties props)
void  map(org.apache.hadoop.io.IntWritable inputKey, org.apache.hadoop.io.NullWritable inputValue, org.apache.hadoop.mapred.OutputCollector<org.apache.avro.mapred.AvroWrapper<org.apache.avro.specific.SpecificRecord>, org.apache.hadoop.io.NullWritable> output, org.apache.hadoop.mapred.Reporter reporter)
protected boolean  processInput(int fileIdx, org.apache.hadoop.mapred.Reporter reporter)  Validates this file's schema against the first file's schema, calculates the total file size, and collects a dictionary sample from this file if enabled.
Methods inherited from class com.linkedin.venice.hadoop.task.datawriter.AbstractDataWriterTask
configure, getEngineTaskConfigProvider, getPartitionCount, getTaskId, isChunkingEnabled, isRmdChunkingEnabled, setChunkingEnabled
-
Field Details
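-
inputDataInfoProvider
protected DefaultInputDataInfoProvider inputDataInfoProvider
-
inputDataInfo
protected InputDataInfoProvider.InputDataInfo inputDataInfo
-
pushJobSetting
protected PushJobSetting pushJobSetting
-
isZstdDictCreationRequired
protected boolean isZstdDictCreationRequired
-
hasReportedFailure
protected boolean hasReportedFailure
-
inputDirectory
protected String inputDirectory
-
inputFileDataSize
protected Long inputFileDataSize
-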
-
Constructor Details
-
ValidateSchemaAndBuildDictMapper
public ValidateSchemaAndBuildDictMapper()
-
-
Method Details
-
map
public void map(org.apache.hadoop.io.IntWritable inputKey, org.apache.hadoop.io.NullWritable inputValue, org.apache.hadoop.mapred.OutputCollector<org.apache.avro.mapred.AvroWrapper<org.apache.avro.specific.SpecificRecord>, org.apache.hadoop.io.NullWritable> output, org.apache.hadoop.mapred.Reporter reporter) throws IOException
- Specified by:
map in interface org.apache.hadoop.mapred.Mapper<org.apache.hadoop.io.IntWritable, org.apache.hadoop.io.NullWritable, org.apache.avro.mapred.AvroWrapper<org.apache.avro.specific.SpecificRecord>, org.apache.hadoop.io.NullWritable>
- Throws:
IOException
-
processInput
protected boolean processInput(int fileIdx, org.apache.hadoop.mapred.Reporter reporter) throws IOException
1. Validates this file's schema against the first file's schema.
2. Calculates the total file size.
3. Collects a sample for the dictionary from this file, if enabled.
- Throws:
IOException
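The three steps listed for processInput can be sketched in isolation as follows. This is a simplified illustration rather than the actual Venice code: the class ProcessInputSketch, its fields, and the use of a plain String for the schema are assumptions. The meaning of the boolean return value (false on a schema mismatch) is also assumed, since the documentation does not state it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: the class, fields, and method names are illustrative only.
class ProcessInputSketch {
  private String firstSchema;            // schema of the first file seen
  private long totalFileSize;            // running total, in bytes
  private final boolean collectSamples;  // whether dictionary sampling is enabled
  private final List<byte[]> samples = new ArrayList<>();

  ProcessInputSketch(boolean collectSamples) {
    this.collectSamples = collectSamples;
  }

  /** Returns false if the schema differs from the first file's schema (assumed semantics). */
  boolean processFile(String schema, byte[] content) {
    // 1. Validate this file's schema against the first file's schema.
    if (firstSchema == null) {
      firstSchema = schema;
    } else if (!firstSchema.equals(schema)) {
      return false;
    }
    // 2. Add this file's size to the running total.
    totalFileSize += content.length;
    // 3. Collect a sample for the dictionary, if enabled.
    if (collectSamples) {
      samples.add(content);
    }
    return true;
  }

  long totalFileSize() {
    return totalFileSize;
  }

  int sampleCount() {
    return samples.size();
  }
}
```

In this sketch a file that fails validation contributes neither to the total size nor to the dictionary samples. Whether the real mapper behaves the same way is not specified here.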
-
checkLastModificationTimeAndLogError
protected void checkLastModificationTimeAndLogError(Exception e, String errorString) throws IOException
- Throws:
IOException
-
checkLastModificationTimeAndLogError
protected void checkLastModificationTimeAndLogError(String errorString, org.apache.hadoop.mapred.Reporter reporter) throws IOException
- Throws:
IOException
-
checkLastModificationTimeAndLogError
protected void checkLastModificationTimeAndLogError(Exception exception, String errorString, org.apache.hadoop.mapred.Reporter reporter) throws IOException
- Throws:
IOException
-
initInputData
protected void initInputData(VeniceProperties props) throws Exception
- Throws:
Exception
-
configureTask
protected void configureTask(VeniceProperties props)
Description copied from class: AbstractDataWriterTask
Allow implementations of this class to configure task-specific stuff.
- Specified by:
configureTask in class AbstractDataWriterTask
- Parameters:
props - the job props that the task was configured with.
-
configure
public void configure(org.apache.hadoop.mapred.JobConf job)
- Specified by:
configure in interface org.apache.hadoop.mapred.JobConfigurable
-
close
public void close()
- Specified by:
close in interface AutoCloseable
- Specified by:
close in interface Closeable
-