Class StageMetrics

java.lang.Object
com.linkedin.venice.spark.datawriter.task.StageMetrics
All Implemented Interfaces:
Serializable

public class StageMetrics extends Object implements Serializable
Holds per-stage diagnostic accumulators for tracking record counts, byte sizes, and timing through the VPJ Spark pipeline. One instance per pipeline stage (e.g., ttl_filter, compaction, compression_reencode, kafka_write).

Accumulators are registered with the SparkContext and aggregated on the driver after dataFrame.count() triggers execution.
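A minimal driver-side sketch of that lifecycle. The job setup (session, input data, and stage name) is illustrative, not taken from VPJ:

    import org.apache.spark.api.java.function.MapFunction;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Encoders;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public final class StageMetricsExample {
      public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("stage-metrics-demo")  // illustrative
            .master("local[*]")             // local run for the sketch
            .getOrCreate();

        // The constructor registers the accumulators with the SparkContext.
        StageMetrics metrics = new StageMetrics(spark.sparkContext(), "ttl_filter");

        Dataset<Row> input = spark.range(1000).toDF("key");

        // StageMetrics is Serializable, so it can be captured by executor-side
        // lambdas and updated inside a transformation.
        Dataset<Long> out = input.map((MapFunction<Row, Long>) row -> {
          metrics.recordsIn.add(1);
          metrics.bytesIn.add(Long.BYTES);
          long value = row.getLong(0);
          metrics.recordsOut.add(1);
          metrics.bytesOut.add(Long.BYTES);
          return value;
        }, Encoders.LONG());

        // Values are only complete on the driver once an action has executed the stage.
        long rows = out.count();
        System.out.println(metrics.getStageName() + ": recordsIn=" + metrics.recordsIn.value()
            + ", recordsOut=" + metrics.recordsOut.value() + ", rows=" + rows);
        spark.close();
      }
    }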

Accuracy under speculative execution

Spark guarantees exactly-once accumulator semantics only for updates performed inside actions; updates made inside transformations, as these are, may be applied more than once when tasks are re-executed. Speculative execution (spark.speculation=true) is one such case: two task attempts for the same partition may both complete successfully, and Spark discards the output of the slower attempt but does not roll back its accumulator deltas, so both attempts' contributions are applied to the driver-side value. All counters (recordsIn, recordsOut, bytesIn, bytesOut, timeNs) may therefore be over-counted for any partition that was speculated, with the degree of inflation bounded by the number of speculated partitions.
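For reference, a sketch of how speculation is typically switched on; both properties are standard Spark settings, not anything specific to this class:

    import org.apache.spark.sql.SparkSession;

    public final class SpeculationConfigExample {
      public static void main(String[] args) {
        // With speculation enabled, a straggling task gets a duplicate attempt;
        // if both attempts succeed, accumulator deltas from both can reach the driver.
        SparkSession spark = SparkSession.builder()
            .appName("speculation-demo")                    // illustrative
            .master("local[*]")
            .config("spark.speculation", "true")            // enable speculative execution
            .config("spark.speculation.multiplier", "1.5")  // speculate tasks 1.5x slower than the median
            .getOrCreate();
        spark.close();
      }
    }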

These accumulators are intended for diagnostic purposes only (pipeline stage I/O ratios, throughput estimates, timing). They should not be used for correctness checks.
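As an illustration of that diagnostic use, a hypothetical helper (not part of this class) that derives a stage's I/O ratio and rough throughput from the accumulators:

    public final class StageDiagnostics {
      // Hypothetical helper: call on the driver after an action has run, since
      // the accumulator values are only complete once the stage has executed.
      public static void log(StageMetrics metrics) {
        long in = metrics.recordsIn.value();
        long out = metrics.recordsOut.value();
        double taskSeconds = metrics.timeNs.value() / 1e9;  // summed across all tasks
        double ioRatio = in == 0 ? 0.0 : (double) out / in;
        double bytesPerTaskSecond = taskSeconds == 0 ? 0.0 : metrics.bytesOut.value() / taskSeconds;
        System.out.printf("%s: out/in=%.3f, ~%.0f bytes per task-second%n",
            metrics.getStageName(), ioRatio, bytesPerTaskSecond);
      }
    }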

See Also:
  • Serialized Form

Field Details

  • recordsIn

    public final org.apache.spark.util.LongAccumulator recordsIn
    Count of records entering this stage.

  • recordsOut

    public final org.apache.spark.util.LongAccumulator recordsOut
    Count of records emitted by this stage.

  • bytesIn

    public final org.apache.spark.util.LongAccumulator bytesIn
    Total bytes of the records entering this stage.

  • bytesOut

    public final org.apache.spark.util.LongAccumulator bytesOut
    Total bytes of the records emitted by this stage.

  • timeNs

    public final org.apache.spark.util.LongAccumulator timeNs
    Cumulative processing time in nanoseconds, summed across all tasks of this stage.
Constructor Details

  • StageMetrics

    public StageMetrics(org.apache.spark.SparkContext sparkContext, String stageName)
    Creates this stage's accumulators and registers them with the given SparkContext.
    Parameters:
      sparkContext - the context with which the accumulators are registered
      stageName - the pipeline stage this instance tracks (e.g. ttl_filter, kafka_write)
Method Details

  • getStageName

    public String getStageName()
    Returns:
      the name of the pipeline stage this instance tracks