Class SparkPartitionUtils

java.lang.Object
com.linkedin.venice.spark.utils.SparkPartitionUtils

public final class SparkPartitionUtils extends Object
Spark's partitioning functionality in the Dataframe and Dataset APIs is not very flexible. This class provides additional flexibility by using the underlying RDD implementation.
  • Method Summary

    Modifier and Type
    Method
    Description
    static org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>
    repartitionAndSortWithinPartitions(org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> df, org.apache.spark.Partitioner partitioner, Comparator<org.apache.spark.sql.Row> comparator)
    This function provides the equivalent of JavaPairRDD.repartitionAndSortWithinPartitions(org.apache.spark.Partitioner) in Dataframe API.

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Method Details

    • repartitionAndSortWithinPartitions

      public static org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> repartitionAndSortWithinPartitions(org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> df, org.apache.spark.Partitioner partitioner, Comparator<org.apache.spark.sql.Row> comparator)
      This function provides the equivalent of JavaPairRDD.repartitionAndSortWithinPartitions(org.apache.spark.Partitioner) in the Dataframe API:
      1. Convert the Dataframe to a JavaPairRDD.
      2. Use JavaPairRDD.repartitionAndSortWithinPartitions(org.apache.spark.Partitioner) to partition the data and perform the primary and secondary sort.
      3. Convert the JavaPairRDD back to an RDD.
      4. Convert the RDD back to a Dataframe.
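
      To illustrate what "repartition and sort within partitions" means, here is a minimal plain-Java sketch (no Spark dependency; the class and method names are hypothetical, not part of this library). Records are modeled as int[] key/value pairs; a hash partitioner assigns each record to a bucket, and each bucket is then sorted with a comparator that performs a primary sort on the key and a secondary sort on the value, mirroring the role of the Partitioner and Comparator<Row> arguments above.

      ```java
      import java.util.ArrayList;
      import java.util.Arrays;
      import java.util.Comparator;
      import java.util.List;

      // Hypothetical sketch: repartition records into buckets by key hash,
      // then sort within each bucket (primary + secondary sort).
      public class RepartitionSortSketch {
          static List<List<int[]>> repartitionAndSort(
                  List<int[]> records, int numPartitions, Comparator<int[]> cmp) {
              List<List<int[]>> partitions = new ArrayList<>();
              for (int i = 0; i < numPartitions; i++) {
                  partitions.add(new ArrayList<>());
              }
              // "Repartition": route each record to a partition by key hash,
              // analogous to org.apache.spark.Partitioner.getPartition.
              for (int[] r : records) {
                  int p = Math.floorMod(Integer.hashCode(r[0]), numPartitions);
                  partitions.get(p).add(r);
              }
              // "Sort within partitions": each partition is sorted independently;
              // there is no global ordering across partitions.
              for (List<int[]> part : partitions) {
                  part.sort(cmp);
              }
              return partitions;
          }

          public static void main(String[] args) {
              List<int[]> recs = Arrays.asList(
                  new int[]{2, 9}, new int[]{1, 5}, new int[]{2, 3}, new int[]{1, 7});
              // Primary sort on the key (r[0]), secondary sort on the value (r[1]).
              Comparator<int[]> cmp = Comparator.<int[]>comparingInt(r -> r[0])
                                                .thenComparingInt(r -> r[1]);
              List<List<int[]>> parts = repartitionAndSort(recs, 2, cmp);
              for (List<int[]> part : parts) {
                  System.out.println(Arrays.deepToString(part.toArray()));
              }
          }
      }
      ```

      In the actual utility, the same effect is achieved on a Dataset<Row> by dropping to JavaPairRDD, where Spark exposes repartitionAndSortWithinPartitions directly, and then converting back, since the Dataframe API alone does not accept a custom Partitioner.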