Class SparkPartitionUtils


  • public final class SparkPartitionUtils
    extends java.lang.Object
    Spark's partitioning support in the DataFrame and Dataset APIs is limited. This class provides more flexible partitioning by dropping down to the underlying RDD implementation.
    • Method Summary

      Modifier and Type Method Description
      static org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> repartitionAndSortWithinPartitions​(org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> df, org.apache.spark.Partitioner partitioner, java.util.Comparator<org.apache.spark.sql.Row> comparator)
      This function provides the equivalent of JavaPairRDD.repartitionAndSortWithinPartitions(org.apache.spark.Partitioner) in the DataFrame API.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Method Detail

      • repartitionAndSortWithinPartitions

        public static org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> repartitionAndSortWithinPartitions​(org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> df,
                                                                                                                org.apache.spark.Partitioner partitioner,
                                                                                                                java.util.Comparator<org.apache.spark.sql.Row> comparator)
        This function provides the equivalent of JavaPairRDD.repartitionAndSortWithinPartitions(org.apache.spark.Partitioner) in the DataFrame API:
        1. Convert the DataFrame to a JavaPairRDD
        2. Use JavaPairRDD.repartitionAndSortWithinPartitions(org.apache.spark.Partitioner) to partition the data and perform the primary and secondary sort
        3. Convert the JavaPairRDD back to an RDD
        4. Convert the RDD back to a DataFrame
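        The four steps above can be sketched as follows. This is an illustrative sketch, not the actual implementation: the choice to key each pair by the full Row, the class name, and the schema handling are assumptions, and in practice the supplied Comparator must also be Serializable so it can be shipped to executors.

        ```java
        import java.util.Comparator;

        import org.apache.spark.Partitioner;
        import org.apache.spark.api.java.JavaPairRDD;
        import org.apache.spark.api.java.JavaRDD;
        import org.apache.spark.sql.Dataset;
        import org.apache.spark.sql.Row;
        import org.apache.spark.sql.SparkSession;
        import scala.Tuple2;

        // Hypothetical class name; a sketch of the four steps, not the real source.
        public final class RepartitionSketch {

            public static Dataset<Row> repartitionAndSortWithinPartitions(
                    Dataset<Row> df,
                    Partitioner partitioner,
                    Comparator<Row> comparator) {

                // 1. Convert the DataFrame to a JavaPairRDD, keying each pair by the
                //    full Row so the partitioner and comparator can see every column.
                JavaPairRDD<Row, Row> pairRdd = df.javaRDD()
                        .mapToPair(row -> new Tuple2<>(row, row));

                // 2. Partition the data, then apply the primary/secondary sort on the
                //    keys within each partition in a single shuffle.
                JavaPairRDD<Row, Row> sorted =
                        pairRdd.repartitionAndSortWithinPartitions(partitioner, comparator);

                // 3. Drop the key, leaving the sorted rows as a plain JavaRDD<Row>.
                JavaRDD<Row> rows = sorted.map(Tuple2::_2);

                // 4. Convert the RDD back to a DataFrame, reusing the original schema.
                SparkSession spark = df.sparkSession();
                return spark.createDataFrame(rows, df.schema());
            }
        }
        ```

        Keying by the whole Row (rather than a projected key column) keeps the method generic: the caller's Partitioner and Comparator decide which columns matter, at the cost of shipping each Row twice through the shuffle.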