Package com.linkedin.venice.spark.utils
Class SparkPartitionUtils
java.lang.Object
com.linkedin.venice.spark.utils.SparkPartitionUtils
The Spark Dataframe and Dataset APIs offer limited control over partitioning. This class provides more flexible partitioning by dropping down to the underlying RDD implementation.
Method Summary
Modifier and Type: static org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>
Method: repartitionAndSortWithinPartitions(org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> df, org.apache.spark.Partitioner partitioner, Comparator<org.apache.spark.sql.Row> comparator)
Description: Provides the equivalent of JavaPairRDD.repartitionAndSortWithinPartitions(org.apache.spark.Partitioner) in the Dataframe API.
Method Details
repartitionAndSortWithinPartitions
public static org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> repartitionAndSortWithinPartitions(org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> df, org.apache.spark.Partitioner partitioner, Comparator<org.apache.spark.sql.Row> comparator)

This function provides the equivalent of JavaPairRDD.repartitionAndSortWithinPartitions(org.apache.spark.Partitioner) in the Dataframe API:
1. Convert the Dataframe to a JavaPairRDD
2. Use JavaPairRDD.repartitionAndSortWithinPartitions(org.apache.spark.Partitioner) to partition and perform the primary and secondary sort
3. Convert the JavaPairRDD back to an RDD
4. Convert the RDD back to a Dataframe
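The semantics behind these steps can be illustrated with a small Spark-free sketch: route each record to a partition via a partitioner, then sort records within each partition (never globally). The String records, the length-parity partitioner, and the natural-order comparator below are simplified stand-ins for Spark's Row, Partitioner, and Comparator<Row>, not the actual classes used by this utility.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.ToIntFunction;

// A sketch of repartition-and-sort-within-partitions semantics.
// Stand-ins: String records instead of Row, a ToIntFunction instead of
// org.apache.spark.Partitioner.
public class RepartitionSortSketch {
    public static Map<Integer, List<String>> repartitionAndSortWithinPartitions(
            List<String> rows,
            ToIntFunction<String> partitioner,
            Comparator<String> comparator) {
        Map<Integer, List<String>> partitions = new TreeMap<>();
        // Repartition: assign each record to the partition chosen by the partitioner.
        for (String row : rows) {
            partitions.computeIfAbsent(partitioner.applyAsInt(row), p -> new ArrayList<>())
                      .add(row);
        }
        // Sort within partitions only; no ordering is imposed across partitions.
        for (List<String> partition : partitions.values()) {
            partition.sort(comparator);
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("cherry", "apple", "banana", "date");
        // Partition by string-length parity; sort alphabetically within each partition.
        Map<Integer, List<String>> result = repartitionAndSortWithinPartitions(
                rows, r -> r.length() % 2, Comparator.naturalOrder());
        System.out.println(result);
        // {0=[banana, cherry, date], 1=[apple]}
    }
}
```

In the real utility, step 2 hands this work to JavaPairRDD.repartitionAndSortWithinPartitions, which performs the shuffle and the per-partition sort in one pass rather than as two separate stages.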