Package com.linkedin.venice.spark.utils

Class SparkPartitionUtils

- java.lang.Object
  - com.linkedin.venice.spark.utils.SparkPartitionUtils

public final class SparkPartitionUtils extends java.lang.Object
Spark's partitioning functionality in the DataFrame and Dataset APIs is not very flexible. This class provides additional partitioning functionality by using the underlying RDD implementation.
Method Summary
static org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>
repartitionAndSortWithinPartitions(org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> df, org.apache.spark.Partitioner partitioner, java.util.Comparator<org.apache.spark.sql.Row> comparator)
Provides the equivalent of JavaPairRDD.repartitionAndSortWithinPartitions(org.apache.spark.Partitioner) in the DataFrame API.
Method Detail
repartitionAndSortWithinPartitions
public static org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> repartitionAndSortWithinPartitions(org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> df, org.apache.spark.Partitioner partitioner, java.util.Comparator<org.apache.spark.sql.Row> comparator)
This function provides the equivalent of JavaPairRDD.repartitionAndSortWithinPartitions(org.apache.spark.Partitioner) in the DataFrame API:
1. Convert the DataFrame to a JavaPairRDD
2. Use JavaPairRDD.repartitionAndSortWithinPartitions(org.apache.spark.Partitioner) to partition and perform the primary and secondary sort
3. Convert the JavaPairRDD to an RDD
4. Convert the RDD back to a DataFrame
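The steps above can be sketched from the caller's side. The following is a minimal, illustrative example, not part of the Venice code base: the partitioner class, the column layout, and the input/output paths are all assumptions. It assumes the keying scheme passes each Row to Partitioner.getPartition, and it requires Spark on the classpath to run.

```java
import java.io.Serializable;
import java.util.Comparator;

import org.apache.spark.Partitioner;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import com.linkedin.venice.spark.utils.SparkPartitionUtils;

// Hypothetical partitioner: routes each row by the hash of its first column.
class FirstColumnHashPartitioner extends Partitioner implements Serializable {
  private final int numPartitions;

  FirstColumnHashPartitioner(int numPartitions) {
    this.numPartitions = numPartitions;
  }

  @Override
  public int numPartitions() {
    return numPartitions;
  }

  @Override
  public int getPartition(Object key) {
    // Assumption: the key handed to the partitioner is the Row itself.
    Row row = (Row) key;
    return Math.floorMod(row.get(0).hashCode(), numPartitions);
  }
}

public class SparkPartitionUtilsExample {
  public static void main(String[] args) {
    SparkSession spark =
        SparkSession.builder().master("local[*]").appName("example").getOrCreate();

    // Hypothetical input: rows with a string key in column 0 and a string in column 1.
    Dataset<Row> df = spark.read().parquet("/tmp/input");

    // Secondary sort within each partition, ordering rows by the second column.
    // The intersection cast keeps the lambda serializable for Spark's shuffle.
    Comparator<Row> comparator =
        (Comparator<Row> & Serializable) (a, b) -> a.getString(1).compareTo(b.getString(1));

    Dataset<Row> partitionedAndSorted = SparkPartitionUtils.repartitionAndSortWithinPartitions(
        df, new FirstColumnHashPartitioner(4), comparator);

    partitionedAndSorted.write().parquet("/tmp/output");
    spark.stop();
  }
}
```

Because the work happens at the RDD layer, the resulting Dataset preserves the partition placement and within-partition order produced by step 2, which the DataFrame API's repartition/sortWithinPartitions pair cannot guarantee with a custom Partitioner.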