Package com.linkedin.venice.spark.utils
Class SparkPartitionUtils
java.lang.Object
com.linkedin.venice.spark.utils.SparkPartitionUtils
Spark's partitioning functionality in the Dataframe and Dataset APIs is not very flexible. This class provides additional partitioning functionality by dropping down to the underlying RDD implementation.
Method Summary
Modifier and Type: static org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>
Method: repartitionAndSortWithinPartitions(org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> df, org.apache.spark.Partitioner partitioner, Comparator<org.apache.spark.sql.Row> comparator)
Description: This function provides the equivalent of JavaPairRDD.repartitionAndSortWithinPartitions(org.apache.spark.Partitioner) in the Dataframe API.
Method Details
repartitionAndSortWithinPartitions
public static org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> repartitionAndSortWithinPartitions(org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> df, org.apache.spark.Partitioner partitioner, Comparator<org.apache.spark.sql.Row> comparator)
This function provides the equivalent of JavaPairRDD.repartitionAndSortWithinPartitions(org.apache.spark.Partitioner) in the Dataframe API:
1. Convert to JavaPairRDD
2. Use JavaPairRDD.repartitionAndSortWithinPartitions(org.apache.spark.Partitioner) to partition and perform the primary and secondary sort
3. Convert the JavaPairRDD to RDD
4. Convert the RDD back to a Dataframe
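The four-step round trip above can be sketched as follows. This is an illustrative sketch, not the actual Venice implementation: the class name `RepartitionSortSketch` is hypothetical, and keying each Row by itself (so the comparator sees the whole row) is one possible choice; a real caller might instead extract a dedicated partition/sort key.

```java
import java.util.Comparator;
import org.apache.spark.Partitioner;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.rdd.RDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import scala.Tuple2;

public final class RepartitionSortSketch {

  // Sketch of the documented 4-step flow. The supplied comparator defines the
  // primary and secondary sort order within each partition; it must be
  // Serializable, since it is shipped to the executors with the shuffle.
  public static Dataset<Row> repartitionAndSortWithinPartitions(
      Dataset<Row> df, Partitioner partitioner, Comparator<Row> comparator) {

    // 1. Convert the Dataframe to a JavaPairRDD. Each Row is used as its own
    //    key (with a null dummy value) so the comparator can inspect the
    //    full row during the shuffle sort.
    JavaPairRDD<Row, Object> pairRdd =
        df.toJavaRDD().mapToPair(row -> new Tuple2<>(row, null));

    // 2. Partition and sort within each partition in a single shuffle.
    JavaPairRDD<Row, Object> sorted =
        pairRdd.repartitionAndSortWithinPartitions(partitioner, comparator);

    // 3. Drop the dummy values and unwrap back to a plain RDD<Row>.
    RDD<Row> rowRdd = sorted.keys().rdd();

    // 4. Convert the RDD back to a Dataframe, reusing the original schema.
    SparkSession spark = df.sparkSession();
    return spark.createDataFrame(rowRdd, df.schema());
  }
}
```

Because the work happens at the RDD layer, the result is a new Dataframe whose partitioning and within-partition ordering follow the given Partitioner and Comparator, something the Dataframe API alone does not expose in one operation.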