Package com.linkedin.venice.spark.utils
Class SparkPartitionUtils
java.lang.Object
com.linkedin.venice.spark.utils.SparkPartitionUtils
The Spark Dataframe and Dataset APIs offer limited control over partitioning. This class provides more flexible partitioning by dropping down to the underlying RDD implementation.
Method Summary
Modifier and Type: static org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>
Method: repartitionAndSortWithinPartitions(org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> df, org.apache.spark.Partitioner partitioner, Comparator<org.apache.spark.sql.Row> comparator)
Description: Provides the equivalent of JavaPairRDD.repartitionAndSortWithinPartitions(org.apache.spark.Partitioner) in the Dataframe API.
Method Details
repartitionAndSortWithinPartitions
public static org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> repartitionAndSortWithinPartitions(org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> df, org.apache.spark.Partitioner partitioner, Comparator<org.apache.spark.sql.Row> comparator)

This function provides the equivalent of JavaPairRDD.repartitionAndSortWithinPartitions(org.apache.spark.Partitioner) in the Dataframe API:
1. Convert the Dataframe to a JavaPairRDD
2. Use JavaPairRDD.repartitionAndSortWithinPartitions(org.apache.spark.Partitioner) to partition and perform the primary and secondary sort
3. Convert the JavaPairRDD back to an RDD
4. Convert the RDD back to a Dataframe
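The semantics behind these steps can be illustrated with a small Spark-free sketch: route each record to a partition via a partitioner, then sort records within each partition (never globally). The String records, the length-parity partitioner, and the natural-order comparator below are simplified stand-ins for Spark's Row, Partitioner, and Comparator<Row>, not the actual classes used by this utility.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.ToIntFunction;

// A sketch of repartition-and-sort-within-partitions semantics.
// Stand-ins: String records instead of Row, a ToIntFunction instead of
// org.apache.spark.Partitioner.
public class RepartitionSortSketch {
    public static Map<Integer, List<String>> repartitionAndSortWithinPartitions(
            List<String> rows,
            ToIntFunction<String> partitioner,
            Comparator<String> comparator) {
        Map<Integer, List<String>> partitions = new TreeMap<>();
        // Repartition: assign each record to the partition chosen by the partitioner.
        for (String row : rows) {
            partitions.computeIfAbsent(partitioner.applyAsInt(row), p -> new ArrayList<>())
                      .add(row);
        }
        // Sort within partitions only; no ordering is imposed across partitions.
        for (List<String> partition : partitions.values()) {
            partition.sort(comparator);
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("cherry", "apple", "banana", "date");
        // Partition by string-length parity; sort alphabetically within each partition.
        Map<Integer, List<String>> result = repartitionAndSortWithinPartitions(
                rows, r -> r.length() % 2, Comparator.naturalOrder());
        System.out.println(result);
        // {0=[banana, cherry, date], 1=[apple]}
    }
}
```

In the real utility, step 2 hands this work to JavaPairRDD.repartitionAndSortWithinPartitions, which performs the shuffle and the per-partition sort in one pass rather than as two separate stages.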