
Enrique Benito Casado

Spark: Writing a Spark DataFrame with partitionBy. When to use it and when to avoid it

Having the following code:

sdf.write.partitionBy("day_of_insertion").format("delta").mode("append").save(path)

The partitioning is done on a column, but depending on that column's cardinality I suppose it could be more or less useful, or even useless.

I understand that using something like "user_id" as a partition column would make no sense and could even be detrimental, since there are as many user_ids as rows.

Below what ratio of distinct values to rows does partitioning become worthwhile? For example:

if I have a dataset with 10,000 rows and 1,000 distinct values of <attribute_of_partition> (so 10%). In our case <attribute_of_partition> = "day_of_insertion".
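As a rough sketch of the kind of check being asked about (this is a heuristic, not an official Spark rule; the function name, thresholds, and counts are all hypothetical), one could compare the column's cardinality against the row count before deciding to use partitionBy:

```python
def should_partition(row_count, distinct_values,
                     max_ratio=0.01, min_rows_per_partition=1000):
    """Hypothetical heuristic: partition only when the column's
    cardinality is a small fraction of the row count AND each
    partition directory would still hold a useful number of rows."""
    if row_count == 0 or distinct_values == 0:
        return False
    ratio = distinct_values / row_count
    rows_per_partition = row_count / distinct_values
    return ratio <= max_ratio and rows_per_partition >= min_rows_per_partition

# The two counts could be obtained from the DataFrame, e.g.:
# row_count = sdf.count()
# distinct_values = sdf.select("day_of_insertion").distinct().count()
```

Under these (assumed) thresholds, the 10,000-row / 1,000-distinct-values example (a 10% ratio, only 10 rows per partition) would be rejected, while a daily column on a large table (365 distinct values over millions of rows) would qualify.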

apache-spark

apache-spark-sql

partitioning
