Enrique Benito Casado
Spark: Writing a Spark DataFrame with partitionBy. When to use it and when to avoid it
Given the following code:
sdf.write.partitionBy("day_of_insertion").format("delta").mode("append").save(path)
The partitioning is done by a column, but depending on that column's cardinality I suppose it could be more or less useful, or bring no benefit at all.
I understand that using a column like "user_id" as the partition column would make no sense and could even be harmful, since there are as many distinct user_ids as there are rows.
Above what ratio of distinct values to total rows does partitioning stop being worthwhile? For example:
a dataset with 10,000 rows and 1,000 distinct values of <attribute_of_partition> (so 10%). In our case, <attribute_of_partition> = "day_of_insertion".
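To make the cardinality question concrete, here is a minimal sketch of the kind of rule of thumb often applied: partitionBy tends to pay off only when the column has few distinct values and each resulting partition still holds enough rows to avoid the small-files problem. The function name and the 100,000-rows-per-partition threshold are hypothetical illustrations, not Spark API or an official guideline.

```python
def partitioning_looks_reasonable(total_rows, distinct_values,
                                  min_rows_per_partition=100_000):
    """Hypothetical heuristic: partitioning by a column is only worthwhile
    when each partition stays large enough; many tiny partitions mean many
    tiny files and slow metadata-heavy reads. The threshold is an assumed
    illustrative value, not an official Spark recommendation."""
    if distinct_values <= 0:
        return False
    rows_per_partition = total_rows / distinct_values
    return rows_per_partition >= min_rows_per_partition

# The question's example: 10,000 rows and 1,000 distinct values gives
# only ~10 rows per partition, which is far too fragmented.
print(partitioning_looks_reasonable(10_000, 1_000))      # False

# By contrast, 100M rows over 365 days gives ~274k rows per partition.
print(partitioning_looks_reasonable(100_000_000, 365))   # True
```

By this yardstick, a percentage alone is not the deciding factor; what matters is the absolute size of each partition after the split.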
apache-spark
apache-spark-sql
partitioning