Enrique Benito Casado
Spark: Writing a Spark DataFrame with partitionBy. When to use it and when to avoid it
Given the following code:
sdf.write.partitionBy("day_of_insertion").format("delta").mode("append").save(path)
The partitioning is done by a column, but depending on that column's cardinality I suppose it could be more or less useful, or bring no benefit at all.
I understand that using a column like "user_id" as the partition column would make no sense and could even be harmful, since there are as many distinct user_ids as there are rows.
Above what ratio of distinct values to total rows does partitioning stop being worthwhile? For example:
a dataset with 10,000 rows and 1,000 distinct values of <attribute_of_partition> (so 10%). In our case, <attribute_of_partition> = "day_of_insertion".
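To make the cardinality question concrete, here is a minimal sketch of the kind of rule of thumb often applied: partitionBy tends to pay off only when the column has few distinct values and each resulting partition still holds enough rows to avoid the small-files problem. The function name and the 100,000-rows-per-partition threshold are hypothetical illustrations, not Spark API or an official guideline.

```python
def partitioning_looks_reasonable(total_rows, distinct_values,
                                  min_rows_per_partition=100_000):
    """Hypothetical heuristic: partitioning by a column is only worthwhile
    when each partition stays large enough; many tiny partitions mean many
    tiny files and slow metadata-heavy reads. The threshold is an assumed
    illustrative value, not an official Spark recommendation."""
    if distinct_values <= 0:
        return False
    rows_per_partition = total_rows / distinct_values
    return rows_per_partition >= min_rows_per_partition

# The question's example: 10,000 rows and 1,000 distinct values gives
# only ~10 rows per partition, which is far too fragmented.
print(partitioning_looks_reasonable(10_000, 1_000))      # False

# By contrast, 100M rows over 365 days gives ~274k rows per partition.
print(partitioning_looks_reasonable(100_000_000, 365))   # True
```

By this yardstick, a percentage alone is not the deciding factor; what matters is the absolute size of each partition after the split.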
apache-spark
apache-spark-sql
partitioning