java - Spark on YARN - saveAsTextFile() creating a lot of empty part files
I am running a Spark job on a Hadoop YARN cluster. I am using the saveAsTextFile() method to store an RDD as text files. I can see that more than 150 of the 250 part files created are empty. Is there a way to avoid this?
Each partition is written to its own file, and empty partitions are written as empty files. To avoid writing empty files, you can either coalesce or repartition your RDD to a smaller number of partitions before saving, e.g. rdd.coalesce(n).saveAsTextFile(path).
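To see why fewer partitions means fewer empty files, here is a minimal plain-Java sketch (no Spark dependency) that mimics the mechanism: each "partition" is written to its own part-NNNNN file, and a simplified stand-in for coalesce(n) merges partitions into n buckets. The class and helper names (PartitionFiles, save, coalesce) are illustrative, not Spark API.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class PartitionFiles {
    // Write each "partition" (a list of records) to its own part-NNNNN file,
    // mimicking how saveAsTextFile() emits one file per partition.
    static void save(List<List<String>> partitions, Path dir) throws IOException {
        Files.createDirectories(dir);
        for (int i = 0; i < partitions.size(); i++) {
            Files.write(dir.resolve(String.format("part-%05d", i)), partitions.get(i));
        }
    }

    // Simplified stand-in for coalesce(n): merge existing partitions into
    // n buckets, so no bucket corresponds to a still-empty source partition.
    static List<List<String>> coalesce(List<List<String>> partitions, int n) {
        List<List<String>> merged = new ArrayList<>();
        for (int i = 0; i < n; i++) merged.add(new ArrayList<>());
        for (int i = 0; i < partitions.size(); i++) {
            merged.get(i % n).addAll(partitions.get(i));
        }
        return merged;
    }

    // Count zero-byte files in a directory.
    static long emptyFiles(Path dir) throws IOException {
        try (var files = Files.list(dir)) {
            return files.filter(p -> {
                try { return Files.size(p) == 0; } catch (IOException e) { return true; }
            }).count();
        }
    }

    public static void main(String[] args) throws IOException {
        // 10 partitions, only 3 hold data -- the other 7 become empty files.
        List<List<String>> partitions = new ArrayList<>();
        for (int i = 0; i < 10; i++) partitions.add(new ArrayList<>());
        partitions.get(0).add("a");
        partitions.get(4).add("b");
        partitions.get(8).add("c");

        Path sparse = Files.createTempDirectory("sparse");
        save(partitions, sparse);
        System.out.println("empty files before coalesce: " + emptyFiles(sparse));

        Path dense = Files.createTempDirectory("dense");
        save(coalesce(partitions, 3), dense);
        System.out.println("empty files after coalesce: " + emptyFiles(dense));
    }
}
```

In real Spark, coalesce(n) similarly merges existing partitions without a full shuffle, while repartition(n) shuffles all the data; either way, only n output files are produced.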
If you didn't expect to have empty partitions, it may be worth investigating why you have them. Empty partitions can happen either because a filtering step removed all the elements from some partitions, or because of a bad hash function. If the hashCode() of your RDD's elements doesn't distribute the elements well, you can end up with an unbalanced RDD that has empty partitions.
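The hash-skew case can be demonstrated without Spark. The sketch below uses the same nonnegative-mod scheme a hash partitioner uses (partition = hashCode mod numPartitions) and compares a deliberately broken hashCode(), which returns a constant, against String's normal hashCode(). The BadKey class and helper names are hypothetical, for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

public class HashSkew {
    // Hash-partitioner-style assignment: nonnegative hashCode mod numPartitions.
    static int partitionOf(Object key, int numPartitions) {
        int mod = key.hashCode() % numPartitions;
        return mod < 0 ? mod + numPartitions : mod;
    }

    // A key type whose hashCode() ignores the key's contents: every instance
    // collides, so all records land in a single partition.
    static final class BadKey {
        final String id;
        BadKey(String id) { this.id = id; }
        @Override public int hashCode() { return 42; }
        @Override public boolean equals(Object o) {
            return o instanceof BadKey && ((BadKey) o).id.equals(id);
        }
    }

    // Count partitions that receive no keys at all.
    static int emptyPartitions(List<?> keys, int numPartitions) {
        boolean[] used = new boolean[numPartitions];
        for (Object k : keys) used[partitionOf(k, numPartitions)] = true;
        int empty = 0;
        for (boolean u : used) if (!u) empty++;
        return empty;
    }

    public static void main(String[] args) {
        int numPartitions = 8;
        List<BadKey> bad = new ArrayList<>();
        List<String> good = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            bad.add(new BadKey("key-" + i));
            good.add("key-" + i);
        }
        System.out.println("empty partitions with constant hashCode: "
                + emptyPartitions(bad, numPartitions));
        System.out.println("empty partitions with String hashCode: "
                + emptyPartitions(good, numPartitions));
    }
}
```

With the constant hashCode, 100 keys occupy 1 of 8 partitions and the other 7 stay empty; with String's hashCode the keys spread across all 8. The same effect in Spark leaves those empty partitions to be written out as empty part files.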