java - Spark on YARN - saveAsTextFile() method creating a lot of empty part files -


I am running a Spark job on a Hadoop YARN cluster.

I am using the saveAsTextFile() method to store an RDD as text files.

I can see that more than 150 of the 250 part files created are empty.

Is there a way to avoid this?

Each partition is written to its own file, so empty partitions are written out as empty files.

To avoid writing empty files, you can either coalesce() or repartition() the RDD into a smaller number of partitions before saving it.
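A minimal sketch of this, assuming a hypothetical input path, filter, and target partition count (adjust all three to your job):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class CoalesceBeforeSave {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("coalesce-before-save");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Hypothetical input and filter step; the filter is what can
        // leave some partitions with no elements at all.
        JavaRDD<String> lines = sc.textFile("hdfs:///input/path");
        JavaRDD<String> filtered = lines.filter(line -> !line.isEmpty());

        // coalesce(n) merges existing partitions without a full shuffle;
        // repartition(n) redistributes evenly but triggers a full shuffle.
        filtered.coalesce(50).saveAsTextFile("hdfs:///output/path");

        sc.stop();
    }
}
```

coalesce() is usually the cheaper choice when you are only reducing the partition count; use repartition() if the remaining partitions are badly skewed and you want them rebalanced.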

If you didn't expect to have empty partitions, it may be worth investigating why you have them. Empty partitions can happen either because a filtering step removed all the elements from some partitions, or because of a bad hash function. If hashCode() on the RDD's elements doesn't distribute the elements well, it's possible to end up with an unbalanced RDD that has empty partitions.
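To see how a bad key distribution produces empty partitions, here is a plain-Java sketch of the hash-partitioning idea (partition = hashCode mod numPartitions). The keys are hypothetical: multiples of 4 hashed into 4 partitions all collide into partition 0, leaving the other three empty:

```java
import java.util.Arrays;

public class SkewDemo {
    public static void main(String[] args) {
        int numPartitions = 4;
        int[] counts = new int[numPartitions];

        // Integer.hashCode(n) == n, so every multiple of 4
        // maps to partition 0 under (hash mod 4).
        for (int key = 0; key < 100; key += 4) {
            int partition = Math.floorMod(Integer.hashCode(key), numPartitions);
            counts[partition]++;
        }

        // All 25 keys land in one partition; the rest stay empty.
        System.out.println(Arrays.toString(counts)); // [25, 0, 0, 0]
    }
}
```

The same effect with real data (e.g. keys that share a common suffix or a poorly written custom hashCode()) would leave some of Spark's hash partitions empty, and saveAsTextFile() would then emit an empty part file for each of them.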

