java - Spark on YARN - saveAsTextFile() creating a lot of empty part files
I am running a Spark job on a Hadoop YARN cluster. I am using the saveAsTextFile() method to store an RDD as text files. I can see that more than 150 of the 250 part files created are empty. Is there a way to avoid this?
Each partition is written to its own file, and empty partitions are written as empty files. To avoid writing empty files, you can either coalesce or repartition your RDD to a smaller number of partitions before saving, e.g. rdd.coalesce(n).saveAsTextFile(path).
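To see why fewer partitions means fewer empty files, here is a minimal plain-Java sketch (no Spark dependency) that mimics the mechanism: each "partition" is written to its own part-NNNNN file, and a simplified stand-in for coalesce(n) merges partitions into n buckets. The class and helper names (PartitionFiles, save, coalesce) are illustrative, not Spark API.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class PartitionFiles {
    // Write each "partition" (a list of records) to its own part-NNNNN file,
    // mimicking how saveAsTextFile() emits one file per partition.
    static void save(List<List<String>> partitions, Path dir) throws IOException {
        Files.createDirectories(dir);
        for (int i = 0; i < partitions.size(); i++) {
            Files.write(dir.resolve(String.format("part-%05d", i)), partitions.get(i));
        }
    }

    // Simplified stand-in for coalesce(n): merge existing partitions into
    // n buckets, so no bucket corresponds to a still-empty source partition.
    static List<List<String>> coalesce(List<List<String>> partitions, int n) {
        List<List<String>> merged = new ArrayList<>();
        for (int i = 0; i < n; i++) merged.add(new ArrayList<>());
        for (int i = 0; i < partitions.size(); i++) {
            merged.get(i % n).addAll(partitions.get(i));
        }
        return merged;
    }

    // Count zero-byte files in a directory.
    static long emptyFiles(Path dir) throws IOException {
        try (var files = Files.list(dir)) {
            return files.filter(p -> {
                try { return Files.size(p) == 0; } catch (IOException e) { return true; }
            }).count();
        }
    }

    public static void main(String[] args) throws IOException {
        // 10 partitions, only 3 hold data -- the other 7 become empty files.
        List<List<String>> partitions = new ArrayList<>();
        for (int i = 0; i < 10; i++) partitions.add(new ArrayList<>());
        partitions.get(0).add("a");
        partitions.get(4).add("b");
        partitions.get(8).add("c");

        Path sparse = Files.createTempDirectory("sparse");
        save(partitions, sparse);
        System.out.println("empty files before coalesce: " + emptyFiles(sparse));

        Path dense = Files.createTempDirectory("dense");
        save(coalesce(partitions, 3), dense);
        System.out.println("empty files after coalesce: " + emptyFiles(dense));
    }
}
```

In real Spark, coalesce(n) similarly merges existing partitions without a full shuffle, while repartition(n) shuffles all the data; either way, only n output files are produced.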
If you didn't expect to have empty partitions, it may be worth investigating why you have them. Empty partitions can happen either because a filtering step removed all the elements from some partitions, or because of a bad hash function. If the hashCode() of your RDD's elements doesn't distribute the elements well, you can end up with an unbalanced RDD that has empty partitions.
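The hash-skew case can be demonstrated without Spark. The sketch below uses the same nonnegative-mod scheme a hash partitioner uses (partition = hashCode mod numPartitions) and compares a deliberately broken hashCode(), which returns a constant, against String's normal hashCode(). The BadKey class and helper names are hypothetical, for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

public class HashSkew {
    // Hash-partitioner-style assignment: nonnegative hashCode mod numPartitions.
    static int partitionOf(Object key, int numPartitions) {
        int mod = key.hashCode() % numPartitions;
        return mod < 0 ? mod + numPartitions : mod;
    }

    // A key type whose hashCode() ignores the key's contents: every instance
    // collides, so all records land in a single partition.
    static final class BadKey {
        final String id;
        BadKey(String id) { this.id = id; }
        @Override public int hashCode() { return 42; }
        @Override public boolean equals(Object o) {
            return o instanceof BadKey && ((BadKey) o).id.equals(id);
        }
    }

    // Count partitions that receive no keys at all.
    static int emptyPartitions(List<?> keys, int numPartitions) {
        boolean[] used = new boolean[numPartitions];
        for (Object k : keys) used[partitionOf(k, numPartitions)] = true;
        int empty = 0;
        for (boolean u : used) if (!u) empty++;
        return empty;
    }

    public static void main(String[] args) {
        int numPartitions = 8;
        List<BadKey> bad = new ArrayList<>();
        List<String> good = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            bad.add(new BadKey("key-" + i));
            good.add("key-" + i);
        }
        System.out.println("empty partitions with constant hashCode: "
                + emptyPartitions(bad, numPartitions));
        System.out.println("empty partitions with String hashCode: "
                + emptyPartitions(good, numPartitions));
    }
}
```

With the constant hashCode, 100 keys occupy 1 of 8 partitions and the other 7 stay empty; with String's hashCode the keys spread across all 8. The same effect in Spark leaves those empty partitions to be written out as empty part files.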