Spark: Force two RDD[Key, Value] with co-located partitions using a custom partitioner
I have two RDD[K, V], where K = Long and V = Object. Let's call them rdd1 and rdd2. They share a common custom partitioner. I am trying to find a way to take a union or a join while avoiding or minimizing data movement.
val kafkaRdd1 = /* Kafka source */
val kafkaRdd2 = /* Kafka source */

val rdd1 = kafkaRdd1.partitionBy(new MyCustomPartitioner(24))
val rdd2 = kafkaRdd2.partitionBy(new MyCustomPartitioner(24))

val rdd3 = rdd1.union(rdd2)          // without shuffle
val rdd4 = rdd1.leftOuterJoin(rdd2)  // without shuffle

Is it safe to assume (or is there a way to enforce) that the nth partition of both rdd1 and rdd2 ends up on the same slave node?
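For context, a custom partitioner in Spark extends org.apache.spark.Partitioner. The question does not show MyCustomPartitioner, so here is a minimal hypothetical sketch; the modulo-on-Long keying scheme is an assumption:

import org.apache.spark.Partitioner

// Hypothetical sketch of the custom partitioner referenced above.
// Assumes keys are Long and are mapped to partitions by modulo.
class MyCustomPartitioner(partitions: Int) extends Partitioner {
  require(partitions > 0)

  override def numPartitions: Int = partitions

  override def getPartition(key: Any): Int = {
    val k = key.asInstanceOf[Long]
    // Non-negative modulo so negative keys still map to a valid partition
    (((k % partitions) + partitions) % partitions).toInt
  }

  // Equality matters: Spark compares partitioners to decide whether a
  // shuffle can be skipped, so two instances with the same partition
  // count should compare equal.
  override def equals(other: Any): Boolean = other match {
    case p: MyCustomPartitioner => p.numPartitions == numPartitions
    case _ => false
  }

  override def hashCode: Int = numPartitions
}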
It is not possible to enforce* colocation in Spark; this mechanism is only used to minimize data movement. When a PartitionerAwareUnionRDD is created, the input RDDs are analyzed to choose optimal output locations based on the number of records per location. See the getPreferredLocations method for details.
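This is easy to observe on a local SparkContext: when both inputs share the same partitioner, union keeps that partitioner and the original partition count instead of concatenating partitions. A small sketch to verify this (the sample data and app name are made up, and it reuses the hypothetical MyCustomPartitioner sketched above):

import org.apache.spark.{SparkConf, SparkContext}

object UnionColocationCheck {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[*]").setAppName("union-check"))

    val partitioner = new MyCustomPartitioner(24)
    val rdd1 = sc.parallelize(Seq(1L -> "a", 2L -> "b")).partitionBy(partitioner)
    val rdd2 = sc.parallelize(Seq(3L -> "c", 4L -> "d")).partitionBy(partitioner)

    val unioned = rdd1.union(rdd2)
    // Same partitioner on both sides: union stays partitioner-aware.
    println(unioned.partitioner)       // Some(MyCustomPartitioner@...)
    println(unioned.getNumPartitions)  // 24, not 48

    // The lineage shows no extra ShuffledRDD above the union
    // when the partitioners match.
    println(unioned.toDebugString)

    sc.stop()
  }
}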
* According to High Performance Spark:

Two RDDs are colocated if they have the same partitioner and were shuffled as part of the same action.
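The same check works for the join case from the question: passing the shared partitioner to leftOuterJoin yields only narrow dependencies over the already-shuffled inputs, and materializing both shuffles in one action is what gives the scheduler the chance to colocate matching partitions. A hedged sketch, again reusing the hypothetical MyCustomPartitioner and made-up data:

import org.apache.spark.{SparkConf, SparkContext}

object JoinColocationCheck {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[*]").setAppName("join-check"))

    val partitioner = new MyCustomPartitioner(24)
    val rdd1 = sc.parallelize(Seq(1L -> "a", 2L -> "b")).partitionBy(partitioner)
    val rdd2 = sc.parallelize(Seq(1L -> "x", 3L -> "y")).partitionBy(partitioner)

    // Passing the shared partitioner makes the join a narrow
    // dependency: each output partition is built from the matching
    // input partitions, so no further shuffle is scheduled.
    val joined = rdd1.leftOuterJoin(rdd2, partitioner)

    // The lineage shows the cogroup sitting directly over the two
    // ShuffledRDDs from partitionBy; no shuffle is added for the join.
    println(joined.toDebugString)

    // Running both shuffles in the same action lets the scheduler
    // try to place matching partitions on the same executor.
    joined.collect().foreach(println)

    sc.stop()
  }
}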