Spark: Force two RDD[Key, Value] with co-located partitions using custom partitioner


I have two RDD[K, V]s, with K = Long and V = Object. Let's call them rdd1 and rdd2. Both use a common custom partitioner. I am trying to find a way to take a union or a join while avoiding or minimizing data movement.

val kafkaRdd1 = /* kafka sources */
val kafkaRdd2 = /* kafka sources */

val rdd1 = kafkaRdd1.partitionBy(new MyCustomPartitioner(24))
val rdd2 = kafkaRdd2.partitionBy(new MyCustomPartitioner(24))

val rdd3 = rdd1.union(rdd2)         // without shuffle
val rdd3 = rdd1.leftOuterJoin(rdd2) // without shuffle
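For reference, here is a hypothetical sketch of what MyCustomPartitioner might look like (its definition is not shown in the question), assuming it simply buckets Long keys by modulo. The equals override matters: Spark compares partitioners to decide whether a combining operation needs a shuffle.

import org.apache.spark.Partitioner

class MyCustomPartitioner(val numParts: Int) extends Partitioner {
  override def numPartitions: Int = numParts

  // Hypothetical bucketing: Long keys assigned by modulo, kept non-negative.
  override def getPartition(key: Any): Int = {
    val p = (key.asInstanceOf[Long] % numParts).toInt
    if (p < 0) p + numParts else p
  }

  // Spark uses equals/hashCode to decide whether two RDDs share a
  // partitioner, and hence whether union/join can skip the shuffle.
  override def equals(other: Any): Boolean = other match {
    case o: MyCustomPartitioner => o.numParts == numParts
    case _                      => false
  }

  override def hashCode: Int = numParts
}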

Is it safe to assume (or is there a way to enforce) that the nth partition of both rdd1 and rdd2 is kept on the same slave node?

It is not possible to enforce* colocation in Spark, but the method you use will minimize data movement. When a PartitionerAwareUnionRDD is created, the input RDDs are analyzed to choose optimal output locations based on the number of records per location. See the getPreferredLocations method for details.


* According to High Performance Spark:

Two RDDs are colocated if they have the same partitioner and were shuffled as part of the same action.
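A minimal sketch to verify this behavior, using a plain HashPartitioner in place of the custom one (the exact partitioner here is an assumption): when both inputs share the same partitioner, union yields a PartitionerAwareUnionRDD, and leftOuterJoin builds on narrow dependencies, so the join introduces no additional shuffle stage in the lineage.

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object ColocationCheck {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("colocation-check").setMaster("local[*]"))

    val partitioner = new HashPartitioner(24)
    val rdd1 = sc.parallelize(Seq((1L, "a"), (2L, "b"))).partitionBy(partitioner)
    val rdd2 = sc.parallelize(Seq((1L, "x"), (3L, "y"))).partitionBy(partitioner)

    // Same partitioner on both sides: union takes the partitioner-aware path.
    val unioned = rdd1.union(rdd2)
    println(unioned.getClass.getSimpleName) // PartitionerAwareUnionRDD
    println(unioned.partitioner)            // Some(...HashPartitioner...)

    // Co-partitioned join: the lineage shows no ShuffledRDD added by the join.
    val joined = rdd1.leftOuterJoin(rdd2)
    println(joined.toDebugString)

    sc.stop()
  }
}

Running this locally, the debug string still shows the ShuffledRDDs created by partitionBy itself, but nothing new above the CoGroupedRDD for the join.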

