Spark: Force two RDD[Key, Value] with co-located partitions using a custom partitioner
I have two RDD[K, V], where K = Long and V = Object. Let's call them rdd1 and rdd2. They share a common custom partitioner. I am trying to find a way to take a union or a join while avoiding or minimizing data movement.
val kafkaRdd1 = /* Kafka source */
val kafkaRdd2 = /* Kafka source */

val rdd1 = kafkaRdd1.partitionBy(new MyCustomPartitioner(24))
val rdd2 = kafkaRdd2.partitionBy(new MyCustomPartitioner(24))

val rdd3 = rdd1.union(rdd2)          // without shuffle
val rdd4 = rdd1.leftOuterJoin(rdd2)  // without shuffle

Is it safe to assume (or is there a way to enforce) that the nth partition of both rdd1 and rdd2 ends up on the same slave node?
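For context, a custom partitioner in Spark extends org.apache.spark.Partitioner. The question does not show MyCustomPartitioner, so here is a minimal hypothetical sketch; the modulo-on-Long keying scheme is an assumption:

import org.apache.spark.Partitioner

// Hypothetical sketch of the custom partitioner referenced above.
// Assumes keys are Long and are mapped to partitions by modulo.
class MyCustomPartitioner(partitions: Int) extends Partitioner {
  require(partitions > 0)

  override def numPartitions: Int = partitions

  override def getPartition(key: Any): Int = {
    val k = key.asInstanceOf[Long]
    // Non-negative modulo so negative keys still map to a valid partition
    (((k % partitions) + partitions) % partitions).toInt
  }

  // Equality matters: Spark compares partitioners to decide whether a
  // shuffle can be skipped, so two instances with the same partition
  // count should compare equal.
  override def equals(other: Any): Boolean = other match {
    case p: MyCustomPartitioner => p.numPartitions == numPartitions
    case _ => false
  }

  override def hashCode: Int = numPartitions
}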
It is not possible to enforce* colocation in Spark; this mechanism is only used to minimize data movement. When a PartitionerAwareUnionRDD is created, the input RDDs are analyzed to choose optimal output locations based on the number of records per location. See the getPreferredLocations method for details.
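This is easy to observe on a local SparkContext: when both inputs share the same partitioner, union keeps that partitioner and the original partition count instead of concatenating partitions. A small sketch to verify this (the sample data and app name are made up, and it reuses the hypothetical MyCustomPartitioner sketched above):

import org.apache.spark.{SparkConf, SparkContext}

object UnionColocationCheck {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[*]").setAppName("union-check"))

    val partitioner = new MyCustomPartitioner(24)
    val rdd1 = sc.parallelize(Seq(1L -> "a", 2L -> "b")).partitionBy(partitioner)
    val rdd2 = sc.parallelize(Seq(3L -> "c", 4L -> "d")).partitionBy(partitioner)

    val unioned = rdd1.union(rdd2)
    // Same partitioner on both sides: union stays partitioner-aware.
    println(unioned.partitioner)       // Some(MyCustomPartitioner@...)
    println(unioned.getNumPartitions)  // 24, not 48

    // The lineage shows no extra ShuffledRDD above the union
    // when the partitioners match.
    println(unioned.toDebugString)

    sc.stop()
  }
}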
* According to High Performance Spark:

Two RDDs are colocated if they have the same partitioner and were shuffled as part of the same action.
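The same check works for the join case from the question: passing the shared partitioner to leftOuterJoin yields only narrow dependencies over the already-shuffled inputs, and materializing both shuffles in one action is what gives the scheduler the chance to colocate matching partitions. A hedged sketch, again reusing the hypothetical MyCustomPartitioner and made-up data:

import org.apache.spark.{SparkConf, SparkContext}

object JoinColocationCheck {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[*]").setAppName("join-check"))

    val partitioner = new MyCustomPartitioner(24)
    val rdd1 = sc.parallelize(Seq(1L -> "a", 2L -> "b")).partitionBy(partitioner)
    val rdd2 = sc.parallelize(Seq(1L -> "x", 3L -> "y")).partitionBy(partitioner)

    // Passing the shared partitioner makes the join a narrow
    // dependency: each output partition is built from the matching
    // input partitions, so no further shuffle is scheduled.
    val joined = rdd1.leftOuterJoin(rdd2, partitioner)

    // The lineage shows the cogroup sitting directly over the two
    // ShuffledRDDs from partitionBy; no shuffle is added for the join.
    println(joined.toDebugString)

    // Running both shuffles in the same action lets the scheduler
    // try to place matching partitions on the same executor.
    joined.collect().foreach(println)

    sc.stop()
  }
}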