Spark: Force two RDD[Key, Value] with co-located partitions using custom partitioner
I have two RDD[K, V], where K = Long and V = Object; let's call them rdd1 and rdd2. Both use a common custom partitioner. I am trying to find a way to take a union or a join while avoiding or minimizing data movement.
    val kafkaRdd1 = /* Kafka sources */
    val kafkaRdd2 = /* Kafka sources */
    val rdd1 = kafkaRdd1.partitionBy(new MyCustomPartitioner(24))
    val rdd2 = kafkaRdd2.partitionBy(new MyCustomPartitioner(24))
    val rdd3 = rdd1.union(rdd2)          // without shuffle
    val rdd4 = rdd1.leftOuterJoin(rdd2)  // without shuffle
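For context, a minimal sketch of what such a custom partitioner could look like; the class shape and the modulo-on-key logic are assumptions, since the question only references MyCustomPartitioner by name:

    import org.apache.spark.Partitioner

    // Hypothetical sketch; the original question does not show the
    // partitioner's implementation.
    class MyCustomPartitioner(val partitions: Int) extends Partitioner {
      override def numPartitions: Int = partitions

      // Assumed: route each Long key to a partition by modulo.
      override def getPartition(key: Any): Int = {
        val k = key.asInstanceOf[Long]
        (k % partitions).toInt.abs
      }

      // Equality matters: Spark treats two RDDs as sharing a partitioner
      // only when their partitioners compare equal.
      override def equals(other: Any): Boolean = other match {
        case p: MyCustomPartitioner => p.numPartitions == numPartitions
        case _ => false
      }

      override def hashCode: Int = partitions
    }

The equals override is what lets Spark recognize rdd1 and rdd2 as partitioned the same way.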
Is it safe to assume (or is there a way to enforce) that the nth partition of both rdd1 and rdd2 will reside on the same worker node?
It is not possible to enforce* co-location in Spark, but there is a method in place to minimize data movement. When a PartitionerAwareUnionRDD is created, the input RDDs are analyzed to choose optimal output locations based on the number of records per location. See the getPreferredLocations method for details.
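Since both inputs here share an equal partitioner, rdd1.union(rdd2) should take the PartitionerAwareUnionRDD path. A quick sanity check (a sketch; the expected values assume the 24-partition setup from the question):

    // If the partitioners match, union builds a PartitionerAwareUnionRDD:
    // the result keeps the shared partitioner and its 24 partitions.
    // Otherwise Spark builds a plain UnionRDD with 24 + 24 = 48 partitions.
    val unioned = rdd1.union(rdd2)
    println(unioned.partitioner)       // Some(...) if partitioner-aware, else None
    println(unioned.getNumPartitions)  // 24 if partitioner-aware, else 48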
* According to High Performance Spark, two RDDs are co-located if they have the same partitioner and were shuffled as part of the same action.
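The same reasoning applies to the join: when both sides report an equal partitioner, the underlying cogroup has only narrow dependencies, so no shuffle stage is inserted. One way to confirm this (a sketch) is to inspect the lineage:

    // With matching partitioners the lineage shows a CoGroupedRDD with
    // narrow dependencies and no ShuffledRDD stage; a shuffle boundary
    // would appear in toDebugString if the partitioners differed.
    val joined = rdd1.leftOuterJoin(rdd2)
    println(joined.toDebugString)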