Scala - Write Parquet files from Spark RDD to dynamic folders
Given the following snippet (Spark version: 1.5.2):

    rdd.toDF().write.mode(SaveMode.Append).parquet(pathToStorage)
which saves the RDD data as flattened Parquet files, I would like the storage to have a structure like:

    country/year/yearmonth/yearmonthday/

The data contains a country column and a timestamp one, which is why I started with this method. However, since I only have a timestamp in the data, I can't partition the whole thing by year/yearmonth/yearmonthday, as those are not columns per se...

And this solution seemed pretty nice, except that I can't adapt it to Parquet files...

Any idea?
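For reference, here is a minimal, hypothetical setup under which the snippet above compiles; the Model case class, its field names, the sample data and pathToStorage are assumptions, not part of the original question:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.{SQLContext, SaveMode}

    // Hypothetical record type: the question only states that the data
    // has a country column and a timestamp column.
    case class Model(country: String, timestamp: Long, value: Double)

    val sc = new SparkContext(new SparkConf().setAppName("parquet-writer"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._ // needed for rdd.toDF() in Spark 1.5

    val pathToStorage = "/data/storage" // assumption: any HDFS or local path
    val rdd = sc.parallelize(Seq(Model("FR", 1455408000000L, 1.0)))

    rdd.toDF().write.mode(SaveMode.Append).parquet(pathToStorage)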
I figured it out. In order to have the path dynamically linked to the RDD, one first has to create a tuple RDD:

    rdd.map(model => (model.country, model))
Then the records have to be parsed to retrieve the distinct countries:

    val countries = rdd.map { case (country, model) => country }
      .distinct()
      .collect()
Now that the countries are known, the records can be written according to their distinct country:

    countries.map { country =>
      val countryRdd = rdd.filter { case (c, model) => c == country }
        .map(_._2)
      countryRdd.toDF().write.parquet(pathToStorage + "/" + country)
    }
Of course, the whole collection has to be parsed twice, but it is the best solution I have found so far.
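To soften that double pass, one option (a sketch on top of the answer, not part of it) is to cache the keyed RDD before collecting the distinct countries, so the per-country filters reuse the cached data instead of recomputing the source:

    // Assumption: rdd is the original RDD[Model] from the question.
    val keyedRdd = rdd.map(model => (model.country, model)).cache()

    val countries = keyedRdd.keys.distinct().collect()

    countries.foreach { country =>
      keyedRdd.filter { case (c, _) => c == country }
        .values
        .toDF()
        .write
        .parquet(pathToStorage + "/" + country)
    }

    keyedRdd.unpersist()

foreach is used here instead of map, since the result of each write is not needed.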
Regarding the timestamp, you just have to do the same process with a 3-tuple (the third element being a date key such as 20160214); I finally went with the current timestamp.
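A sketch of that 3-tuple variant could look as follows; the yyyyMMdd formatting and the assumption that the timestamp is in epoch milliseconds are mine, not the author's:

    import java.text.SimpleDateFormat
    import java.util.Date

    // Derive a day key such as "20160214" from the timestamp (assumed epoch millis).
    def dayKey(ts: Long): String = new SimpleDateFormat("yyyyMMdd").format(new Date(ts))

    // Assumption: rdd is the original RDD[Model] from the question.
    val keyed = rdd.map(model => (model.country, dayKey(model.timestamp), model)).cache()

    // Collect the distinct (country, day) pairs, then write one folder per pair.
    val partitions = keyed.map { case (country, day, _) => (country, day) }.distinct().collect()

    partitions.foreach { case (country, day) =>
      keyed.filter { case (c, d, _) => c == country && d == day }
        .map(_._3)
        .toDF()
        .write
        .parquet(pathToStorage + "/" + country + "/" + day)
    }

    keyed.unpersist()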