scala - Write Parquet files from Spark RDD to dynamic folders


Given the following snippet (Spark version: 1.5.2):

rdd.toDF().write.mode(SaveMode.Append).parquet(pathToStorage)

which saves the RDD data as flattened Parquet files, the storage should have a structure like:

country/
    year/
        yearMonth/
            yearMonthDay/
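So a given record would end up under a path like this (country code and date are just examples):

pathToStorage/FR/2016/201602/20160214/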

The data contains a country column and a timestamp one, which is why I started with this method. However, since I only have a timestamp in the data, I can't partition the whole thing by year/yearMonth/yearMonthDay, as those are not columns per se...

And this solution seemed pretty nice, except that I can't adapt it to Parquet files...

Any idea?

I figured it out. In order for the path to be dynamically linked to the RDD, one first has to create a tuple RDD:

val keyedRdd = rdd.map(model => (model.country, model))
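For reference, the snippets here assume roughly the following Spark 1.5.x setup and a model along these lines (the case class, its field names and the sample data are made up; only country and timestamp matter):

import org.apache.spark.sql.{SQLContext, SaveMode}

val sqlContext = new SQLContext(sc)  // sc: the existing SparkContext
import sqlContext.implicits._        // brings rdd.toDF() into scope

case class Model(country: String, timestamp: Long)  // other fields omitted

// hypothetical input, just to make the example self-contained
val rdd = sc.parallelize(Seq(
  Model("FR", 1455408000000L),
  Model("US", 1455494400000L)
))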

Then, all the records have to be parsed to retrieve the distinct countries:

val countries = keyedRdd.map {
        case (country, model) => country
    }
    .distinct()
    .collect()
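With the hypothetical sample data above, this would simply yield (ordering not guaranteed):

countries: Array[String] = Array(FR, US)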

Now that the countries are known, the records can be written out according to each distinct country:

countries.map {
    country => {
        val countryRdd = keyedRdd.filter {
                case (c, model) => c == country
            }
            .map(_._2)
        countryRdd.toDF().write.parquet(pathToStorage + "/" + country)
    }
}
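Each country then ends up in its own subfolder and can be read back independently, for example (hypothetical country code):

val frDf = sqlContext.read.parquet(pathToStorage + "/FR")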

Of course, the whole collection has to be parsed twice, but it is the only solution I have found so far.

Regarding the timestamp, you just have to do the same process with a 3-tuple (the third being 20160214); I went with the current timestamp in the end.
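A minimal sketch of that 3-tuple variant, assuming the timestamp is epoch milliseconds and reusing the hypothetical names from above (dayOf, pathToStorage and the yyyyMMdd formatting are assumptions, not the original code):

import java.text.SimpleDateFormat
import java.util.Date

// note: SimpleDateFormat uses the JVM default time zone
def dayOf(timestamp: Long): String =
  new SimpleDateFormat("yyyyMMdd").format(new Date(timestamp))

val keyedByDay = rdd.map(model => (model.country, dayOf(model.timestamp), model))

val partitions = keyedByDay.map { case (country, day, _) => (country, day) }
  .distinct()
  .collect()

partitions.foreach { case (country, day) =>
  val subRdd = keyedByDay.filter { case (c, d, _) => c == country && d == day }
    .map(_._3)
  // "20160214" -> "2016/201602/20160214"
  val datePath = s"${day.substring(0, 4)}/${day.substring(0, 6)}/$day"
  subRdd.toDF().write.mode(SaveMode.Append).parquet(s"$pathToStorage/$country/$datePath")
}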

