Scala - Write Parquet files from Spark RDD to dynamic folders
Given the following snippet (Spark version: 1.5.2):

    rdd.toDF().write.mode(SaveMode.Append).parquet(pathToStorage)
which saves the RDD data as flattened Parquet files, I would like the storage to have a structure like:

    country/year/yearmonth/yearmonthday/

The data contains a country column and a timestamp one, which is why I started with this method. However, since I only have a timestamp in the data, I can't partition the whole thing by year/yearmonth/yearmonthday, as those are not columns per se...

And this solution seemed pretty nice, except that I can't adapt it to Parquet files...

Any idea?
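For reference, here is a minimal, hypothetical setup under which the snippet above compiles; the Model case class, its field names, the sample data and pathToStorage are assumptions, not part of the original question:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.{SQLContext, SaveMode}

    // Hypothetical record type: the question only states that the data
    // has a country column and a timestamp column.
    case class Model(country: String, timestamp: Long, value: Double)

    val sc = new SparkContext(new SparkConf().setAppName("parquet-writer"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._ // needed for rdd.toDF() in Spark 1.5

    val pathToStorage = "/data/storage" // assumption: any HDFS or local path
    val rdd = sc.parallelize(Seq(Model("FR", 1455408000000L, 1.0)))

    rdd.toDF().write.mode(SaveMode.Append).parquet(pathToStorage)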
I figured it out. In order to have the path dynamically linked to the RDD, one first has to create a tuple RDD:

    rdd.map(model => (model.country, model))
Then the records have to be parsed to retrieve the distinct countries:

    val countries = rdd.map { case (country, model) => country }
      .distinct()
      .collect()
Now that the countries are known, the records can be written according to their distinct country:

    countries.map { country =>
      val countryRdd = rdd.filter { case (c, model) => c == country }
        .map(_._2)
      countryRdd.toDF().write.parquet(pathToStorage + "/" + country)
    }
Of course, the whole collection has to be parsed twice, but it is the best solution I have found so far.
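To soften that double pass, one option (a sketch on top of the answer, not part of it) is to cache the keyed RDD before collecting the distinct countries, so the per-country filters reuse the cached data instead of recomputing the source:

    // Assumption: rdd is the original RDD[Model] from the question.
    val keyedRdd = rdd.map(model => (model.country, model)).cache()

    val countries = keyedRdd.keys.distinct().collect()

    countries.foreach { country =>
      keyedRdd.filter { case (c, _) => c == country }
        .values
        .toDF()
        .write
        .parquet(pathToStorage + "/" + country)
    }

    keyedRdd.unpersist()

foreach is used here instead of map, since the result of each write is not needed.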
Regarding the timestamp, you just have to do the same process with a 3-tuple (the third element being a date key such as 20160214); I finally went with the current timestamp.
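A sketch of that 3-tuple variant could look as follows; the yyyyMMdd formatting and the assumption that the timestamp is in epoch milliseconds are mine, not the author's:

    import java.text.SimpleDateFormat
    import java.util.Date

    // Derive a day key such as "20160214" from the timestamp (assumed epoch millis).
    def dayKey(ts: Long): String = new SimpleDateFormat("yyyyMMdd").format(new Date(ts))

    // Assumption: rdd is the original RDD[Model] from the question.
    val keyed = rdd.map(model => (model.country, dayKey(model.timestamp), model)).cache()

    // Collect the distinct (country, day) pairs, then write one folder per pair.
    val partitions = keyed.map { case (country, day, _) => (country, day) }.distinct().collect()

    partitions.foreach { case (country, day) =>
      keyed.filter { case (c, d, _) => c == country && d == day }
        .map(_._3)
        .toDF()
        .write
        .parquet(pathToStorage + "/" + country + "/" + day)
    }

    keyed.unpersist()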