Writing Spark RDD as a Gzipped file in Amazon S3


I have an output RDD in Spark code written in Python, and I want to save it to Amazon S3 as a gzipped file. I have tried the following functions. The function below correctly saves the output RDD to S3, but not in gzipped format.

output_rdd.saveAsTextFile("s3://<name-of-bucket>/")

The function below returns the error: TypeError: saveAsHadoopFile() takes at least 3 arguments (3 given)

output_rdd.saveAsHadoopFile("s3://<name-of-bucket>/",
                            compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec")

Please guide me on the correct way to do this.

You need to specify the output format as well.

Try this:

output_rdd.saveAsHadoopFile("s3://<name-of-bucket>/",
                            "org.apache.hadoop.mapred.TextOutputFormat",
                            compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec")

You can use any of the Hadoop-supported compression codecs, listed below with a fuller sketch after the list:

  • Gzip: org.apache.hadoop.io.compress.GzipCodec
  • BZip2: org.apache.hadoop.io.compress.BZip2Codec
  • LZO: com.hadoop.compression.lzo.LzopCodec
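
Note that saveAsHadoopFile works on an RDD of key-value pairs, so an RDD of plain strings has to be paired up first. The snippet below is a minimal sketch rather than a drop-in solution: it assumes an existing SparkContext named sc, uses a placeholder bucket name and some example data in place of output_rdd, and pairs each line with a None key so that TextOutputFormat writes only the value on each output line.

from pyspark import SparkContext

sc = SparkContext(appName="gzip-to-s3-sketch")  # assumed setup; reuse your existing context if you have one

# example data standing in for the real output_rdd
lines = sc.parallelize(["first line", "second line", "third line"])

# saveAsHadoopFile needs (key, value) pairs; a None key becomes NullWritable,
# so TextOutputFormat emits just the value on each line
pairs = lines.map(lambda line: (None, line))

pairs.saveAsHadoopFile(
    "s3://<name-of-bucket>/output",                                   # placeholder path
    "org.apache.hadoop.mapred.TextOutputFormat",                      # old-API text output format
    compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec",  # produces .gz part files
)

On newer PySpark versions, saveAsTextFile also accepts a compressionCodecClass argument, which achieves the same result without converting to a pair RDD.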
