Writing Spark RDD as a Gzipped file in Amazon S3


I have an output RDD in Spark code written in Python, and I want to save it to Amazon S3 as a gzipped file. I have tried the following functions. The function below correctly saves the output RDD to S3, but not in gzipped format.

output_rdd.saveAsTextFile("s3://<name-of-bucket>/")

The function below returns the error: TypeError: saveAsHadoopFile() takes at least 3 arguments (3 given)

output_rdd.saveAsHadoopFile("s3://<name-of-bucket>/",
                            compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec")

Please guide me on the correct way to do this.

You need to specify the output format as well.

Try this:

output_rdd.saveAsHadoopFile("s3://<name-of-bucket>/",
                            "org.apache.hadoop.mapred.TextOutputFormat",
                            compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec")

You can use any of the Hadoop-supported compression codecs, listed below with a fuller sketch after the list:

  • Gzip: org.apache.hadoop.io.compress.GzipCodec
  • BZip2: org.apache.hadoop.io.compress.BZip2Codec
  • LZO: com.hadoop.compression.lzo.LzopCodec
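
Note that saveAsHadoopFile works on an RDD of key-value pairs, so an RDD of plain strings has to be paired up first. The snippet below is a minimal sketch rather than a drop-in solution: it assumes an existing SparkContext named sc, uses a placeholder bucket name and some example data in place of output_rdd, and pairs each line with a None key so that TextOutputFormat writes only the value on each output line.

from pyspark import SparkContext

sc = SparkContext(appName="gzip-to-s3-sketch")  # assumed setup; reuse your existing context if you have one

# example data standing in for the real output_rdd
lines = sc.parallelize(["first line", "second line", "third line"])

# saveAsHadoopFile needs (key, value) pairs; a None key becomes NullWritable,
# so TextOutputFormat emits just the value on each line
pairs = lines.map(lambda line: (None, line))

pairs.saveAsHadoopFile(
    "s3://<name-of-bucket>/output",                                   # placeholder path
    "org.apache.hadoop.mapred.TextOutputFormat",                      # old-API text output format
    compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec",  # produces .gz part files
)

On newer PySpark versions, saveAsTextFile also accepts a compressionCodecClass argument, which achieves the same result without converting to a pair RDD.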
