Distribute web-scraping write-to-file to parallel processes in Python?


I'm scraping JSON data from a website, and I need to do it ~50,000 times (all the data is for distinct zip codes over a 3-year period). I timed the program for 1,000 calls, and the average time per call was 0.25 seconds, which leaves me with about 3.5 hours of runtime for the whole range (all 50,000 calls).

How can I distribute this process across all of my cores? The core of the code is pretty much this:

with open("u:/dailyweather.txt", "r+") as f:
    f.write("var1\tvar2\tvar3\tvar4\tvar5\tvar6\tvar7\tvar8\tvar9\n")
    writedata(zips, zip_weather_links, daypart)

where writedata() looks like this:

def writedata(zipcodes, links, dayparttime):
    for z in zipcodes:
        for pair in links:
            ## logic ##
            f.write("%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n" % (var1, var2, var3, var4, var5,
                                                              var6, var7, var8, var9))

zips looks like this:

zips = ['55111', '56789', '68111', ...] 

and zip_weather_links is a dictionary of (url, date) tuples for each zip code:

zip_weather_links['55111'] = [('https://website.com/55111/data', datetime.datetime(2013, 1, 1, 0, 0, 0)), ...]

How can I distribute this using Pool or multiprocessing? Or will distribution even save any time?

You want to "distribute web-scraping write-to-file to parallel processes in Python". To start, let's look at where the time is spent in web scraping.

The latency of HTTP requests is much higher than that of hard disks (link: latency comparison). Small writes to a hard disk are slower than bigger writes; SSDs have a much higher random-write speed, so this effect doesn't hurt them as much. So the approach is:

  1. Distribute the HTTP requests across parallel workers
  2. Collect the results
  3. Write the results to disk all at once (a minimal sketch of these steps follows below)
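
If you want to stay with the standard library, those three steps map directly onto multiprocessing.Pool. This is only a minimal sketch, not your exact program: fetch_one and the worklist URLs are hypothetical stand-ins for your per-zip-code requests, and you would build the output lines from your real variables.

import multiprocessing
import requests

def fetch_one(url):
    # step 1: each worker fetches one URL
    r = requests.get(url)
    if r.status_code != 200:
        return (url, r.status_code)
    return (url, r.json())

if __name__ == "__main__":
    worklist = ['http://xkcd.com/614/info.0.json',
                'http://xkcd.com/613/info.0.json']

    # steps 1 and 2: distribute the requests and collect the results
    # (the work is I/O-bound, so more workers than cores is fine)
    with multiprocessing.Pool(processes=20) as pool:
        results = pool.map(fetch_one, worklist)

    # step 3: write all results to disk at once
    with open("u:/dailyweather.txt", "w") as f:
        for url, payload in results:
            f.write("%s\t%s\n" % (url, payload))

Because the work is I/O-bound rather than CPU-bound, a thread pool (multiprocessing.dummy.Pool or concurrent.futures.ThreadPoolExecutor) would work just as well and avoids the process start-up cost.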

Some example code using IPython Parallel:

from ipyparallel import Client
import requests

rc = Client()
lview = rc.load_balanced_view()

worklist = ['http://xkcd.com/614/info.0.json',
            'http://xkcd.com/613/info.0.json']

@lview.parallel()
def get_webdata(w):
    import requests
    r = requests.get(w)
    if not r.status_code == 200:
        return (w, r.status_code,)
    return (w, r.json(),)

# get_webdata is called once for every element of worklist
proc = get_webdata.map(worklist)
results = proc.get()  # results is a list of the return values
print(results[1])
# TODO: write results to disk

You have to start the IPython Parallel workers first:

(py35)river:~ rene$ ipcluster start -n 20      
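
To fill in the TODO above, one option is to build all the output lines in memory and write them in a single call, so the disk sees one big write instead of ~50,000 small ones. A sketch, assuming each result is the (url, json) tuple returned by get_webdata:

# build every output line first, then write them all in one call
lines = []
for url, payload in results:
    lines.append("%s\t%s\n" % (url, payload))

with open("u:/dailyweather.txt", "w") as f:
    f.write("var1\tvar2\tvar3\tvar4\tvar5\tvar6\tvar7\tvar8\tvar9\n")
    f.writelines(lines)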
