Distribute web-scraping write-to-file to parallel processes in Python?
I'm scraping JSON data from a website, and I need to do it ~50,000 times (all the data is for distinct zip codes over a 3-year period). I timed the program for 1,000 calls, and the average time per call was 0.25 seconds, which leaves me with about 3.5 hours of runtime for the whole range (all 50,000 calls).
How can I distribute this process across all of my cores? The core of the code is pretty much this:
with open("u:/dailyweather.txt", "r+") as f:
    f.write("var1\tvar2\tvar3\tvar4\tvar5\tvar6\tvar7\tvar8\tvar9\n")
    writedata(zips, zip_weather_links, daypart)
where writedata() looks like this:
def writedata(zipcodes, links, dayparttime):
    for z in zipcodes:
        for pair in links:
            ## logic ##
            f.write("%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n" %
                    (var1, var2, var3, var4, var5, var6, var7, var8, var9))
zips looks like this:
zips = ['55111', '56789', '68111', ...]
and zip_weather_links is a dictionary of (url, date) pairs for each zip code:
zip_weather_links['55111'] = [('https://website.com/55111/data', datetime.datetime(2013, 1, 1, 0, 0, 0)), ...]
How can I distribute this using Pool or multiprocessing? Or would distribution even save time?
You want to "distribute web-scraping write-to-file to parallel processes in Python". To start, let's look at where the time goes: the web-scraping itself.
The latency of HTTP requests is much higher than that of hard disks (link: latency comparison). Many small writes to a hard disk are slower than fewer, bigger writes. SSDs have a higher random write speed, so this effect doesn't hurt them as much.
So the plan is:
- distribute the HTTP requests
- collect the results
- write the results to disk all at once
Some example code using ipyparallel (IPython parallel):
from ipyparallel import Client
import requests

rc = Client()
lview = rc.load_balanced_view()

worklist = ['http://xkcd.com/614/info.0.json',
            'http://xkcd.com/613/info.0.json']

@lview.parallel()
def get_webdata(w):
    import requests
    r = requests.get(w)
    if not r.status_code == 200:
        return (w, r.status_code,)
    return (w, r.json(),)

# get_webdata is called once for every element of worklist
proc = get_webdata.map(worklist)
results = proc.get()
# results is a list of the return values
print(results[1])
# TODO: write the results to disk
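For that final TODO, the point of the latency note above is to do the disk write in one pass rather than ~50,000 tiny writes. A minimal sketch, assuming you have already turned the JSON responses into a list of 9-tuples called rows (building rows from results depends on your ## logic ## and is not shown):

# Sketch only: write everything in one pass instead of many small writes.
# `rows` is assumed to be a list of 9-tuples extracted from the JSON results.
with open("u:/dailyweather.txt", "w") as f:
    f.write("var1\tvar2\tvar3\tvar4\tvar5\tvar6\tvar7\tvar8\tvar9\n")
    f.writelines("\t".join(str(v) for v in row) + "\n" for row in rows)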
You have to start the IPython parallel workers first:
(py35)river:~ rene$ ipcluster start -n 20
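If you would rather stay in the standard library, the same distribute-collect-write pattern also works with multiprocessing.Pool, which you asked about. This is only a rough sketch under assumptions: the fetch() worker and the (zipcode, url, date) job layout are made up here, and the small zips / zip_weather_links values are stand-ins for your real data. Since the work is I/O-bound, a thread pool would do just as well as processes.

from multiprocessing import Pool
import datetime
import requests

# Stand-ins for the question's data structures (replace with your real ones).
zips = ['55111', '56789']
zip_weather_links = {
    '55111': [('https://website.com/55111/data', datetime.datetime(2013, 1, 1))],
    '56789': [('https://website.com/56789/data', datetime.datetime(2013, 1, 1))],
}

def fetch(job):
    # job is a (zipcode, url, date) tuple, built by flattening the dict below
    zipcode, url, date = job
    r = requests.get(url)
    if r.status_code != 200:
        return (zipcode, date, None)
    return (zipcode, date, r.json())

if __name__ == "__main__":
    # Flatten the per-zip dictionary into one flat list of jobs.
    jobs = [(z, url, date)
            for z in zips
            for (url, date) in zip_weather_links[z]]

    # 20 worker processes fetch in parallel; collect all results, then
    # write them to disk in one pass as in the sketch above.
    with Pool(processes=20) as pool:
        results = pool.map(fetch, jobs)

As for whether distribution saves time: with calls averaging ~0.25 s and the time dominated by network latency, running 20 requests at a time should cut the wall-clock time roughly in proportion to the number of workers, as long as the site doesn't rate-limit you.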