multithreading - Scraping with a multithreaded queue + urllib3 suffers a drastic slowdown
I am trying to scrape a huge number of URLs (approximately 3 million) that contain JSON-formatted data in the shortest time possible. To achieve this, I have Python code (Python 3) that uses Queue, multithreading, and urllib3. Everything works fine during the first 3 minutes, then the code begins to slow down, and then it appears to be totally stuck. I have read everything I could find on this issue, but unfortunately the solution seems to require knowledge that lies far beyond me.
I tried to limit the number of threads: it did not fix anything. I also tried to limit the maxsize of my queue and to change the socket timeout, but that did not help either. The remote server is not blocking me nor blacklisting me, as I am able to re-launch my script any time I want with good results in the beginning (the code starts to slow down at a pretty random time). Besides, sometimes my internet connection seems to be cut, as I cannot browse any website, but this specific issue does not appear every time.
Here is my code (go easy on me please, I'm a beginner):
    #!/usr/bin/env python
    import urllib3, json, csv
    from queue import Queue
    from threading import Thread

    csvfile = open("x.csv", 'wt', newline="")
    writer = csv.writer(csvfile, delimiter=";")
    writer.writerow(('a', 'b', 'c', 'd'))

    def do_stuff(q):
        # One connection pool per worker thread, all pointing at the same host.
        http = urllib3.connectionpool.connection_from_url(
            'http://www.xxyx.com/', maxsize=30, timeout=20, block=True)
        while True:
            try:
                url = q.get()
                url1 = http.request('GET', url)
                doc = json.loads(url1.data.decode('utf8'))
                writer.writerow((doc['a'], doc['b'], doc['c'], doc['d']))
            except Exception:
                # Print the URL that failed so it can be retried later.
                print(url)
            finally:
                q.task_done()

    q = Queue(maxsize=200)
    num_threads = 15

    for i in range(num_threads):
        worker = Thread(target=do_stuff, args=(q,))
        worker.setDaemon(True)
        worker.start()

    for x in range(1, 3000000):
        if x < 10:
            url = "http://www.xxyx.com/?i=" + str(x) + "&plot=short&r=json"
        elif x < 100:
            url = "http://www.xxyx.com/?i=tt00000" + str(x) + "&plot=short&r=json"
        elif x < 1000:
            url = "http://www.xxyx.com/?i=0" + str(x) + "&plot=short&r=json"
        elif x < 10000:
            url = "http://www.xxyx.com/?i=00" + str(x) + "&plot=short&r=json"
        elif x < 100000:
            url = "http://www.xxyx.com/?i=000" + str(x) + "&plot=short&r=json"
        elif x < 1000000:
            url = "http://www.xxyx.com/?i=0000" + str(x) + "&plot=short&r=json"
        else:
            url = "http://www.xxyx.com/?i=00000" + str(x) + "&plot=short&r=json"
        q.put(url)

    q.join()
    csvfile.close()
    print("done")
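As shazow said, it's not a matter of threads but of the timeouts at which each thread gets data from the server. Try to include a delay in your code:

            finally:
                sleep(50)  # needs "from time import sleep"; the argument is in seconds
                q.task_done()

It could also be improved by generating adaptive timeouts: for example, measure how much data you successfully got, and if that number decreases, increase the sleep time, and vice versa.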
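The adaptive idea could be sketched roughly as follows. This is only an illustration, not code from the answer: the class name `AdaptiveDelay`, its parameters, and the double-on-failure / halve-on-success rule are all assumptions, and it treats each request as a simple success or failure instead of measuring the amount of data received.

```python
import threading


class AdaptiveDelay:
    """Hypothetical helper: a pause shared by all workers that grows when
    requests fail and shrinks when they succeed."""

    def __init__(self, initial=1.0, minimum=0.1, maximum=60.0, factor=2.0):
        self._lock = threading.Lock()  # workers update the delay concurrently
        self._delay = initial
        self._minimum = minimum
        self._maximum = maximum
        self._factor = factor

    def record(self, success):
        """Update the delay after one request and return the new value (seconds)."""
        with self._lock:
            if success:
                # Things are going well: shorten the pause, down to the minimum.
                self._delay = max(self._minimum, self._delay / self._factor)
            else:
                # A request failed: lengthen the pause, up to the maximum.
                self._delay = min(self._maximum, self._delay * self._factor)
            return self._delay

    @property
    def current(self):
        with self._lock:
            return self._delay
```

In the worker from the question, one shared instance could replace the fixed `sleep(50)`: set a flag to `True` after `writer.writerow(...)` succeeds, then call `sleep(delay.record(flag))` in the `finally` block before `q.task_done()`. The starting values are guesses and would need tuning against the real server.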