multithreading - Scraping with a multithreaded queue + urllib3 suffers a drastic slowdown

August 15, 2012

I am trying to scrape a huge number of URLs (approximately 3 million) that contain JSON-formatted data in the shortest time possible. To achieve this, I have Python code (Python 3) that uses Queue, multithreading, and urllib3. Everything works fine during the first 3 minutes, then the code begins to slow down, and then it appears to be totally stuck. I have read everything I could find on this issue, but unfortunately the solution seems to require knowledge that lies far beyond me.

I tried to limit the number of threads: it did not fix anything. I also tried to limit the maxsize of my queue and to change the socket timeout, but that did not help either. The distant server is not blocking me nor blacklisting me, as I am able to re-launch my script any time I want with good results in the beginning (the code starts to slow down at a pretty random time). Besides, sometimes my internet connection seems to be cut, as I cannot surf on any website, but this specific issue does not appear every time.

Here is my code (go easy on me please, I'm a beginner):

    #!/usr/bin/env python
    import urllib3, json, csv
    from queue import Queue
    from threading import Thread

    csvfile = open("x.csv", 'wt', newline="")
    writer = csv.writer(csvfile, delimiter=";")
    writer.writerow(('a', 'b', 'c', 'd'))

    def do_stuff(q):
        http = urllib3.connectionpool.connection_from_url('http://www.xxyx.com/', maxsize=30, timeout=20, block=True)

        while True:
            try:
                url = q.get()
                url1 = http.request('GET', url)
                doc = json.loads(url1.data.decode('utf8'))
                writer.writerow((doc['a'], doc['b'], doc['c'], doc['d']))
            except:
                print(url)
            finally:
                q.task_done()

    q = Queue(maxsize=200)
    num_threads = 15

    for i in range(num_threads):
        worker = Thread(target=do_stuff, args=(q,))
        worker.setDaemon(True)
        worker.start()

    for x in range(1, 3000000):
        if x < 10:
            url = "http://www.xxyx.com/?i=" + str(x) + "&plot=short&r=json"
        elif x < 100:
            url = "http://www.xxyx.com/?i=tt00000" + str(x) + "&plot=short&r=json"
        elif x < 1000:
            url = "http://www.xxyx.com/?i=0" + str(x) + "&plot=short&r=json"
        elif x < 10000:
            url = "http://www.xxyx.com/?i=00" + str(x) + "&plot=short&r=json"
        elif x < 100000:
            url = "http://www.xxyx.com/?i=000" + str(x) + "&plot=short&r=json"
        elif x < 1000000:
            url = "http://www.xxyx.com/?i=0000" + str(x) + "&plot=short&r=json"
        else:
            url = "http://www.xxyx.com/?i=00000" + str(x) + "&plot=short&r=json"

        q.put(url)

    q.join()
    csvfile.close()
    print("done")

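As shazow said, it's not a matter of threads, but of the timeouts at which each thread is getting data from the server. Try to include a timeout in your code:

    finally:
        sleep(50)  # needs "from time import sleep" at the top of the script
        q.task_done()

It could also be improved by generating adaptive timeouts: for example, you could measure how much data you successfully got, and if that number decreases, increase the sleep time, and vice versa.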
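The adaptive idea could be sketched roughly as follows. This is only an illustration, not code from the question or from urllib3: the AdaptiveDelay class, its window size, and the doubling/halving factors are all assumptions chosen for the example. It counts successful requests over a fixed window, compares the success rate with the previous window, and raises or lowers a shared delay accordingly.

```python
import threading


class AdaptiveDelay:
    """Hypothetical helper: a shared, thread-safe delay that grows
    when the success rate drops and shrinks when it recovers."""

    def __init__(self, initial=0.5, minimum=0.0, maximum=60.0, window=20):
        self.delay = initial        # seconds each worker should sleep
        self.minimum = minimum
        self.maximum = maximum
        self.window = window        # requests per measurement window (assumed value)
        self._lock = threading.Lock()
        self._successes = 0
        self._seen = 0
        self._last_rate = None      # success rate of the previous window

    def record(self, ok):
        """Record one request outcome; adjust the delay once per window."""
        with self._lock:
            self._seen += 1
            if ok:
                self._successes += 1
            if self._seen < self.window:
                return
            rate = self._successes / self._seen
            if self._last_rate is not None:
                if rate < self._last_rate:
                    # Fewer successes than in the previous window: back off.
                    self.delay = min(self.maximum, max(self.delay, 0.1) * 2)
                elif rate > self._last_rate or rate == 1.0:
                    # Recovering, or every request succeeded: speed up again.
                    self.delay = max(self.minimum, self.delay / 2)
            self._last_rate = rate
            self._successes = 0
            self._seen = 0
```

In the worker, one shared instance would replace the fixed sleep: call record(True) after a successful writerow, record(False) in the except branch, and sleep(delay.delay) in the finally branch before q.task_done(). The right window and limits depend on the server and would need tuning.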
