Python - Selenium scraping with multiple URLs


Following a previous question, I'm now trying to scrape multiple pages of a URL (all the pages of games in a given season). I'm also trying to scrape multiple parent URLs (seasons):

    from selenium import webdriver
    import pandas as pd
    import time

    url = ['http://www.oddsportal.com/hockey/austria/ebel-2014-2015/results/#/page/',
           'http://www.oddsportal.com/hockey/austria/ebel-2013-2014/results/#/page/']

    data = []

    for i in url:
        for j in range(1, 8):
            print(i + str(j))
            # a new browser instance is started for every page
            driver = webdriver.PhantomJS()
            driver.implicitly_wait(10)
            driver.get(i + str(j))

            for match in driver.find_elements_by_css_selector("div#tournamentTable tr.deactivate"):
                home, away = match.find_element_by_class_name("table-participant").text.split(" - ")
                date = match.find_element_by_xpath(".//preceding::th[contains(@class, 'first2')][1]").text

                if " - " in date:
                    date, event = date.split(" - ")
                else:
                    event = "not specified"

                data.append({
                    "home": home.strip(),
                    "away": away.strip(),
                    "date": date.strip(),
                    "event": event.strip()
                })

            driver.close()
            time.sleep(3)
            print(str(j) + " ok")

    df = pd.DataFrame(data)
    print(df)

    # ok for 6 results, then socket.error: [Errno 10054] An existing connection was forcibly closed by the remote host
    # ok for 2 results, then infinite load
    # added time.sleep(3)
    # ok for the first result, then infinite load
    # added implicitly_wait
    # no result, infinite load

At first I ran the code twice without either the implicit wait (`driver.implicitly_wait(10)`) or the sleep (`time.sleep(3)`). The first run ended with a socket error. The second run stalled with no error after 2 scraped pages.

Then I added the waits as noted above, and they haven't helped.

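Since the results are not consistent, my guess is that the connection needs to be reset between the end of one loop iteration and the next run. I'd like to know if that's a possible solution and how to implement it. I checked the robots.txt of the site and can't see anything that prevents scraping after a set interval.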

Secondly, say the scraper gets 90% of the pages and then stalls (infinite wait). Is there a way to have it retry that loop after x seconds, so that you save what you've got and retry from the stalled point again?

What you need to do is:

  • reuse the same webdriver instance - do not initialize it in the loop
  • introduce explicit waits - this makes the code more reliable and faster
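
For context on the second point: an explicit wait polls a condition until it becomes true or a timeout expires, instead of sleeping for a fixed time. The sketch below illustrates that idea in plain Python. `wait_until`, `WaitTimeout` and their parameters are hypothetical names used only for illustration; they are not Selenium's API, which offers this behaviour through `WebDriverWait`.

```python
import time


class WaitTimeout(Exception):
    """Raised when the condition is still false after the timeout (hypothetical name)."""


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call `condition` repeatedly until it returns a truthy value.

    Returns that value as soon as it appears, so the caller never waits
    longer than needed; raises WaitTimeout once `timeout` seconds have passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise WaitTimeout("condition not met within %.1f s" % timeout)
        time.sleep(poll_interval)
```

Compared with a fixed `time.sleep(3)`, this returns as soon as the page is ready and fails loudly when it never becomes ready, rather than hanging.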

Implementation:

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    import pandas as pd


    urls = [
        'http://www.oddsportal.com/hockey/austria/ebel-2014-2015/results/#/page/',
        'http://www.oddsportal.com/hockey/austria/ebel-2013-2014/results/#/page/'
    ]

    data = []

    # one driver instance, created once and reused for every page
    driver = webdriver.PhantomJS()
    driver.implicitly_wait(10)
    wait = WebDriverWait(driver, 10)

    for url in urls:
        for page in range(1, 8):
            driver.get(url + str(page))
            # wait for the page to load
            wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div#tournamentTable tr.deactivate")))

            for match in driver.find_elements_by_css_selector("div#tournamentTable tr.deactivate"):
                home, away = match.find_element_by_class_name("table-participant").text.split(" - ")
                date = match.find_element_by_xpath(".//preceding::th[contains(@class, 'first2')][1]").text

                if " - " in date:
                    date, event = date.split(" - ")
                else:
                    event = "not specified"

                data.append({
                    "home": home.strip(),
                    "away": away.strip(),
                    "date": date.strip(),
                    "event": event.strip()
                })

    driver.close()

    df = pd.DataFrame(data)
    print(df)

Prints:

                       away         date          event                home
    0              salzburg  14 apr 2015      play offs     vienna capitals
    1       vienna capitals  12 apr 2015      play offs            salzburg
    2              salzburg  10 apr 2015      play offs     vienna capitals
    3       vienna capitals  07 apr 2015      play offs            salzburg
    4       vienna capitals  31 mar 2015      play offs         liwest linz
    5              salzburg  29 mar 2015      play offs          klagenfurt
    6           liwest linz  29 mar 2015      play offs     vienna capitals
    7            klagenfurt  26 mar 2015      play offs            salzburg
    8       vienna capitals  26 mar 2015      play offs         liwest linz
    9           liwest linz  24 mar 2015      play offs     vienna capitals
    10             salzburg  24 mar 2015      play offs          klagenfurt
    11           klagenfurt  22 mar 2015      play offs            salzburg
    12      vienna capitals  22 mar 2015      play offs         liwest linz
    13              bolzano  20 mar 2015      play offs         liwest linz
    14        fehervar av19  18 mar 2015      play offs     vienna capitals
    15          liwest linz  17 mar 2015      play offs             bolzano
    16      vienna capitals  16 mar 2015      play offs       fehervar av19
    17              villach  15 mar 2015      play offs            salzburg
    18           klagenfurt  15 mar 2015      play offs              znojmo
    19              bolzano  15 mar 2015      play offs         liwest linz
    20          liwest linz  13 mar 2015      play offs             bolzano
    21        fehervar av19  13 mar 2015      play offs     vienna capitals
    22               znojmo  13 mar 2015      play offs          klagenfurt
    23             salzburg  13 mar 2015      play offs             villach
    24           klagenfurt  10 mar 2015      play offs              znojmo
    25      vienna capitals  10 mar 2015      play offs       fehervar av19
    26              bolzano  10 mar 2015      play offs         liwest linz
    27              villach  10 mar 2015      play offs            salzburg
    28          liwest linz  08 mar 2015      play offs             bolzano
    29               znojmo  08 mar 2015      play offs          klagenfurt
    ..                  ...          ...            ...                 ...
    670       twk innsbruck  28 sep 2013  not specified              znojmo
    671         liwest linz  27 sep 2013  not specified            dornbirn
    672             bolzano  27 sep 2013  not specified          graz 99ers
    673          klagenfurt  27 sep 2013  not specified  olimpija ljubljana
    674       fehervar av19  27 sep 2013  not specified            salzburg
    675       twk innsbruck  27 sep 2013  not specified     vienna capitals
    676             villach  27 sep 2013  not specified              znojmo
    677            salzburg  24 sep 2013  not specified  olimpija ljubljana
    678            dornbirn  22 sep 2013  not specified       twk innsbruck
    679          graz 99ers  22 sep 2013  not specified          klagenfurt
    680     vienna capitals  22 sep 2013  not specified             villach
    681       fehervar av19  21 sep 2013  not specified             bolzano
    682            dornbirn  20 sep 2013  not specified             bolzano
    683             villach  20 sep 2013  not specified          graz 99ers
    684              znojmo  20 sep 2013  not specified          klagenfurt
    685  olimpija ljubljana  20 sep 2013  not specified         liwest linz
    686       fehervar av19  20 sep 2013  not specified       twk innsbruck
    687            salzburg  20 sep 2013  not specified     vienna capitals
    688             villach  15 sep 2013  not specified          klagenfurt
    689         liwest linz  15 sep 2013  not specified            dornbirn
    690     vienna capitals  15 sep 2013  not specified       fehervar av19
    691       twk innsbruck  15 sep 2013  not specified            salzburg
    692          graz 99ers  15 sep 2013  not specified              znojmo
    693  olimpija ljubljana  14 sep 2013  not specified            dornbirn
    694             bolzano  14 sep 2013  not specified       fehervar av19
    695          klagenfurt  13 sep 2013  not specified          graz 99ers
    696              znojmo  13 sep 2013  not specified            salzburg
    697  olimpija ljubljana  13 sep 2013  not specified       twk innsbruck
    698             bolzano  13 sep 2013  not specified     vienna capitals
    699         liwest linz  13 sep 2013  not specified             villach

    [700 rows x 4 columns]
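
The second part of the question, retrying a stalled page while keeping what has already been scraped, is not covered by the answer's code. One possible approach is sketched below, assuming the per-page scraping is wrapped in a function that raises an exception when a page fails or times out (an explicit wait raises Selenium's `TimeoutException` in that case). The names `scrape_all`, `scrape_page`, `max_retries` and `retry_delay` are hypothetical and chosen only for this illustration.

```python
import time


def scrape_all(urls, pages, scrape_page, max_retries=3, retry_delay=5.0):
    """Scrape every (url, page) pair, retrying failed pages.

    `scrape_page(url, page)` is assumed to return a list of row dicts and to
    raise an exception if the page stalls. Rows from finished pages are kept
    in `data`, so a failure only costs the page that was in progress.
    """
    data = []
    failed = []  # (url, page) pairs that never succeeded
    for url in urls:
        for page in pages:
            for attempt in range(1, max_retries + 1):
                try:
                    rows = scrape_page(url, page)
                except Exception as exc:  # narrow this to the errors you expect
                    print("%s%s failed (attempt %d): %s" % (url, page, attempt, exc))
                    if attempt < max_retries:
                        time.sleep(retry_delay)  # pause before the next attempt
                else:
                    data.extend(rows)
                    break
            else:
                # the retry loop finished without a successful attempt
                failed.append((url, page))
    return data, failed
```

Here `scrape_page` would hold the `driver.get`, `wait.until` and row-parsing steps for a single page. The pairs returned in `failed` can be run through the same function again later, while the rows already collected in `data` can be written to disk in the meantime.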
