scrapy - Python LinkExtractor to go to next pages doesn't work
The next piece of code is my attempt at crawling the site for more than 1 page... I'm having trouble getting the Rule class working. What am I doing wrong?
# import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from tutorial.items import SkodaItem


class SkodaSpider(CrawlSpider):
    name = "skodas"
    allowed_domains = ["marktplaats.nl"]
    start_urls = [
        "http://www.marktplaats.nl/z/auto-s/skoda/octavia-trekhaak-stationwagon.html?categoryid=151&pricefrom=1.000%2c00&priceto=15.000%2c00&yearfrom=2010&mileageto=150.000&attributes=s%2c1185&attributes=s%2c484&attributes=m%2c11564&startdatefrom=always"
    ]

    rules = [
        Rule(LinkExtractor(restrict_xpaths=('//a[@class="button secondary medium pagination-next"]/a',)), follow=True),
    ]

    # def parse_item(self, response):
    def parse(self, response):
        # self.logger.info('Hi, this is an item page! %s', response.url)
        x = 0
        items = []
        for sel in response.xpath('//*[@id="search-results"]/section[2]/article'):
            x = x + 1
            item = SkodaItem()
            item["title"] = sel.xpath('//*[@id="search-results"]/section[2]/article['+str(x)+']/div/div[1]/div[1]/h2/a/span').re('.+>(.+)</span>')
            # print sel.xpath('//*[@id="search-results"]/section[2]/article['+str(x)+']/div/div[1]/div[1]/h2/a/span').extract()
            item["leeftijd"] = sel.xpath('//*[@id="search-results"]/section[2]/article['+str(x)+']/div/div[1]/div[2]/span[1]').re('.+">(.+)</span>')
            item["prijs"] = sel.xpath('//*[@id="search-results"]/section[2]/article['+str(x)+']/div/div[2]/div[1]/div/div').re('.+\n +(.+)\n.+')
            item["km"] = sel.xpath('//*[@id="search-results"]/section[2]/article['+str(x)+']/div/div[1]/div[2]/span[3]').re('.+">(.+)</span>')

            # handle output (print or save to database)
            items.append(item)
            print item["title"], item["leeftijd"], item["prijs"], item["km"]
A few things to change:
- when using CrawlSpider, you should not redefine the parse() method; that is where the "magic" happens for this particular spider type. From the Scrapy docs:

  When writing crawl spider rules, avoid using parse as a callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.
- as mentioned in the comments, the XPath needs fixing by removing the /a at the end (a link inside a link does not match any element; see the sketch after this list)
- CrawlSpider rules need a callback method if you want to extract items from the followed pages
- to parse the elements on the start URLs, you need to define a parse_start_url method
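A quick way to see why the trailing /a matches nothing is to run the expression against a pagination link with Scrapy's Selector. The HTML fragment below is invented for illustration, not taken from marktplaats.nl:

from scrapy.selector import Selector

# simplified pagination link, made up for this demo
html = '<a class="button secondary medium pagination-next" href="?currentpage=2">Next</a>'
sel = Selector(text=html)

# the original expression looks for an <a> *inside* the <a> -> no matches
print sel.xpath('//a[@class="button secondary medium pagination-next"]/a').extract()
# []

# without the trailing /a, the link itself matches and its href can be extracted
print sel.xpath('//a[@class="button secondary medium pagination-next"]/@href').extract()
# [u'?currentpage=2']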
Here is a minimalistic CrawlSpider following 3 pages from your sample input, and printing out how many "articles" there are on each page:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class SkodaSpider(CrawlSpider):
    name = "skodas"
    allowed_domains = ["marktplaats.nl"]
    start_urls = [
        "http://www.marktplaats.nl/z/auto-s/skoda/octavia-trekhaak-stationwagon.html?categoryid=151&pricefrom=1.000%2c00&priceto=15.000%2c00&yearfrom=2010&mileageto=150.000&attributes=s%2c1185&attributes=s%2c484&attributes=m%2c11564&startdatefrom=always"
    ]

    rules = [
        Rule(LinkExtractor(restrict_xpaths=('//a[@class="button secondary medium pagination-next"]',)),
             follow=True,
             callback='parse_page'),
    ]

    def parse_page(self, response):
        articles = response.css('#search-results > section + section > article')
        self.logger.info('%d articles' % len(articles))

    # define this, otherwise "parse_page" is not called for the URLs in start_urls
    parse_start_url = parse_page
Output:

$ scrapy runspider 001.py
2016-02-09 11:07:16 [scrapy] INFO: Scrapy 1.0.4 started (bot: scrapybot)
2016-02-09 11:07:16 [scrapy] INFO: Optional features available: ssl, http11
2016-02-09 11:07:16 [scrapy] INFO: Overridden settings: {}
2016-02-09 11:07:16 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2016-02-09 11:07:16 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-02-09 11:07:16 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-02-09 11:07:16 [scrapy] INFO: Enabled item pipelines:
2016-02-09 11:07:16 [scrapy] INFO: Spider opened
2016-02-09 11:07:16 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-02-09 11:07:16 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-02-09 11:07:16 [scrapy] DEBUG: Crawled (200) <GET http://www.marktplaats.nl/z/auto-s/skoda/octavia-trekhaak-stationwagon.html?categoryid=151&pricefrom=1.000%2c00&priceto=15.000%2c00&yearfrom=2010&mileageto=150.000&attributes=s%2c1185&attributes=s%2c484&attributes=m%2c11564&startdatefrom=always> (referer: None)
2016-02-09 11:07:16 [skodas] INFO: 32 articles
2016-02-09 11:07:17 [scrapy] DEBUG: Crawled (200) <GET http://www.marktplaats.nl/z/auto-s/skoda.html?attributes=s%2c1185+s%2c484+m%2c11564&categoryid=151&currentpage=2&mileageto=150.000&pricefrom=1.000%2c00&priceto=15.000%2c00&yearfrom=2010> (referer: http://www.marktplaats.nl/z/auto-s/skoda/octavia-trekhaak-stationwagon.html?categoryid=151&pricefrom=1.000%2c00&priceto=15.000%2c00&yearfrom=2010&mileageto=150.000&attributes=s%2c1185&attributes=s%2c484&attributes=m%2c11564&startdatefrom=always)
2016-02-09 11:07:17 [skodas] INFO: 30 articles
2016-02-09 11:07:17 [scrapy] DEBUG: Crawled (200) <GET http://www.marktplaats.nl/z/auto-s/skoda.html?attributes=s%2c1185+s%2c484+m%2c11564&categoryid=151&currentpage=3&mileageto=150.000&pricefrom=1.000%2c00&priceto=15.000%2c00&yearfrom=2010> (referer: http://www.marktplaats.nl/z/auto-s/skoda.html?attributes=s%2c1185+s%2c484+m%2c11564&categoryid=151&currentpage=2&mileageto=150.000&pricefrom=1.000%2c00&priceto=15.000%2c00&yearfrom=2010)
2016-02-09 11:07:17 [skodas] INFO: 7 articles
2016-02-09 11:07:17 [scrapy] INFO: Closing spider (finished)
2016-02-09 11:07:17 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1919,
 'downloader/request_count': 3,
 'downloader/request_method_count/GET': 3,
 'downloader/response_bytes': 96682,
 'downloader/response_count': 3,
 'downloader/response_status_count/200': 3,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2016, 2, 9, 10, 7, 17, 638179),
 'log_count/DEBUG': 4,
 'log_count/INFO': 10,
 'request_depth_max': 2,
 'response_received_count': 3,
 'scheduler/dequeued': 3,
 'scheduler/dequeued/memory': 3,
 'scheduler/enqueued': 3,
 'scheduler/enqueued/memory': 3,
 'start_time': datetime.datetime(2016, 2, 9, 10, 7, 16, 452272)}
2016-02-09 11:07:17 [scrapy] INFO: Spider closed (finished)
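From here, getting actual items out is just a matter of doing the extraction work in the callback and yielding the items. Below is a minimal sketch that plugs the question's SkodaItem into the spider above; the relative XPaths for the fields are assumptions and would need checking against the real page markup:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from tutorial.items import SkodaItem


class SkodaSpider(CrawlSpider):
    name = "skodas"
    allowed_domains = ["marktplaats.nl"]
    start_urls = [
        "http://www.marktplaats.nl/z/auto-s/skoda/octavia-trekhaak-stationwagon.html?categoryid=151&pricefrom=1.000%2c00&priceto=15.000%2c00&yearfrom=2010&mileageto=150.000&attributes=s%2c1185&attributes=s%2c484&attributes=m%2c11564&startdatefrom=always"
    ]

    rules = [
        Rule(LinkExtractor(restrict_xpaths=('//a[@class="button secondary medium pagination-next"]',)),
             follow=True,
             callback='parse_page'),
    ]

    def parse_page(self, response):
        # iterate over the article nodes and use XPaths *relative* to each
        # article (note the leading dot), instead of indexing absolute
        # paths with a manual counter as in the question
        for article in response.css('#search-results > section + section > article'):
            item = SkodaItem()
            # assumed field locations -- verify against the actual markup
            item["title"] = article.xpath('.//h2/a/span/text()').extract()
            item["prijs"] = article.xpath('.//div[2]/div[1]/div/div/text()').extract()
            yield item

    # also run the callback on the start URL page itself
    parse_start_url = parse_page

Yielded items then flow through any configured item pipelines, so there is no need to collect them in a list or print them manually.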