scrapy - Python LinkExtractor to go to next pages doesn't work


The next piece of code is my attempt at crawling a site with more than 1 page... I'm having trouble getting the Rule class working. What am I doing wrong?

#import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from tutorial.items import SkodaItem

class SkodaSpider(CrawlSpider):
    name = "skodas"
    allowed_domains = ["marktplaats.nl"]
    start_urls = [
        "http://www.marktplaats.nl/z/auto-s/skoda/octavia-trekhaak-stationwagon.html?categoryid=151&pricefrom=1.000%2c00&priceto=15.000%2c00&yearfrom=2010&mileageto=150.000&attributes=s%2c1185&attributes=s%2c484&attributes=m%2c11564&startdatefrom=always"
    ]

    rules = [
        Rule(LinkExtractor(restrict_xpaths=('//a[@class="button secondary medium pagination-next"]/a',)), follow=True),
    ]

#    def parse_item(self, response):
    def parse(self, response):
        #self.logger.info('Hi, this is an item page! %s', response.url)
        x = 0
        items = []
        for sel in response.xpath('//*[@id="search-results"]/section[2]/article'):
            x = x + 1
            item = SkodaItem()
            item["title"] = sel.xpath('//*[@id="search-results"]/section[2]/article['+str(x)+']/div/div[1]/div[1]/h2/a/span').re('.+>(.+)</span>')
            #print sel.xpath('//*[@id="search-results"]/section[2]/article['+str(x)+']/div/div[1]/div[1]/h2/a/span').extract()
            item["leeftijd"] = sel.xpath('//*[@id="search-results"]/section[2]/article['+str(x)+']/div/div[1]/div[2]/span[1]').re('.+">(.+)</span>')
            item["prijs"] = sel.xpath('//*[@id="search-results"]/section[2]/article['+str(x)+']/div/div[2]/div[1]/div/div').re('.+\n +(.+)\n.+')
            item["km"] = sel.xpath('//*[@id="search-results"]/section[2]/article['+str(x)+']/div/div[1]/div[2]/span[3]').re('.+">(.+)</span>')

            # handle output (print or save to database)
            items.append(item)
            print item["title"], item["leeftijd"], item["prijs"], item["km"]
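
For context, the question does not show the SkodaItem it imports from tutorial.items; judging by the fields assigned above, it would be a scrapy.Item along these lines (a sketch inferred from the spider, not the asker's actual file):

import scrapy

class SkodaItem(scrapy.Item):
    # field names inferred from the item[...] assignments in the spider
    title = scrapy.Field()
    leeftijd = scrapy.Field()   # "age" in Dutch
    prijs = scrapy.Field()      # "price"
    km = scrapy.Field()         # mileage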

A few things to change:

When writing crawl spider rules, avoid using parse as the callback, since CrawlSpider uses the parse method itself to implement its logic. If you override the parse method, the crawl spider will no longer work. (A minimal sketch of this failure mode follows the list below.)

  • as mentioned in the comments, the XPath needs fixing by removing the /a at the end (an <a> inside an <a> does not match any element)
  • CrawlSpider rules need a callback method if you want to extract items from the followed pages
  • to parse elements from the start URLs, you need to define a parse_start_url method
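
To make the first point concrete, here is a minimal sketch of the failure mode and the fix (the spider names are made up for illustration):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class BrokenSpider(CrawlSpider):
    name = "broken"
    rules = [Rule(LinkExtractor(), follow=True)]

    # WRONG: this shadows CrawlSpider.parse, so the rules above are never applied
    def parse(self, response):
        pass

class FixedSpider(CrawlSpider):
    name = "fixed"
    rules = [Rule(LinkExtractor(), callback='parse_item', follow=True)]

    # OK: a differently named callback leaves CrawlSpider.parse intact
    def parse_item(self, response):
        pass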

This minimalistic CrawlSpider follows 3 pages from your sample input and prints out how many "articles" there are on each page:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class SkodaSpider(CrawlSpider):
    name = "skodas"
    allowed_domains = ["marktplaats.nl"]
    start_urls = [
        "http://www.marktplaats.nl/z/auto-s/skoda/octavia-trekhaak-stationwagon.html?categoryid=151&pricefrom=1.000%2c00&priceto=15.000%2c00&yearfrom=2010&mileageto=150.000&attributes=s%2c1185&attributes=s%2c484&attributes=m%2c11564&startdatefrom=always"
    ]

    rules = [
        Rule(LinkExtractor(restrict_xpaths=('//a[@class="button secondary medium pagination-next"]',)),
             follow=True,
             callback='parse_page'),
    ]

    def parse_page(self, response):
        articles = response.css('#search-results > section + section > article')
        self.logger.info('%d articles' % len(articles))

    # define this, otherwise "parse_page" will not be called for the URLs in start_urls
    parse_start_url = parse_page

Output:

$ scrapy runspider 001.py
2016-02-09 11:07:16 [scrapy] INFO: Scrapy 1.0.4 started (bot: scrapybot)
2016-02-09 11:07:16 [scrapy] INFO: Optional features available: ssl, http11
2016-02-09 11:07:16 [scrapy] INFO: Overridden settings: {}
2016-02-09 11:07:16 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2016-02-09 11:07:16 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-02-09 11:07:16 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-02-09 11:07:16 [scrapy] INFO: Enabled item pipelines:
2016-02-09 11:07:16 [scrapy] INFO: Spider opened
2016-02-09 11:07:16 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-02-09 11:07:16 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-02-09 11:07:16 [scrapy] DEBUG: Crawled (200) <GET http://www.marktplaats.nl/z/auto-s/skoda/octavia-trekhaak-stationwagon.html?categoryid=151&pricefrom=1.000%2c00&priceto=15.000%2c00&yearfrom=2010&mileageto=150.000&attributes=s%2c1185&attributes=s%2c484&attributes=m%2c11564&startdatefrom=always> (referer: None)
2016-02-09 11:07:16 [skodas] INFO: 32 articles
2016-02-09 11:07:17 [scrapy] DEBUG: Crawled (200) <GET http://www.marktplaats.nl/z/auto-s/skoda.html?attributes=s%2c1185+s%2c484+m%2c11564&categoryid=151&currentpage=2&mileageto=150.000&pricefrom=1.000%2c00&priceto=15.000%2c00&yearfrom=2010> (referer: http://www.marktplaats.nl/z/auto-s/skoda/octavia-trekhaak-stationwagon.html?categoryid=151&pricefrom=1.000%2c00&priceto=15.000%2c00&yearfrom=2010&mileageto=150.000&attributes=s%2c1185&attributes=s%2c484&attributes=m%2c11564&startdatefrom=always)
2016-02-09 11:07:17 [skodas] INFO: 30 articles
2016-02-09 11:07:17 [scrapy] DEBUG: Crawled (200) <GET http://www.marktplaats.nl/z/auto-s/skoda.html?attributes=s%2c1185+s%2c484+m%2c11564&categoryid=151&currentpage=3&mileageto=150.000&pricefrom=1.000%2c00&priceto=15.000%2c00&yearfrom=2010> (referer: http://www.marktplaats.nl/z/auto-s/skoda.html?attributes=s%2c1185+s%2c484+m%2c11564&categoryid=151&currentpage=2&mileageto=150.000&pricefrom=1.000%2c00&priceto=15.000%2c00&yearfrom=2010)
2016-02-09 11:07:17 [skodas] INFO: 7 articles
2016-02-09 11:07:17 [scrapy] INFO: Closing spider (finished)
2016-02-09 11:07:17 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1919,
 'downloader/request_count': 3,
 'downloader/request_method_count/GET': 3,
 'downloader/response_bytes': 96682,
 'downloader/response_count': 3,
 'downloader/response_status_count/200': 3,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2016, 2, 9, 10, 7, 17, 638179),
 'log_count/DEBUG': 4,
 'log_count/INFO': 10,
 'request_depth_max': 2,
 'response_received_count': 3,
 'scheduler/dequeued': 3,
 'scheduler/dequeued/memory': 3,
 'scheduler/enqueued': 3,
 'scheduler/enqueued/memory': 3,
 'start_time': datetime.datetime(2016, 2, 9, 10, 7, 16, 452272)}
2016-02-09 11:07:17 [scrapy] INFO: Spider closed (finished)
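
If you also want to populate the SkodaItem fields rather than just count articles, parse_page can be extended along these lines (a drop-in replacement for the method in the spider above, assuming from tutorial.items import SkodaItem is added to the imports). This is only a sketch: the relative XPath expressions are guesses adapted from the absolute selectors in the question and would need checking against the live page:

    def parse_page(self, response):
        # select relative to each article node instead of indexing absolute paths
        for article in response.css('#search-results > section + section > article'):
            item = SkodaItem()
            # relative selectors below are assumptions; adapt to the actual markup
            item["title"] = article.xpath('.//h2/a/span/text()').extract_first()
            item["leeftijd"] = article.xpath('.//div[2]/span[1]/text()').extract_first()
            item["prijs"] = article.xpath('.//div[2]/div[1]/div/div/text()').extract_first()
            item["km"] = article.xpath('.//div[2]/span[3]/text()').extract_first()
            yield item

Because parse_start_url is aliased to parse_page, items are extracted from the first page as well as from the followed pages.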
