scrapy - Python LinkExtractor to go to next pages doesn't work
The next piece of code is my attempt at crawling the site for more than 1 page... I'm having trouble getting the Rule class working. What am I doing wrong?
# import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from tutorial.items import SkodaItem


class SkodaSpider(CrawlSpider):
    name = "skodas"
    allowed_domains = ["marktplaats.nl"]
    start_urls = [
        "http://www.marktplaats.nl/z/auto-s/skoda/octavia-trekhaak-stationwagon.html?categoryid=151&pricefrom=1.000%2c00&priceto=15.000%2c00&yearfrom=2010&mileageto=150.000&attributes=s%2c1185&attributes=s%2c484&attributes=m%2c11564&startdatefrom=always"
    ]

    rules = [
        Rule(LinkExtractor(restrict_xpaths=('//a[@class="button secondary medium pagination-next"]/a',)), follow=True),
    ]

    # def parse_item(self, response):
    def parse(self, response):
        # self.logger.info('Hi, this is an item page! %s', response.url)
        x = 0
        items = []
        for sel in response.xpath('//*[@id="search-results"]/section[2]/article'):
            x = x + 1
            item = SkodaItem()
            item["title"] = sel.xpath('//*[@id="search-results"]/section[2]/article['+str(x)+']/div/div[1]/div[1]/h2/a/span').re('.+>(.+)</span>')
            # print sel.xpath('//*[@id="search-results"]/section[2]/article['+str(x)+']/div/div[1]/div[1]/h2/a/span').extract()
            item["leeftijd"] = sel.xpath('//*[@id="search-results"]/section[2]/article['+str(x)+']/div/div[1]/div[2]/span[1]').re('.+">(.+)</span>')
            item["prijs"] = sel.xpath('//*[@id="search-results"]/section[2]/article['+str(x)+']/div/div[2]/div[1]/div/div').re('.+\n +(.+)\n.+')
            item["km"] = sel.xpath('//*[@id="search-results"]/section[2]/article['+str(x)+']/div/div[1]/div[2]/span[3]').re('.+">(.+)</span>')

            # handle output (print or save to database)
            items.append(item)
            print item["title"], item["leeftijd"], item["prijs"], item["km"]
A few things to change:
- when using CrawlSpider, you should not redefine the parse() method; that is where the "magic" happens for this particular spider type. From the Scrapy docs:

  When writing crawl spider rules, avoid using parse as a callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.
- as mentioned in the comments, the XPath needs fixing by removing the /a at the end (a link inside a link does not match any element; see the sketch after this list)
- CrawlSpider rules need a callback method if you want to extract items from the followed pages
- to parse the elements on the start URLs, you need to define a parse_start_url method
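A quick way to see why the trailing /a matches nothing is to run the expression against a pagination link with Scrapy's Selector. The HTML fragment below is invented for illustration, not taken from marktplaats.nl:

from scrapy.selector import Selector

# simplified pagination link, made up for this demo
html = '<a class="button secondary medium pagination-next" href="?currentpage=2">Next</a>'
sel = Selector(text=html)

# the original expression looks for an <a> *inside* the <a> -> no matches
print sel.xpath('//a[@class="button secondary medium pagination-next"]/a').extract()
# []

# without the trailing /a, the link itself matches and its href can be extracted
print sel.xpath('//a[@class="button secondary medium pagination-next"]/@href').extract()
# [u'?currentpage=2']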
Here is a minimalistic CrawlSpider following 3 pages from your sample input, and printing out how many "articles" there are on each page:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class SkodaSpider(CrawlSpider):
    name = "skodas"
    allowed_domains = ["marktplaats.nl"]
    start_urls = [
        "http://www.marktplaats.nl/z/auto-s/skoda/octavia-trekhaak-stationwagon.html?categoryid=151&pricefrom=1.000%2c00&priceto=15.000%2c00&yearfrom=2010&mileageto=150.000&attributes=s%2c1185&attributes=s%2c484&attributes=m%2c11564&startdatefrom=always"
    ]

    rules = [
        Rule(LinkExtractor(restrict_xpaths=('//a[@class="button secondary medium pagination-next"]',)),
             follow=True,
             callback='parse_page'),
    ]

    def parse_page(self, response):
        articles = response.css('#search-results > section + section > article')
        self.logger.info('%d articles' % len(articles))

    # define this, otherwise "parse_page" is not called for the URLs in start_urls
    parse_start_url = parse_page
Output:

$ scrapy runspider 001.py
2016-02-09 11:07:16 [scrapy] INFO: Scrapy 1.0.4 started (bot: scrapybot)
2016-02-09 11:07:16 [scrapy] INFO: Optional features available: ssl, http11
2016-02-09 11:07:16 [scrapy] INFO: Overridden settings: {}
2016-02-09 11:07:16 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2016-02-09 11:07:16 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-02-09 11:07:16 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-02-09 11:07:16 [scrapy] INFO: Enabled item pipelines:
2016-02-09 11:07:16 [scrapy] INFO: Spider opened
2016-02-09 11:07:16 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-02-09 11:07:16 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-02-09 11:07:16 [scrapy] DEBUG: Crawled (200) <GET http://www.marktplaats.nl/z/auto-s/skoda/octavia-trekhaak-stationwagon.html?categoryid=151&pricefrom=1.000%2c00&priceto=15.000%2c00&yearfrom=2010&mileageto=150.000&attributes=s%2c1185&attributes=s%2c484&attributes=m%2c11564&startdatefrom=always> (referer: None)
2016-02-09 11:07:16 [skodas] INFO: 32 articles
2016-02-09 11:07:17 [scrapy] DEBUG: Crawled (200) <GET http://www.marktplaats.nl/z/auto-s/skoda.html?attributes=s%2c1185+s%2c484+m%2c11564&categoryid=151&currentpage=2&mileageto=150.000&pricefrom=1.000%2c00&priceto=15.000%2c00&yearfrom=2010> (referer: http://www.marktplaats.nl/z/auto-s/skoda/octavia-trekhaak-stationwagon.html?categoryid=151&pricefrom=1.000%2c00&priceto=15.000%2c00&yearfrom=2010&mileageto=150.000&attributes=s%2c1185&attributes=s%2c484&attributes=m%2c11564&startdatefrom=always)
2016-02-09 11:07:17 [skodas] INFO: 30 articles
2016-02-09 11:07:17 [scrapy] DEBUG: Crawled (200) <GET http://www.marktplaats.nl/z/auto-s/skoda.html?attributes=s%2c1185+s%2c484+m%2c11564&categoryid=151&currentpage=3&mileageto=150.000&pricefrom=1.000%2c00&priceto=15.000%2c00&yearfrom=2010> (referer: http://www.marktplaats.nl/z/auto-s/skoda.html?attributes=s%2c1185+s%2c484+m%2c11564&categoryid=151&currentpage=2&mileageto=150.000&pricefrom=1.000%2c00&priceto=15.000%2c00&yearfrom=2010)
2016-02-09 11:07:17 [skodas] INFO: 7 articles
2016-02-09 11:07:17 [scrapy] INFO: Closing spider (finished)
2016-02-09 11:07:17 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1919,
 'downloader/request_count': 3,
 'downloader/request_method_count/GET': 3,
 'downloader/response_bytes': 96682,
 'downloader/response_count': 3,
 'downloader/response_status_count/200': 3,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2016, 2, 9, 10, 7, 17, 638179),
 'log_count/DEBUG': 4,
 'log_count/INFO': 10,
 'request_depth_max': 2,
 'response_received_count': 3,
 'scheduler/dequeued': 3,
 'scheduler/dequeued/memory': 3,
 'scheduler/enqueued': 3,
 'scheduler/enqueued/memory': 3,
 'start_time': datetime.datetime(2016, 2, 9, 10, 7, 16, 452272)}
2016-02-09 11:07:17 [scrapy] INFO: Spider closed (finished)
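From here, getting actual items out is just a matter of doing the extraction work in the callback and yielding the items. Below is a minimal sketch that plugs the question's SkodaItem into the spider above; the relative XPaths for the fields are assumptions and would need checking against the real page markup:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from tutorial.items import SkodaItem


class SkodaSpider(CrawlSpider):
    name = "skodas"
    allowed_domains = ["marktplaats.nl"]
    start_urls = [
        "http://www.marktplaats.nl/z/auto-s/skoda/octavia-trekhaak-stationwagon.html?categoryid=151&pricefrom=1.000%2c00&priceto=15.000%2c00&yearfrom=2010&mileageto=150.000&attributes=s%2c1185&attributes=s%2c484&attributes=m%2c11564&startdatefrom=always"
    ]

    rules = [
        Rule(LinkExtractor(restrict_xpaths=('//a[@class="button secondary medium pagination-next"]',)),
             follow=True,
             callback='parse_page'),
    ]

    def parse_page(self, response):
        # iterate over the article nodes and use XPaths *relative* to each
        # article (note the leading dot), instead of indexing absolute
        # paths with a manual counter as in the question
        for article in response.css('#search-results > section + section > article'):
            item = SkodaItem()
            # assumed field locations -- verify against the actual markup
            item["title"] = article.xpath('.//h2/a/span/text()').extract()
            item["prijs"] = article.xpath('.//div[2]/div[1]/div/div/text()').extract()
            yield item

    # also run the callback on the start URL page itself
    parse_start_url = parse_page

Yielded items then flow through any configured item pipelines, so there is no need to collect them in a list or print them manually.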