python - Scrapy spider scrapes content partially and leaves the rest
I have a Scrapy spider defined. It can scrape the names, but the XPath I defined for the stories cannot capture them from https://www.cancercarenorthwest.com/survivor-stories.
# -*- coding: utf-8 -*-
import scrapy
from scrapy.contrib.loader import ItemLoader
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import XmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from cancerstories.items import CancerstoriesItem


class LungcancerSpider(CrawlSpider):
    name = "lungcancer"
    allowed_domains = ["coloncancercoalition.org"]
    start_urls = (
        'http://www.coloncancercoalition.org/community/stories/survivor-stories/',
    )

    # Follow links that look like dated story posts and parse each one.
    rules = (
        Rule(SgmlLinkExtractor(allow=[r'http://www.coloncancercoalition.org/\d+/\d+/\d+/\w+']),
             callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        li = ItemLoader(item=CancerstoriesItem(), response=response)
        li.add_xpath('name', '/html/body/div[4]/div[1]/div[1]/div/h1/text()')
        li.add_xpath('story', '//../div/div/p/text()')
        yield li.load_item()
I think you need to join the texts of the paragraphs under the post content:
li.add_xpath('story', '//div[@class="post-content"]/div/p/text()', Join(" "))
where Join() is an output processor imported as:
from scrapy.loader.processors import Join
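To illustrate what the joining step does, here is a minimal sketch that uses only the standard library rather than Scrapy itself. `xml.etree.ElementTree` stands in for the Scrapy selector and `str.join` stands in for `Join(" ")`. The sample markup and the `join_story` helper are hypothetical and only assume that the story paragraphs sit under a `div` with class `post-content`; the real page layout may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup standing in for one story page (an assumption,
# not copied from the real site).
SAMPLE_PAGE = """
<html><body>
  <div class="post-content">
    <div>
      <p>First paragraph of the story.</p>
      <p>Second paragraph of the story.</p>
    </div>
  </div>
</body></html>
"""


def join_story(page, separator=" "):
    """Collect the text of every <p> under the post-content div and join it."""
    root = ET.fromstring(page)
    paragraphs = root.findall(".//div[@class='post-content']/div/p")
    # Each paragraph yields one text fragment; without joining, the
    # result stays a list of separate strings.
    fragments = [p.text.strip() for p in paragraphs if p.text]
    return separator.join(fragments)


story = join_story(SAMPLE_PAGE)
print(story)
```

The XPath matches one text node per paragraph, so the loader receives a list of fragments. Joining them with a space turns that list into a single story string.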