python - Scrapy spider scrapes some content but leaves out the rest


I have a Scrapy spider that is meant to scrape names and stories from https://www.cancercarenorthwest.com/survivor-stories. It captures the names, but the XPath I defined does not capture the stories:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.contrib.loader import ItemLoader
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import XmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from cancerstories.items import cancerstoriesitem


class lungcancerspider(CrawlSpider):
    name = "lungcancer"
    allowed_domains = ["coloncancercoalition.org"]
    start_urls = (
        'http://www.coloncancercoalition.org/community/stories/survivor-stories/',
    )
    rules = (
        Rule(SgmlLinkExtractor(allow=[r'http://www.coloncancercoalition.org/\d+/\d+/\d+/\w+']),
             callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        li = ItemLoader(item=cancerstoriesitem(), response=response)
        li.add_xpath('name', '/html/body/div[4]/div[1]/div[1]/div/h1/text()')
        li.add_xpath('story', '//../div/div/p/text()')

        yield li.load_item()

I think you need to join the texts of the paragraphs under the post content:

li.add_xpath('story', '//div[@class="post-content"]/div/p/text()', Join(" "))

where Join() is one of Scrapy's built-in processors, imported as:

from scrapy.loader.processors import Join
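
To make the effect of the join step concrete, here is a minimal sketch that does not depend on Scrapy. `JoinSketch` is a hypothetical stand-in written for illustration, assuming the real processor simply concatenates the extracted strings with the given separator, and the sample paragraphs are made up rather than taken from the site.

```python
class JoinSketch:
    """Hypothetical stand-in mimicking the assumed behaviour of Scrapy's Join."""

    def __init__(self, separator=" "):
        self.separator = separator

    def __call__(self, values):
        # Concatenate every extracted string into one, using the separator.
        return self.separator.join(values)


# Made-up text nodes, shaped like the list an XPath ending in
# /p/text() would return: one string per paragraph.
paragraphs = [
    "I was diagnosed in 2010.",
    "Treatment lasted six months.",
    "Today I am cancer free.",
]

# Same call shape as the processor passed to add_xpath: build it with a
# separator, then call it on the list of extracted values.
story = JoinSketch(" ")(paragraphs)
print(story)
```

Without a join step, the loader would store each paragraph as a separate list entry. With it, the story field holds the paragraphs as one string.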
