2016-02-08

I have a Scrapy spider that scrapes all of the names but only some of the stories from https://www.cancercarenorthwest.com/survivor-stories. The XPath I defined fails to capture some of the stories: the spider extracts part of the content and leaves the rest out.

# -*- coding: utf-8 -*-
import scrapy
from scrapy.contrib.loader import ItemLoader
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from cancerstories.items import CancerstoriesItem


class LungcancerSpider(CrawlSpider):
    name = "lungcancer"
    allowed_domains = ["coloncancercoalition.org"]
    start_urls = (
        'http://www.coloncancercoalition.org/community/stories/survivor-stories/',
    )
    rules = (
        Rule(SgmlLinkExtractor(allow=[r'http://www.coloncancercoalition.org/\d+/\d+/\d+/\w+']),
             callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        Li = ItemLoader(item=CancerstoriesItem(), response=response)
        Li.add_xpath('name', '/html/body/div[4]/div[1]/div[1]/div/h1/text()')
        Li.add_xpath('story', '//../div/div/p/text()')
        yield Li.load_item()
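The `allow` pattern in the `Rule` restricts link-following to date-style story URLs. The pattern can be checked on its own with plain `re`, outside Scrapy (the sample URLs below are made up for illustration):

```python
import re

# Same pattern as in the spider's Rule: year/month/day/slug paths only.
pattern = r'http://www.coloncancercoalition.org/\d+/\d+/\d+/\w+'

# A date-style story URL matches (hypothetical example URL):
print(bool(re.match(pattern, 'http://www.coloncancercoalition.org/2016/02/08/some-story')))

# The survivor-stories index page does not, so it is only crawled, not parsed:
print(bool(re.match(pattern, 'http://www.coloncancercoalition.org/community/stories/survivor-stories/')))
```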

1 Answer

I think you need to use Join() as an output processor, where Join() is imported like this:

from scrapy.loader.processors import Join

which would join the text of all the paragraphs under the post content.

Thanks a lot! You saved my morning @alecxe – leboMagma