Scrapy: parse items from the first page, then get additional items from the follow-up links

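Update: I was able to get this moving, but it doesn't go into the subpages and repeat the sequence. The data I'm trying to extract is laid out in a table.

I need to first collect DATE_1 and source_1, then go into the linked article, and repeat...

Any help would be greatly appreciated. :)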

from scrapy.spiders import Spider  # BaseSpider is the deprecated old name for Spider
from scrapy.http import Request
from dirbot.items import WebsiteLoader


class DindexSpider(Spider):
    name = "dindex"
    allowed_domains = ["newslookup.com"]
    start_urls = [
        "http://www.newslookup.com/Business/"
    ]

    def parse_subpage(self, response):
        # Second step: finish the loader that was started on the listing page.
        self.log("Scraping: " + response.url)
        il = response.meta['il']
        time = response.xpath('//div[@id="update_data"]//td[@class="stime3"]//text()').extract()
        il.add_value('publish_date', time)
        yield il.load_item()

    def parse(self, response):
        # First step: one loader per article cell on the listing page.
        self.log("Scraping: " + response.url)
        sites = response.xpath('//td[@class="article"]')

        for site in sites:
            il = WebsiteLoader(response=response, selector=site)
            il.add_xpath('name', 'a/text()')
            il.add_xpath('url', 'a/@href')
            # As posted: every row requests the listing URL again, not the row's own link.
            yield Request("http://www.newslookup.com/Business/", meta={'il': il}, callback=self.parse_subpage)

Source

2016-02-03 J Fletcher

That's simply because you need to use the CrawlSpider class instead of BaseSpider:

from scrapy.spiders import CrawlSpider 

class DindexSpider(CrawlSpider): 
    # ...

Source

2016-02-03 02:52:19 alecxe

I finally got the spider running, but now I get this error: "il.add_value('stop time', response['time']) TypeError: 'HtmlResponse' object has no attribute '__getitem__'"

Funny, all I did was rename parse_page to parse and it ran. I don't know why.

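Although the spider starts, it still shuts down; extracting certain items from one page and then following the links associated with those items still isn't working.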
