2016-05-17

Edited question; linked to the original: Scrapy: extracting data from the source and its links

Scrapy getting data from links within table

From the link https://www.tdcj.state.tx.us/death_row/dr_info/trottiewillielast.html

I want the information from the main table as well as the data inside two other links within that table. I managed to pull from one link, but the problem is following the other link as well and appending its data to the same row.

from urlparse import urljoin  # Python 2; on Python 3 use urllib.parse

import scrapy
from scrapy.item import Item, Field  # DeathItem is defined below, so import Item/Field rather than texasdeath.items


class DeathItem(Item):
    firstName = Field()
    lastName = Field()
    Age = Field()
    Date = Field()
    Race = Field()
    County = Field()
    Message = Field()
    Passage = Field()


class DeathSpider(scrapy.Spider):
    name = "death"
    allowed_domains = ["tdcj.state.tx.us"]
    start_urls = [
        "http://www.tdcj.state.tx.us/death_row/dr_executed_offenders.html"
    ]

    def parse(self, response):
        sites = response.xpath('//table/tbody/tr')
        for site in sites:
            item = DeathItem()

            item['firstName'] = site.xpath('td[5]/text()').extract()
            item['lastName'] = site.xpath('td[4]/text()').extract()
            item['Age'] = site.xpath('td[7]/text()').extract()
            item['Date'] = site.xpath('td[8]/text()').extract()
            item['Race'] = site.xpath('td[9]/text()').extract()
            item['County'] = site.xpath('td[10]/text()').extract()

            url = urljoin(response.url, site.xpath("td[3]/a/@href").extract_first())
            url2 = urljoin(response.url, site.xpath("td[2]/a/@href").extract_first())
            if url.endswith("html"):
                request = scrapy.Request(url, meta={"item": item, "url2": url2}, callback=self.parse_details)
                yield request
            else:
                yield item

    def parse_details(self, response):
        item = response.meta["item"]
        url2 = response.meta["url2"]
        item['Message'] = response.xpath("//p[contains(text(), 'Last Statement')]/following-sibling::p/text()").extract()
        request = scrapy.Request(url2, meta={"item": item}, callback=self.parse_details2)
        return request

    def parse_details2(self, response):
        item = response.meta["item"]
        item['Passage'] = response.xpath("//p/text()").extract_first()
        return item

I understand how we pass arguments to a request via meta. But the flow is still unclear to me, and at this point I am not sure whether this is even feasible. I have seen several examples, including the following:

using scrapy extracting data inside links

How can i use multiple requests and pass items in between them in scrapy python

Technically, the output should mirror the main table, with the data from its two links included in the same row.

Appreciate any help or direction.

Answer

The problem in this case is in this piece of code:

if url.endswith("html"):
    yield scrapy.Request(url, meta={"item": item}, callback=self.parse_details)
else:
    yield item

if url2.endswith("html"):
    yield scrapy.Request(url2, meta={"item": item}, callback=self.parse_details2)
else:
    yield item

Each request you yield starts a new "thread" that takes its own course through the chain, so parse_details will never see what parse_details2 is doing. The way I would do it is to chain one callback into the other, like this:

url = urljoin(response.url, site.xpath("td[2]/a/@href").extract_first())
url2 = urljoin(response.url, site.xpath("td[3]/a/@href").extract_first())

if url.endswith("html"):
    request = scrapy.Request(url, callback=self.parse_details)
    request.meta['item'] = item
    request.meta['url2'] = url2
    yield request
elif url2.endswith("html"):
    request = scrapy.Request(url2, callback=self.parse_details2)
    request.meta['item'] = item
    yield request
else:
    yield item


def parse_details(self, response):
    item = response.meta["item"]
    url2 = response.meta["url2"]
    item['Message'] = response.xpath("//p[contains(text(), 'Last Statement')]/following-sibling::p/text()").extract()
    if url2:
        request = scrapy.Request(url2, callback=self.parse_details2)
        request.meta['item'] = item
        yield request
    else:
        yield item

This code has not been fully tested, so leave a comment as you test it.
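The chaining pattern above can be sketched outside Scrapy as well. The `FakeResponse` class and the hard-coded field values below are hypothetical stand-ins purely for illustration (in a real spider, Scrapy builds the response and its meta dict for you); the point it demonstrates is that `response.meta` is a plain dict, so it is indexed with brackets, never called like a function.

```python
# Minimal stand-in simulation of the chained-callback pattern above.
# FakeResponse and the hard-coded strings are hypothetical placeholders.

class FakeResponse(object):
    def __init__(self, meta):
        # meta is a plain dict: index it with brackets, never call it
        self.meta = meta

def parse_details(response):
    item = response.meta["item"]   # response.meta("item") would raise TypeError
    item["Message"] = "last statement text"
    url2 = response.meta["url2"]
    if url2:
        # chain: forward the half-filled item to the second callback
        return parse_details2(FakeResponse({"item": item}))
    return item

def parse_details2(response):
    item = response.meta["item"]
    item["Passage"] = "offender information text"
    return item

result = parse_details(FakeResponse({"item": {}, "url2": "dr_info/x.html"}))
# result now holds both fields collected across the chain
```

Because the second callback is reached only through the first, the item accumulates fields in order instead of being split across two independent request "threads".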

Got it sorted; I was getting an error `url2 = response.meta("url2")` TypeError: 'dict' object is not callable. Probably how we were accessing response.meta. – BernardL

It was down to how we access the dictionary. – BernardL

Updated my code and linked to my original question. – BernardL