Edited question, linked to the original: Scrapy: Extract data from source and its links
Scrapy getting data from links within table
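From the link https://www.tdcj.state.tx.us/death_row/dr_info/trottiewillielast.html
I want to get the information from the main table as well as the data behind the other 2 links inside the table. I managed to pull the data from one of them, but the problem is following the other link and appending its data to the same row.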
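# texasdeath/items.py
from scrapy import Item, Field


class DeathItem(Item):
    firstName = Field()
    lastName = Field()
    Age = Field()
    Date = Field()
    Race = Field()
    County = Field()
    Message = Field()
    Passage = Field()


# spider module
from urlparse import urljoin  # Python 2; on Python 3 use: from urllib.parse import urljoin
import scrapy
from texasdeath.items import DeathItem

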
class DeathSpider(scrapy.Spider):
    name = "death"
    allowed_domains = ["tdcj.state.tx.us"]
    start_urls = [
        "http://www.tdcj.state.tx.us/death_row/dr_executed_offenders.html"
    ]

    def parse(self, response):
        # One table row per record in the main table.
        sites = response.xpath('//table/tbody/tr')
        for site in sites:
            item = DeathItem()
            item['firstName'] = site.xpath('td[5]/text()').extract()
            item['lastName'] = site.xpath('td[4]/text()').extract()
            item['Age'] = site.xpath('td[7]/text()').extract()
            item['Date'] = site.xpath('td[8]/text()').extract()
            item['Race'] = site.xpath('td[9]/text()').extract()
            item['County'] = site.xpath('td[10]/text()').extract()
            # td[3] links to the page parsed for the last statement,
            # td[2] to the second detail page.
            url = urljoin(response.url, site.xpath("td[3]/a/@href").extract_first())
            url2 = urljoin(response.url, site.xpath("td[2]/a/@href").extract_first())
            if url.endswith("html"):
                # Carry the partially filled item and the second URL along in meta.
                request = scrapy.Request(url, meta={"item": item, "url2": url2}, callback=self.parse_details)
                yield request
            else:
                yield item

    def parse_details(self, response):
        item = response.meta["item"]
        url2 = response.meta["url2"]
        item['Message'] = response.xpath("//p[contains(text(), 'Last Statement')]/following-sibling::p/text()").extract()
        # Chain a second request so the same item also collects data from url2.
        request = scrapy.Request(url2, meta={"item": item}, callback=self.parse_details2)
        return request

    def parse_details2(self, response):
        item = response.meta["item"]
        item['Passage'] = response.xpath("//p/text()").extract_first()
        # The item is complete only here, after both detail pages were visited.
        return item

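I understand how we pass arguments to a request through meta. What is still unclear to me is the flow, and at this point I am not sure whether this is even possible. I have looked at several examples, including the ones below: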
using scrapy extracting data inside links
How can i use multiple requests and pass items in between them in scrapy python
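Technically, the output should mirror the main table, except that each of the two links is replaced by the data found behind it.
I appreciate any help or direction.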
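To make the flow easier to follow, here is a minimal sketch that imitates the request chain in plain Python, without Scrapy. Everything in it is a hypothetical stand-in: `FakeRequest`, `FakeResponse`, the `PAGES` dictionary, the `crawl` loop, and the sample values are made up for illustration and are not part of Scrapy or of the real site. It only shows how one item can travel through `meta` across three callbacks and be emitted once, as a single row.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class FakeRequest:
    """Hypothetical stand-in for scrapy.Request."""
    url: str
    callback: Callable
    meta: dict = field(default_factory=dict)


@dataclass
class FakeResponse:
    """Hypothetical stand-in for a Scrapy response."""
    url: str
    body: str
    meta: dict


# Made-up page contents keyed by made-up URLs.
PAGES = {
    "list": "main table",
    "statement": "last statement text",
    "info": "offender info text",
}


def parse(response):
    # Start the item from the main table, then ask for the first detail page.
    item = {"lastName": "Doe"}
    yield FakeRequest("statement", parse_details,
                      meta={"item": item, "url2": "info"})


def parse_details(response):
    # Fill in the first detail field, then chain the second request.
    item = response.meta["item"]
    item["Message"] = response.body
    yield FakeRequest(response.meta["url2"], parse_details2,
                      meta={"item": item})


def parse_details2(response):
    # Last stop: the item is complete, so it is finally yielded.
    item = response.meta["item"]
    item["Passage"] = response.body
    yield item


def crawl(start_url):
    """Toy scheduler: follow requests until callbacks yield plain items."""
    queue = [FakeRequest(start_url, parse)]
    items = []
    while queue:
        request = queue.pop(0)
        response = FakeResponse(request.url, PAGES[request.url], request.meta)
        for result in request.callback(response):
            if isinstance(result, FakeRequest):
                queue.append(result)
            else:
                items.append(result)
    return items


rows = crawl("list")
```

The point of the sketch is that only the last callback yields the item; the earlier callbacks yield requests that carry it forward, so each main-table row ends up as one combined record.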
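I am now getting this error: 'url2 = response.meta("url2") TypeError: 'dict' object is not callable'. It may be down to how we pass response.meta. – BernardL
It is probably because of how we access the dictionary. – BernardL
Updated my code and linked to my original question. – BernardL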
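The TypeError from the comments can be reproduced with an ordinary dictionary, since `response.meta` behaves like a dict. The sketch below uses a made-up `meta` dictionary and a placeholder URL (both are assumptions for illustration) to contrast calling the dict with subscripting it.

```python
# Hypothetical stand-in for response.meta; the URL is a placeholder.
meta = {"item": {"lastName": "Doe"}, "url2": "http://example.com/info.html"}

# Calling the dict with parentheses, as in response.meta("url2"), fails.
try:
    meta("url2")
except TypeError as exc:
    error_message = str(exc)

# Square brackets read the value stored under the key.
url2 = meta["url2"]

# .get() returns None instead of raising when the key is absent.
missing = meta.get("url3")
```

Here `error_message` holds the same "'dict' object is not callable" text quoted in the comment, while the subscript form returns the stored URL.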