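Getting data from multiple links while storing it in one item in Scrapy

I'm currently working through the basics of web scraping with Scrapy, and have run into a problem where a particular item is duplicated rather than extended.
The first page I scrape has a selection of links that I need to follow in order to scrape other pages. These links are stored as item['link'].
My problem is that when I iterate over these links, with a request nested inside the loop, the results are not appended to the original item instance but are returned as new instances instead.
The results look something like this: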
{'date': [u'29 June 2015', u'15 September 2015'],
 'desc': [u'Audit Committee - 29 June 2015',
          u'Audit Committee - 15 September 2015'],
 'link': [u'/Council/Council-and-Committee-Minutes/Audit-Committee/2015/Audit-Committee-29-June-2015',
          u'/Council/Council-and-Committee-Minutes/Audit-Committee/2015/Audit-Committee-15-September-2015'],
 'pdf_url': 'http://www.antrimandnewtownabbey.gov.uk/Council/Council-and-Committee-Minutes/Audit-Committee/2015/Audit-Committee-15-September-2015',
 'title': [u'2015']}
{'date': [u'29 June 2015', u'15 September 2015'],
 'desc': [u'Audit Committee - 29 June 2015',
          u'Audit Committee - 15 September 2015'],
 'link': [u'/Council/Council-and-Committee-Minutes/Audit-Committee/2015/Audit-Committee-29-June-2015',
          u'/Council/Council-and-Committee-Minutes/Audit-Committee/2015/Audit-Committee-15-September-2015'],
 'pdf_url': 'http://www.antrimandnewtownabbey.gov.uk/Council/Council-and-Committee-Minutes/Audit-Committee/2015/Audit-Committee-29-June-2015',
 'title': [u'2015']}
whereas I would like them to be contained in a single object, like this:
{'date': [u'29 June 2015', u'15 September 2015'],
 'desc': [u'Audit Committee - 29 June 2015',
          u'Audit Committee - 15 September 2015'],
 'link': [u'/Council/Council-and-Committee-Minutes/Audit-Committee/2015/Audit-Committee-29-June-2015',
          u'/Council/Council-and-Committee-Minutes/Audit-Committee/2015/Audit-Committee-15-September-2015'],
 'pdf_url': [u'http://www.antrimandnewtownabbey.gov.uk/Council/Council-and-Committee-Minutes/Audit-Committee/2015/Audit-Committee-29-June-2015',
             u'http://www.antrimandnewtownabbey.gov.uk/Council/Council-and-Committee-Minutes/Audit-Committee/2015/Audit-Committee-15-September-2015'],
 'title': [u'2015']}
This is my current implementation (based mostly on the Scrapy tutorial):
def parse(self, response):
    for sel in response.xpath('//div[@class="lower-col-right"]'):
        item = CouncilExtractorItem()
        item['title'] = sel.xpath('header[@class="intro user-content font-set clearfix"]/h1/text()').extract()
        item['link'] = sel.xpath('div[@class="user-content"]/section[@class="listing-item"]/a/@href').extract()
        item['desc'] = sel.xpath('div[@class="user-content"]/section[@class="listing-item"]/a/h2/text()').extract()
        item['date'] = sel.xpath('div[@class="user-content"]/section[@class="listing-item"]/span/text()').extract()
        for url in item['link']:
            full_url = response.urljoin(url)
            request = scrapy.Request(full_url, callback=self.parse_page2)
            request.meta['item'] = item
            yield request

def parse_page2(self, response):
    item = response.meta['item']
    item['pdf'] = response.url
    return item
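In the code above, every request in the loop carries the same item and every call to parse_page2 returns it, so the item is emitted once per link with its URL field overwritten each time. One common way around this is to chain the requests: visit one link at a time, append to a list on the item, and emit the item only when no links remain. The following is a minimal sketch of that idea and does not use Scrapy; FakeRequest, FakeResponse, and crawl are hypothetical stand-ins for scrapy.Request, the response object, and the engine, and the pending key and the plain-dict item are assumptions made for illustration.

```python
from collections import deque


class FakeRequest:
    """Hypothetical stand-in for scrapy.Request: URL, callback, meta dict."""

    def __init__(self, url, callback, meta=None):
        self.url = url
        self.callback = callback
        self.meta = meta or {}


class FakeResponse:
    """Hypothetical stand-in for the response handed to a callback."""

    def __init__(self, url, meta):
        self.url = url
        self.meta = meta


def parse(links):
    """Build a single item, then start a chain that visits one link at a time."""
    item = {"link": list(links), "pdf_url": []}
    pending = deque(item["link"])  # links still to be visited
    if not pending:
        yield item
        return
    yield FakeRequest(pending.popleft(), parse_page2,
                      meta={"item": item, "pending": pending})


def parse_page2(response):
    """Append to the shared item; emit it only after the last link."""
    item = response.meta["item"]
    pending = response.meta["pending"]
    item["pdf_url"].append(response.url)
    if pending:
        # More links left: pass the same item along to the next request.
        yield FakeRequest(pending.popleft(), parse_page2, meta=response.meta)
    else:
        # Chain finished: the item is emitted exactly once.
        yield item


def crawl(start_results):
    """Tiny driver that follows requests and collects finished items."""
    queue = deque(start_results)
    items = []
    while queue:
        result = queue.popleft()
        if isinstance(result, FakeRequest):
            response = FakeResponse(result.url, result.meta)
            queue.extend(result.callback(response))
        else:
            items.append(result)
    return items


items = crawl(parse(["/minutes/june", "/minutes/september"]))
print(items)
```

In a real spider the same shape would pass the item and the remaining links through request.meta. The trade-off is that the pages are fetched one after another, and if any request in the chain fails the item is never emitted unless the failure is handled as well.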
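I tried that, but there must be something else wrong with the code that is causing the problem. – user5520937
@user5520937 are you sure you have tried './/' at the beginning of every inner xpath? – alecxe
Yes - that was the last problem I had - thanks for pointing it out – user5520937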
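For context on the './/' hint: in Scrapy, an XPath that begins with '//' and is called on a nested selector still searches the whole document, while one that begins with './/' searches only inside that selector. The sketch below illustrates the scoping difference using the standard library's xml.etree.ElementTree as a stand-in for Scrapy selectors; the markup and URLs are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup, loosely modelled on a page with several listing sections.
DOC = """
<div>
  <section class="listing-item"><a href="/minutes/june">June</a></section>
  <section class="listing-item"><a href="/minutes/september">September</a></section>
</div>
""".strip()

root = ET.fromstring(DOC)

# Searching from the document root finds every link, whichever section it is in.
all_links = [a.get("href") for a in root.findall(".//a")]

# Searching from each section with a leading "." stays inside that section.
per_section = [
    [a.get("href") for a in section.findall(".//a")]
    for section in root.findall(".//section[@class='listing-item']")
]

print(all_links)
print(per_section)
```

If the inner queries in a loop were run against the whole document instead, every iteration would return the full list of links, which is one way to end up with repeated data in each item.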