Scrapy XPath: return the parent node based on a regular expression match

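I'm trying to use Scrapy to collect information from websites recursively. The starting point is a site that lists URLs. I get those URLs with Scrapy using the following code. Step 1: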

def parse(self, response):
    for href in response.css('.column a::attr(href)'):
        full_url = response.urljoin(href.extract())
        yield {'url': full_url}
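
As a side note on what `response.urljoin` does in step 1: it resolves a possibly relative `href` against the page URL, much like the standard library's `urllib.parse.urljoin`. The sketch below uses only the standard library; the base URL and the href values are made up for illustration and are not taken from the real site.

```python
from urllib.parse import urljoin

# Hypothetical page URL and href values, for illustration only.
base_url = "http://localhost/test/index.html"
hrefs = ["page2.html", "/about", "http://example.com/x"]

# Relative links are resolved against the page; absolute ones pass through.
full_urls = [urljoin(base_url, href) for href in hrefs]
for url in full_urls:
    print(url)
```

Relative paths are resolved against the directory of the page, root-relative paths against the host, and absolute URLs are left unchanged.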

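Then, for each of those URLs, I look for specific URLs that contain a keyword. (I'm doing each step separately for now because I'm new to Scrapy; in the end I want to run everything with one spider.) Step 2: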

def parse(self, response):
    for href in response.xpath('//a[contains(translate(@href,"ABCDEFGHIJKLMNOPQRSTUVWXYZ","abcdefghijklmnopqrstuvwxyz"),"keyword")]/@href'):
        full_url = response.urljoin(href.extract())
        yield {'url': full_url}
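
The `translate(...)` call in step 2 is the usual XPath 1.0 workaround for a case-insensitive `contains`, since XPath 1.0 has no `lower-case()` function. The same check can be sketched in plain Python; `href_contains` and the sample hrefs are hypothetical names and values used only for illustration.

```python
import string

# Same mapping as the XPath translate() call: ASCII A-Z -> a-z only.
TO_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)

def href_contains(href, keyword):
    """Return True if keyword occurs in href, ignoring ASCII case."""
    return keyword in href.translate(TO_LOWER)

# Hypothetical hrefs, for illustration only.
hrefs = ["/docs/KeyWord-list.html", "/docs/other.html", "/KEYWORD"]
matching = [h for h in hrefs if href_contains(h, "keyword")]
print(matching)
```

Like the XPath version, this only folds the 26 ASCII letters, so the keyword itself has to be written in lower case.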

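So far so good, but then comes the last step.

Step 3: I want to get specific information from the returned URLs, if it is there. This is where I run into trouble ;o) What I'm trying to accomplish:

Search for elements whose value/content matches the regular expression ([0-9][0-9][0-9][0-9].*[A-Z][A-Z]) >> this matches 1234AB and/or 1234 AB
Return the whole parent div (later, if possible, I'd like to return the two parents above the match when there is no parent div, but that is for later).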
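
A quick way to check what such a pattern accepts is to run it outside Scrapy with Python's `re` module. The sketch below compares the pattern from the question with a stricter variant (four digits, one optional whitespace character, two capital letters); the names `LOOSE` and `STRICT` and the sample strings are assumptions made for illustration.

```python
import re

# Pattern from the question: ".*" allows any run of characters in between.
LOOSE = re.compile(r"[0-9][0-9][0-9][0-9].*[A-Z][A-Z]")
# Stricter variant: at most one whitespace character in between.
STRICT = re.compile(r"[0-9]{4}\s?[A-Z]{2}")

samples = ["1234AB", "1234 AB", "1234 and then XY", "1234 ab"]
for text in samples:
    print(repr(text), bool(LOOSE.search(text)), bool(STRICT.search(text)))
```

Both patterns accept `1234AB` and `1234 AB`, but only the loose one accepts `1234 and then XY`, which is probably not intended.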

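So given the HTML below, I want to return the content of the parent div (<div class="contenttxt"> in this example). Note that I don't know the class in advance, so I can't match on it.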

<html> 
    <head> 
     <title>Webpage</title> 
    </head> 
    <body> 
     <h1 class="bookTitle">A very short ebook</h1> 
     <p style="text-align:right">some text</p> 
      <div class="contenttxt"> 
      <h1>Info</h1> 
     <h4>header text</h4> 

     <p>something<br /> 
     1234 AB</p> 

     <p>somthing else</p> 
     </div> 
     <h2 class="chapter">Chapter One</h2> 
     <p>This is a truly fascinating chapter.</p> 

     <h2 class="chapter">Chapter Two</h2> 
     <p>A worthy continuation of a fine tradition.</p> 
    </body> 
</html>

The code I tried. First of all:

2016-05-31 18:59:32 [scrapy] INFO: Spider opened 
2016-05-31 18:59:32 [scrapy] DEBUG: Crawled (200) <GET http://localhost/test/test.html> (referer: None) 
[s] Available Scrapy objects: 
[s] crawler <scrapy.crawler.Crawler object at 0x7f6bc2be0e90> 
[s] item  {} 
[s] request <GET http://localhost/test/test.html> 
[s] response <200 http://localhost/test/test.html> 
[s] settings <scrapy.settings.Settings object at 0x7f6bc2be0d10> 
[s] spider  <DefaultSpider 'default' at 0x7f6bc2643b90> 
[s] Useful shortcuts: 
[s] shelp()   Shell help (print this help) 
[s] fetch(req_or_url) Fetch request (or URL) and update local objects 
[s] view(response) View response in a browser 
>>> response.xpath('//*').re('([0-9][0-9][0-9][0-9].*[A-Z][A-Z])') 
[u'1234 AB', u'1234 AB', u'1234 AB', u'1234 AB']

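First, it returns the match 4 times, so at least it finds something. I searched for "scrapy xpath return parent node", but that only gave me a "solution" for getting a single result: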

>>> response.xpath('//*/../../../..').re('([0-9][0-9][0-9][0-9].*[A-Z][A-Z])') 
[u'1234 AB']
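
On the four matches: `//*` selects every element, and `.re()` is applied to each selected node separately, so every element that encloses the text produces its own match. In the sample page those are `html`, `body`, `div` and `p`. The sketch below reproduces that count with the standard library's `xml.etree.ElementTree` instead of Scrapy. It is only an approximation under stated assumptions: the page is trimmed to the relevant well-formed markup, and the pattern is checked against each element's text content rather than its serialized HTML.

```python
import re
import xml.etree.ElementTree as ET

# The sample page, trimmed to the relevant, well-formed markup.
PAGE = """<html>
  <head><title>Webpage</title></head>
  <body>
    <h1 class="bookTitle">A very short ebook</h1>
    <div class="contenttxt">
      <h1>Info</h1>
      <p>something<br />
      1234 AB</p>
      <p>somthing else</p>
    </div>
    <h2 class="chapter">Chapter One</h2>
  </body>
</html>"""

PATTERN = re.compile(r"[0-9]{4}\s?[A-Z]{2}")

root = ET.fromstring(PAGE)
# root.iter() visits every element, similar in spirit to the XPath //*.
matching_tags = [
    el.tag for el in root.iter()
    if PATTERN.search("".join(el.itertext()))
]
print(matching_tags)
```

Each of the four listed elements contains the same text, which is why the same string comes back four times.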

I also tried something like:

>>> for nodes in response.xpath('//*').re('([0-9][0-9][0-9][0-9].*[A-Z][A-Z])'): 
...  for i in nodes.xpath('ancestor:://*'): 
...   print i 
... 
Traceback (most recent call last): 
    File "<console>", line 2, in <module> 
AttributeError: 'unicode' object has no attribute 'xpath'

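But that didn't help. Hopefully someone can point me in the right direction: first, because I don't understand why the regex matches 4 times, and second, because I have no idea how to get to where I want to be. I have gone through most of the promising results that "Questions that may already have your answer" showed, but did not find my solution. My best guess is that I have to build some kind of loop, but again, I have no clue.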

In the end I'm trying to get output that contains the URLs found in steps 1 and 2, together with the results of step 3.

Thanks! KR, Onno.

Source

2016-05-31 user3262645

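The re method extracts data after the XPath selector has already selected the elements of interest; check the documentation for more information. If you know the element (probably a div in this case), you can either loop over all divs and check their content, or use Scrapy's built-in support for regular expressions in XPath. Using the example above, something like this: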

response.xpath(r'//div[re:test(., "[0-9]{4}\s?[A-Z]{2}")]').extract()

returns

[u'<div class="contenttxt">\n   <h1>Info</h1>\n  <h4>header text</h4>\n\n  <p>something<br>\n  1234 AB</p>\n\n  <p>somthing else</p>\n  </div>']
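
For the follow-up wish in the question (return the enclosing div, or the ancestor two levels up when there is no div), the logic can be sketched outside Scrapy with the standard library. This is a minimal sketch under stated assumptions, not Scrapy code: it uses `xml.etree.ElementTree` on well-formed markup, `own_text` and `container_for` are hypothetical helper names, and the `section`/`article` branch is made up to exercise the fallback. In XPath the upward step would use the `ancestor::` axis, written with a node test directly after the double colon (for example `ancestor::div`), not `ancestor:://*`.

```python
import re
import xml.etree.ElementTree as ET

# Trimmed sample page plus a made-up branch without any div.
PAGE = """<html><body>
  <div class="contenttxt">
    <h1>Info</h1>
    <p>something<br />
    1234 AB</p>
  </div>
  <section><article><p>5678 CD</p></article></section>
</body></html>"""

PATTERN = re.compile(r"[0-9]{4}\s?[A-Z]{2}")

root = ET.fromstring(PAGE)
# ElementTree has no parent pointers, so build a child -> parent map.
parent_of = {child: parent for parent in root.iter() for child in parent}

def own_text(el):
    """Text directly inside el: its .text plus the tails of its children."""
    return (el.text or "") + "".join(child.tail or "" for child in el)

def container_for(el, levels=2):
    """Nearest enclosing div, else the ancestor `levels` steps up."""
    node = el
    while node in parent_of:
        node = parent_of[node]
        if node.tag == "div":
            return node
    node = el
    for _ in range(levels):
        node = parent_of.get(node, node)
    return node

# Only elements whose own text matches, so each hit is reported once.
hits = [el for el in root.iter() if PATTERN.search(own_text(el))]
containers = [container_for(el) for el in hits]
print([(c.tag, c.get("class")) for c in containers])
```

Matching on an element's own text, rather than on everything nested inside it, is what keeps each hit from being reported once per ancestor.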

Source

2016-06-03 20:44:47 Wilfredo
