Locating elements with lxml.html vs. BeautifulSoup

I am using lxml.html to scrape HTML documents. There is one thing I can do in BeautifulSoup but cannot manage with lxml.html. Here it is:

from BeautifulSoup import BeautifulSoup 
import re 

doc = ['<html>', 
'<h2> some text </h2>', 
'<p> some more text </p>', 
'<table> <tr> <td> A table</td> </tr> </table>', 
'<h2> some special text </h2>', 
'<p> some more text </p>', 
'<table> <tr> <td> The table I want </td> </tr> </table>', 
'</html>'] 
soup = BeautifulSoup(''.join(doc)) 
print soup.find(text=re.compile("special")).findNext('table')

I tried this with cssselect, but without success. Any ideas on how to find this element using the methods in lxml.html?

Many thanks, d

Source

2011-04-23 djas

Why do you need `re.compile` for the constant string "special"? – 2011-04-23 14:58:28

Also, I personally have always found `BeautifulSoup` more convenient than `lxml` for HTML "scraping". – 2011-04-23 15:00:11

Hi @Eli, thanks for your comment. I am not sure why I need `re.compile` either, but the fact is that `print soup.find(text="special").findNext('table')` does not work. Also, it seems `BeautifulSoup` is no longer maintained; see http://tiny.cc/d1lir. – djas 2011-04-23 15:45:00

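You can use regular expressions in an lxml XPath by using the EXSLT syntax. For example, given your document, this matches the parent node whose text matches the regular expression spe.*al and then selects the sibling nodes that follow it: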

import lxml.html

NS = 'http://exslt.org/regular-expressions'
# DOC is the question's HTML as a single string, i.e. ''.join(doc)
tree = lxml.html.fromstring(DOC)

# select sibling table nodes after matching node
path = "//*[re:test(text(), 'spe.*al')]/following-sibling::table"
print(tree.xpath(path, namespaces={'re': NS}))

# select all sibling nodes after matching node
path = "//*[re:test(text(), 'spe.*al')]/following-sibling::*"
print(tree.xpath(path, namespaces={'re': NS}))

Output:

[<Element table at 7fe21acd3f58>] 
[<Element p at 7f76ac2c3f58>, <Element table at 7f76ac2e6050>]
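
As a side note that is not part of the original answer: for a constant substring such as "special", a regular expression may not be needed at all, since XPath 1.0 has a built-in contains() function. The sketch below rebuilds the question's document as one string (the name DOC and the trailing [1] predicate are choices made here for illustration) and keeps only the nearest following sibling table, which is roughly what findNext('table') returns in this document.

```python
import lxml.html

# The sample document from the question, joined into one string.
DOC = (
    "<html>"
    "<h2> some text </h2>"
    "<p> some more text </p>"
    "<table> <tr> <td> A table</td> </tr> </table>"
    "<h2> some special text </h2>"
    "<p> some more text </p>"
    "<table> <tr> <td> The table I want </td> </tr> </table>"
    "</html>"
)

tree = lxml.html.fromstring(DOC)

# contains() is plain XPath 1.0, so no EXSLT namespace is required.
# The trailing [1] keeps only the nearest following sibling table.
path = "//*[contains(text(), 'special')]/following-sibling::table[1]"
tables = tree.xpath(path)

for table in tables:
    # text_content() concatenates all text inside the element.
    print(table.text_content().strip())
```

Note that contains(text(), ...) only inspects the first text node of each element, so text that appears after a nested child element would not be matched by this path.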

Source

2011-04-23 15:17:43 samplebias

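Thanks, but what I am looking for is not the parent node of the matching text (here, the h2); it is the element that comes after that node (a sibling in this example, but more generally any element that follows it). – djas 2011-04-23 15:57:17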

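You should be able to use [xpath axes](http://www.w3schools.com/xpath/xpath_axes.asp) to select exactly what you are looking for. I have updated the answer to select the `table` nodes, but you can generalize the path as needed. – samplebias 2011-04-23 16:06:14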
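
One possible way to generalize the path, sketched here under assumptions rather than taken from the thread: the `following` axis selects every node that starts after the matched node in document order, not only its siblings, so `following::table[1]` behaves much like `findNext('table')`. The document below is hypothetical; it nests the heading and the table in separate `div` elements so that `following-sibling` finds nothing.

```python
import lxml.html

NS = 'http://exslt.org/regular-expressions'

# Hypothetical document: the wanted table is NOT a sibling of the h2.
DOC = (
    "<html><body>"
    "<div><h2> some special text </h2></div>"
    "<div><p> some more text </p>"
    "<table><tr><td> The table I want </td></tr></table></div>"
    "<table><tr><td> A later table </td></tr></table>"
    "</body></html>"
)

tree = lxml.html.fromstring(DOC)
match = "//*[re:test(text(), 'spe.*al')]"

# following-sibling only looks at nodes sharing the h2's parent.
siblings = tree.xpath(match + "/following-sibling::table",
                      namespaces={'re': NS})

# following looks at everything after the h2 in document order;
# [1] keeps the first table encountered.
nearest = tree.xpath(match + "/following::table[1]",
                     namespaces={'re': NS})

print(len(siblings))
for table in nearest:
    print(table.text_content().strip())
```

Without the `[1]` predicate the same path would return every later table in the document, which may or may not be what is wanted.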

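samplebias gives a good answer here. XPath is very powerful; much more powerful than the tools BS provides (although you can use the BS parser from within lxml). BS shines with extremely broken HTML, but for ordinary cases lxml is more flexible (at the cost of a larger binary dependency). – 2011-04-23 17:29:58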
