XPath for the correct element

I am using requests and lxml in Python to do this. Specifically, I want the IDs of the detected topics.

I wrote the following XPath for them:

'//detectedTopic//@id'

This returns nothing.

Whereas the following works without any problems:

'//@id'

Chrome's developer tools show that the first XPath does point to the correct nodes.

So what is wrong with it?

Source

2016-01-05 humblenoob

Can you try the following two expressions? '//detectedTopic/@id' or '//detectedTopic'

If you parse the content with lxml.html, the HTMLParser converts all tag names to lowercase, since HTML is case-insensitive:

import requests 
url = 'http://wikipedia-miner.cms.waikato.ac.nz/services/wikify?source=At%20around%20the%20size%20of%20a%20domestic%20chicken,%20kiwi%20are%20by%20far%20the%20smallest%20living%20ratites%20and%20lay%20the%20largest%20egg%20in%20relation%20to%20their%20body%20size%20of%20any%20species%20of%20bird%20in%20the%20world' 
r = requests.get(url) 
content = r.content 

import lxml.html as LH 
html_root = LH.fromstring(content) 
print(LH.tostring(html_root))

yields

... 
    <detectedtopics> 
     <detectedtopic id="17362" title="Kiwi" weight="0.8601778098224363"></detectedtopic> 
     <detectedtopic id="21780446" title="Species" weight="0.6213590253455182"></detectedtopic> 
     <detectedtopic id="160220" title="Ratite" weight="0.5533763404831633"></detectedtopic> 
     <detectedtopic id="37402" title="Chicken" weight="0.528161911497278"></detectedtopic> 
    </detectedtopics>

But if you use lxml.etree to parse the content as XML, the case of the tag names is preserved:

import lxml.etree as ET 
xml_root = ET.fromstring(content) 
print(ET.tostring(xml_root))

yields

... 
    <detectedTopics> 
     <detectedTopic id="17362" title="Kiwi" weight="0.8601778098224363"/> 
     <detectedTopic id="21780446" title="Species" weight="0.6213590253455182"/> 
     <detectedTopic id="160220" title="Ratite" weight="0.5533763404831633"/> 
     <detectedTopic id="37402" title="Chicken" weight="0.528161911497278"/> 
    </detectedTopics>

The content looks like XML, not HTML, so you should use:

print(xml_root.xpath('//detectedTopic/@id')) 
['17362', '21780446', '160220', '37402']

If the content is parsed as HTML, the tag names in the XPath need to be lowercased:

print(html_root.xpath('//detectedtopic/@id')) 
['17362', '21780446', '160220', '37402']
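
The difference between the two parsers can also be checked offline. The following is a minimal sketch, assuming an inline sample (the SAMPLE bytes below are a hypothetical, abbreviated stand-in for the service response, not the real payload). It runs the same mixed-case XPath against both trees, which should show why the original expression came back empty under lxml.html:

```python
import lxml.etree as ET
import lxml.html as LH

# Hypothetical sample, abbreviated from the response shown above.
SAMPLE = b"""<message>
  <detectedTopics>
    <detectedTopic id="17362" title="Kiwi"></detectedTopic>
    <detectedTopic id="160220" title="Ratite"></detectedTopic>
  </detectedTopics>
</message>"""

xml_root = ET.fromstring(SAMPLE)    # XML parser: tag case is preserved
html_root = LH.fromstring(SAMPLE)   # HTML parser: tag names are lowercased

# XPath name tests are case-sensitive, so the expression must match
# the case that the chosen parser actually stored.
xml_ids = xml_root.xpath('//detectedTopic/@id')
html_ids_mixed = html_root.xpath('//detectedTopic/@id')
html_ids_lower = html_root.xpath('//detectedtopic/@id')

print(xml_ids)
print(html_ids_mixed)
print(html_ids_lower)
```

Only the mixed-case query against the HTML tree is expected to come back empty; the other two should return the same IDs.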

Source

2016-01-05 13:52:08 unutbu

Great answer! I didn't know that 'lxml.html' lowercases tag names, because I am used to using 'lxml.etree' for XML parsing.

You can get the IDs with this:

'//detectedTopic/@id'

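You can also get the tags themselves and extract the attributes you need. For example:

# tr is assumed to be the root element parsed as XML with lxml.etree
for tag in tr.xpath('//detectedTopic'):
    print(tag.attrib.get('id'))
    print(tag.attrib.get('title'))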
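
As a rough illustration of the same idea, the attributes can be collected into plain dictionaries instead of being printed. This is only a sketch: SAMPLE is a hypothetical, abbreviated stand-in for the service response, and the names root and topics are chosen here for illustration.

```python
import lxml.etree as ET

# Hypothetical sample, abbreviated from the response shown earlier.
SAMPLE = b"""<message>
  <detectedTopics>
    <detectedTopic id="17362" title="Kiwi"/>
    <detectedTopic id="37402" title="Chicken"/>
  </detectedTopics>
</message>"""

root = ET.fromstring(SAMPLE)

# Select the elements once, then read each attribute with .get(),
# which returns None if the attribute is missing.
topics = [
    {"id": tag.get("id"), "title": tag.get("title")}
    for tag in root.xpath("//detectedTopic")
]

for topic in topics:
    print(topic["id"], topic["title"])
```

Selecting the elements rather than '@id' directly keeps each ID paired with its title, which a flat list of attribute values would not do.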

Source

2016-01-05 13:51:01
