Scraping the href title using lxml and XPath

from lxml import html
import requests

for i in range(44, 530):  # Number of pages plus one
    url = "http://postscapes.com/companies/r/{}".format(i)
    page = requests.get(url)
    tree = html.fromstring(page.content)

    # Select the contact link inside the address block of this page
    contactemail = tree.xpath('//*[@id="rt-mainbody"]/div/div/div[2]/div[4]/address/a')

    print(contactemail)

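I am trying to scrape emails from 900 different pages in a company directory. The HTML is fairly similar on every page. However, contactemail returns element objects rather than text. The XPath above points to the link shown below. I want to extract just the title, [email protected], from that link via XPath, but I don't know where to start. I would also like this to work across the different pages, not just for this one href value or page.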

<a href="mailto:[email protected]">[email protected]</a>

I looked into regular expressions and tried printing contactemail.textcontent(), but it does not work.

Any tips?

Source

2016-03-09 Jonathan T Ho

There are a few possible ways to extract the same value, i.e. the email address, for example:

# get the email address from the inner text of the element:
print(contactemail[0].text)

# get the email address from the href attribute + substring-after():
print(contactemail[0].xpath('substring-after(@href, "mailto:")'))

You can use list comprehension syntax if there may be multiple a elements inside one address parent element:

print([link.text for link in contactemail])
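
As a rough, self-contained illustration of the three approaches above, the sketch below runs them against an inline HTML string instead of the live site. The markup and the addresses info@example.com and sales@example.com are placeholders assumed for the example; they are not taken from the pages in the question.

```python
from lxml import html

# Hypothetical markup modelled on the snippet in the question.
# The example.com addresses are placeholders, not real data from the site.
SAMPLE = """
<div id="rt-mainbody">
  <address>
    <a href="mailto:info@example.com">info@example.com</a>
    <a href="mailto:sales@example.com">sales@example.com</a>
  </address>
</div>
"""

tree = html.fromstring(SAMPLE)
links = tree.xpath('//*[@id="rt-mainbody"]//address/a')

# 1. Inner text of the first matching element.
first_text = links[0].text

# 2. href attribute of the first element, with the "mailto:" prefix
#    stripped by the XPath substring-after() function.
first_href = links[0].xpath('substring-after(@href, "mailto:")')

# 3. Inner text of every matching element.
all_texts = [link.text for link in links]

print(first_text)
print(first_href)
print(all_texts)
```

Reading the href attribute can be the more robust choice when the visible link text differs from the address, since the two need not be the same string.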

Source

2016-03-09 03:11:17 har07

Hey har07, thanks for the reply. The first two raise an index error: list index out of range. The last one works, but it only returns None.

It looks like your XPath could not find the target element. What is the number in the URL when the error occurs? Try simplifying the XPath to: '//*[@id="rt-mainbody"]//address/a' – har07

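I simplified the XPath. It still returns None. As for where the index error occurs, it starts at http://postscapes.com/companies/r/44- and continues all the way to 530. I tried a couple of other techniques, such as 'for elt in contactemail: print(elt.text_content())', but it started returning strange values containing the words "email protected".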
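
The index error discussed in these comments is what indexing with [0] produces when the XPath matches nothing on a page. A minimal sketch of guarding against that case follows. The helper name first_email and both inline HTML strings are hypothetical, made up for illustration, and the sketch does not address the "email protected" values mentioned above.

```python
from lxml import html


def first_email(tree):
    """Return the first mailto address under <address>, or None if absent.

    first_email is a hypothetical helper, not part of lxml.
    """
    # Simplified XPath from the comments; selecting @href directly yields
    # a (possibly empty) list of strings instead of elements.
    hrefs = tree.xpath('//*[@id="rt-mainbody"]//address/a/@href')
    for href in hrefs:
        if href.startswith("mailto:"):
            return href[len("mailto:"):]
    # No match on this page: report None rather than raising IndexError.
    return None


# Placeholder pages: one with a contact link, one without.
with_email = html.fromstring(
    '<div id="rt-mainbody"><address>'
    '<a href="mailto:info@example.com">Contact</a>'
    '</address></div>'
)
without_email = html.fromstring(
    '<div id="rt-mainbody"><p>No contact</p></div>'
)

print(first_email(with_email))
print(first_email(without_email))
```

Inside the scraping loop, a None result could simply be skipped or logged together with the page number, which also shows which pages lack the expected markup.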
