Extracting href URLs with Python requests

I want to extract a URL from an XPath using the Python requests package. I can get the text, but nothing I have tried returns the URL. Can anyone help?

ipdb> webpage.xpath(xpath_url + '/text()') 
['Text of the URL'] 
ipdb> webpage.xpath(xpath_url + '/a()') 
*** lxml.etree.XPathEvalError: Invalid expression 
ipdb> webpage.xpath(xpath_url + '/href()') 
*** lxml.etree.XPathEvalError: Invalid expression 
ipdb> webpage.xpath(xpath_url + '/url()') 
*** lxml.etree.XPathEvalError: Invalid expression

I started from this tutorial: http://docs.python-guide.org/en/latest/scenarios/scrape/

It seems like this should be easy, but nothing turned up in my searches.

Thanks.

Source

2015-11-20 Struggling snowman

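Can you provide the value of xpath_url? From the first line it looks like the XPath is being interpreted correctly, but the XPath statements that follow are probably invalid. – jeedo

@jeedo Your comment helped me realize that my XPath ended in 'div/h2/a', so appending '/@href' as in jeremija's answer was enough. Thanks. –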

Have you tried webpage.xpath(xpath_url + '/@href')?

Here is the complete code:

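from lxml import html
import requests

# Download the page and parse the raw bytes into an lxml element tree
page = requests.get('http://econpy.pythonanywhere.com/ex/001.html')
webpage = html.fromstring(page.content)

# '@href' selects the href attribute of every <a> element in the document
print(webpage.xpath('//a/@href'))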

The result should be:

[ 
    'http://econpy.pythonanywhere.com/ex/002.html', 
    'http://econpy.pythonanywhere.com/ex/003.html', 
    'http://econpy.pythonanywhere.com/ex/004.html', 
    'http://econpy.pythonanywhere.com/ex/005.html' 
]

Source

2015-11-20 01:27:45 jeremija

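Thanks! '@href' works. Now I need to understand why it is 'text()' for the text but '@href' for the href. –

I think it is because '@' is used to refer to an attribute of an element, while 'text()' returns the text content of the selected node. – jeremija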
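The difference between 'text()' and '@href' discussed above can be checked offline. The following is a minimal sketch, assuming an inline HTML snippet and an xpath_url that ends in 'div/h2/a' as the asker described; the snippet and the example.com URLs are made up for illustration and are not the asker's actual page.

```python
from lxml import html

# Hypothetical markup standing in for the real page (an assumption)
snippet = """
<html><body>
  <div>
    <h2><a href="http://example.com/page1">First link</a></h2>
    <h2><a href="http://example.com/page2">Second link</a></h2>
  </div>
</body></html>
"""

tree = html.fromstring(snippet)
xpath_url = '//div/h2/a'  # assumed value; selects the <a> elements

# text() is a node test: it selects the text nodes inside each <a>
texts = tree.xpath(xpath_url + '/text()')

# @href is attribute syntax: it selects the href attribute of each <a>
hrefs = tree.xpath(xpath_url + '/@href')

# Equivalent alternative: select the elements and read the attribute in Python
hrefs_via_get = [a.get('href') for a in tree.xpath(xpath_url)]

print(texts)
print(hrefs)
```

Expressions such as '/a()' or '/href()' fail because XPath reads a name followed by parentheses as a function call or node test, and neither of those names is one that can appear as a path step.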

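You would be better served by BeautifulSoup:

import requests
from bs4 import BeautifulSoup

response = requests.get('http://testurl.com')  # placeholder URL; requests needs the scheme
soup = BeautifulSoup(response.text, "lxml")  # lxml is just the parser for reading the HTML
soup.find_all('a', href=True)  # this is the line that does what you want: every <a> tag that has an href

You can print the links, add them to a list, and so on. To iterate over them, use:

links = soup.find_all('a', href=True)
for link in links:
    print(link['href'])  # the URL itself rather than the whole tag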

Source

2015-11-20 01:18:10 n1c9

It looks like bs4 is a popular approach. In this case I would like to keep using Python requests, but this is useful for future reference. Thanks. –
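For future reference, the BeautifulSoup approach can also be tried without any network access. This is a rough sketch, assuming an inline HTML string and Python's built-in "html.parser" backend in place of lxml; the markup and URLs are invented for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup (an assumption); the last anchor has no href
snippet = """
<div>
  <h2><a href="http://example.com/page1">First link</a></h2>
  <h2><a href="http://example.com/page2">Second link</a></h2>
  <a name="top">Not a link</a>
</div>
"""

soup = BeautifulSoup(snippet, "html.parser")

# href=True keeps only the <a> tags that actually carry an href attribute
anchors = soup.find_all('a', href=True)

links = [a['href'] for a in anchors]        # attribute values
labels = [a.get_text() for a in anchors]    # visible link text

for label, link in zip(labels, links):
    print(label, link)
```

Filtering with href=True means indexing with a['href'] cannot raise a KeyError on anchors that have no href.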
