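How to retrieve data inside an href tag

Hey, I've run into some difficulty while web crawling. I'm trying to get the 70 that is embedded in a block of code in the middle of some HTML, and my question is how I would go about doing that. I've tried various approaches, but none of them seem to work. I'm using the BeautifulSoup module and writing in Python 3. The link to the site I'm scraping is in the code below, in case anyone needs it. Thanks in advance.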

<a href="http://www.accuweather.com/en/gb/london/ec4a-2/weather-forecast/328328">London, United Kingdom<span class="temp">70&deg;</span><span class="icon i-33-s"></span></a>

from bs4 import BeautifulSoup
import requests
data = requests.get("http://www.accuweather.com/en/gb/london/ec4a-2/weather-forecast/328328")
soup = BeautifulSoup(data.text, "html.parser")

Source

2016-08-11 goimpress

-1

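from bs4 import BeautifulSoup
import re

# `text` is the HTML string to parse, for example data.text from requests.get
soup = BeautifulSoup(text, "html.parser")
# find_all returns every <a> tag; find would return only the first one
for link in soup.find_all("a"):
    temp = link.find("span", {"class": "temp"})
    # Skip links that have no <span class="temp"> inside them
    if temp is not None:
        print(re.findall(r"[0-9]{1,2}", temp.text))

I hope this helps.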

Source

2016-08-11 22:11:06 ChE

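Thanks for your comment! But that prints out all the links; I'm trying to get the "70" inside the tag – goimpress

Assuming BeautifulSoup is not a strict requirement, you can do this with the html.parser module. The code below is tailored to the use case you mentioned. It receives two data fields and then picks out the one that is a number.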

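from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_data(self, data):
        # Print only the text chunks that consist entirely of digits
        if data.isdigit():
            print(data)

# convert_charrefs=False keeps "&deg;" out of the text, so "70" arrives on its own
# (with the Python 3.5+ default of True the chunk would be "70°" and fail isdigit)
parser = MyHTMLParser(convert_charrefs=False)

parser.feed('<a href="http://www.accuweather.com/en/gb/london/ec4a-2/weather-forecast/328328">London, United Kingdom<span class="temp">70&deg;</span><span class="icon i-33-s"></span></a>')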

This will output 70.

It can also be done with a regular expression.
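For what it's worth, here is a minimal sketch of the regular-expression route, run against the snippet from the question as a fixed string. The pattern and the `extract_temps` helper are my own assumptions, not anything from the page, and a regex like this is brittle against markup changes.

```python
import re

# The snippet from the question, used here as a fixed string.
html = ('<a href="http://www.accuweather.com/en/gb/london/ec4a-2/weather-forecast/328328">'
        'London, United Kingdom<span class="temp">70&deg;</span>'
        '<span class="icon i-33-s"></span></a>')

# Assumed pattern: an integer that directly follows an opening tag
# carrying class="temp". The tag name itself is not checked.
TEMP_PATTERN = re.compile(r'class="temp"[^>]*>\s*(-?\d+)')


def extract_temps(markup):
    """Return every integer found right after an element with class="temp"."""
    return [int(value) for value in TEMP_PATTERN.findall(markup)]


print(extract_temps(html))
```

The digits in the URL are ignored because they do not follow a class="temp" attribute.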

Source

2016-08-11 22:26:25 v2b

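It can be done that way too, but I'm trying to web-scrape a weather site, where 70 is the temperature, and the tag I posted sits in the middle of some HTML – goimpress

This will get you every span with the class temp:

temps = soup.find_all('span',{'class':'temp'})

Then loop over them: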

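for span in temps:
    temp = span.decode_contents()
    # temp looks like "70\xb0" (the number followed by a degree sign), so drop the last character
    print(int(temp[:-1]))

The hard part might be converting from Unicode to ASCII if you are on Python 2.

But the AccuWeather page has no spans with the class temp: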

In [12]: soup.select('[class~=temp]') 
Out[12]: 
[<strong class="temp">19<span>\xb0</span></strong>, 
<strong class="temp">14<span>\xb0</span></strong>, 
<strong class="temp">24<span>\xb0</span></strong>, 
<strong class="temp">23<span>\xb0</span></strong>, 
<h2 class="temp">19\xb0</h2>, 
<h2 class="temp">19\xb0</h2>, 
<h2 class="temp">17\xb0</h2>, 
<h2 class="temp">19\xb0</h2>, 
<h2 class="temp">19\xb0</h2>, 
<h2 class="temp">19\xb0</h2>, 
<h2 class="temp">20\xb0</h2>, 
<h2 class="temp">19\xb0</h2>, 
<h2 class="temp">17\xb0</h2>, 
<h2 class="temp">19\xb0</h2>, 
<h2 class="temp">19\xb0</h2>]

So it's hard to give you an answer.
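Given output like the above, one option is to match on the class alone and ignore the tag name. This is only a sketch: the `sample` markup is modelled on the listing above rather than fetched from the site, and `temps_by_class` is a made-up helper name.

```python
import re
from bs4 import BeautifulSoup

# Sample markup modelled on the listing above; the live page may differ.
sample = ('<strong class="temp">19<span>\xb0</span></strong>'
          '<h2 class="temp">17\xb0</h2>')


def temps_by_class(markup, css_class="temp"):
    """Collect the first integer inside every tag with the given class."""
    soup = BeautifulSoup(markup, "html.parser")
    temps = []
    # find_all with class_ and no tag name matches span, h2, strong, etc.
    for tag in soup.find_all(class_=css_class):
        match = re.search(r"-?\d+", tag.get_text())
        if match:
            temps.append(int(match.group()))
    return temps


print(temps_by_class(sample))
```

Pulling the digits out with a regex avoids having to guess how the degree sign is encoded.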

Source

2016-08-11 22:31:58 kdopen

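At first it looked like it was going to work, but it didn't – goimpress

What went wrong? Works for me – kdopen

Of course, that AccuWeather page no longer uses spans with the class temp. It uses 'h2' and 'strong' instead – kdopen

You need to add a user agent to get the correct source, then select whichever tag/class name you want to use: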

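from bs4 import BeautifulSoup
import requests

# Send a browser-like User-Agent header so the site returns the correct source
headers = {"user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.82 Safari/537.36"}
data = requests.get("http://www.accuweather.com/en/gb/london/ec4a-2/weather-forecast/328328", headers=headers)
# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(data.content, "html.parser")
print(soup.select_one("span.local-temp").text)
print([span.text for span in soup.select("span.temp")])

If we run the code, you'll see we get what we need: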

In [17]: headers = { 
    ....:  "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.82 Safari/537.36"} 

In [18]: data = requests.get("http://www.accuweather.com/en/gb/london/ec4a-2/weather-forecast/328328", headers=headers) 

In [19]: soup = BeautifulSoup(data.content, "html.parser") 

In [20]: print(soup.find("span", "local-temp").text) 
18°C 

In [21]: print("\n".join([span.text for span in soup.select("span.temp")])) 
18° 
31° 
30° 
25°
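The two selector calls used above can also be tried offline. The sketch below runs them on a small hand-written string whose class names mirror the selectors; the markup is an assumption for illustration and not the real page.

```python
from bs4 import BeautifulSoup

# Hand-written markup; only the class names are taken from the selectors above.
sample = ('<span class="local-temp">18\xb0C</span>'
          '<span class="temp">18\xb0</span>'
          '<span class="temp">31\xb0</span>')

soup = BeautifulSoup(sample, "html.parser")

# select_one returns the first element matching the CSS selector, or None.
first = soup.select_one("span.local-temp")
# select returns a list of every matching element (possibly empty).
all_temps = soup.select("span.temp")
# A selector that matches nothing gives None, so check before using .text.
missing = soup.select_one("span.no-such-class")

print(first.text)
print([span.text for span in all_temps])
print(missing)
```

"span.temp" means a span tag whose class list contains temp, which is why the local-temp span is not included in the list.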

Source

2016-08-11 22:49:06

Bro! This works. Thank you so much – goimpress

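No worries. It's always a good idea to check the source returned by requests against the actual source you see in the browser when you right-click and choose View Source. –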
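A rough way to do that check in code is to count how often a selector matches in each copy of the HTML. In this sketch both strings are hypothetical stand-ins: in practice the first would be data.text from requests and the second the source saved from the browser. `count_matches` is an invented helper name.

```python
from bs4 import BeautifulSoup


def count_matches(markup, selector):
    """Return how many elements in the markup match the CSS selector."""
    return len(BeautifulSoup(markup, "html.parser").select(selector))


# Hypothetical stand-ins for the two versions of the page.
html_from_script = '<h2 class="temp">19\xb0</h2>'
html_from_browser = '<span class="temp">19\xb0</span>'

for label, markup in (("script", html_from_script),
                      ("browser", html_from_browser)):
    print(label, count_matches(markup, "span.temp"))
```

If the counts differ, the server is probably sending your script different markup than it sends the browser.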

I have a couple of questions: what is a user agent, and why do you need it? And what do these lines do: soup.select_one("span.local-temp").text and print([span.text for span in soup.select("span.temp")]) – goimpress
