2016-07-02

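Problem getting text from a span tag

On the linked page, I want to get the text from the span tag with the class r_compare_bars_value. If you search for that class, you will see that the text is 104 (min: 88) fps, and I only want the min: 88 part. My code: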

from bs4 import BeautifulSoup
import urllib.request
import requests
r = urllib.request.urlopen('http://www.notebookcheck.net/Computer-Games-on-Laptop-Graphics-Cards.13849.0.html').read()
soup = BeautifulSoup(r)

links = [a['href'] for a in soup.select(".gpugames_header_games > a")]

for url in links:
    if url != "":
        print(url)
        rr = requests.get(url).content
        soup = BeautifulSoup(rr, "html.parser")

        for aa in soup.select("div.r_compare_bars_value span"):
            print(aa)
            if "min:" in aa.text:
                print(aa.text)

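But right now it does not print just the min: 88 part; it prints a lot of strings from other classes instead. I also tried div.tx-nbc2fe-pi1, and I tried the selector without the span tag. The site's markup is really bad. Where is my mistake, and how can I fix it?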

Answer

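There is no way to do this without manipulating the returned text by splitting, stripping, and so on. Also, r_compare_bars_value is the class of the span itself, not of a div that contains a span, so soup.select("span.r_compare_bars_value") is the correct selector.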
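As a rough sketch of the splitting-and-stripping route, assuming the span text always looks like the 104 (min: 88) fps example from the question (the helper name min_part is made up for illustration and is not part of the original answer):

```python
def min_part(text):
    """Return the 'min: N' part of a string like '104 (min: 88) fps', or None."""
    # Split around the first "(" and then around the ")" that follows it.
    _, opening, rest = text.partition("(")
    inside, closing, _ = rest.partition(")")
    # No parentheses at all, or no "min:" inside them: nothing to return.
    if not opening or not closing or "min:" not in inside:
        return None
    return inside.strip()


print(min_part("104 (min: 88) fps"))  # min: 88
```

This works for the sample string, but it becomes fragile as soon as the text contains other parenthesised parts.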

Actually, this is a good use case for a regular expression:

from bs4 import BeautifulSoup
import requests
import re

# Raw string, so that \( and \) reach the regex engine as escaped parentheses.
mn = re.compile(r"\(min:.*?\)")

r = requests.get('http://www.notebookcheck.net/Computer-Games-on-Laptop-Graphics-Cards.13849.0.html').content
soup = BeautifulSoup(r, "lxml")

# Generator of the game page links found in the header.
links = (a["href"] for a in soup.select(".gpugames_header_games > a"))

for url in links:
    if url:  # skip empty hrefs
        rr = requests.get(url).content
        soup = BeautifulSoup(rr, "html.parser")
        # The class is on the span itself, not on a parent div.
        for aa in soup.select("span.r_compare_bars_value"):
            m = mn.search(aa.text)
            if m:
                print(m.group())

Running the above on a few of the URLs gives you:

(min: 88) 
(min: 164) 
(min: 251) 
(min: 281) 
(min: 283) 
(min: 291) 
(min: 75) 
(min: 129) 
(min: 202) 
(min: 64) 
(min: 94) 
(min: 178) 
(min: 53) 
(min: 97) 
(min: 154) 
(min: 199) 
(min: 289) 
(min: 296) 
(min: 55) 
(min: 78) 
(min: 39) 
(min: 57) 
(min: 109) 
(min: 153) 
(min: 200) 
(min: 216) 
(min: 39) 
(min: 59) 
(min: 110)
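
If only the number is needed rather than the whole (min: 88) string, one possible variation is to put a capture group around the digits. This is a minimal sketch that runs on the sample text from the question instead of the live site; the names MIN_FPS and min_fps are hypothetical, and it assumes the value is always a whole number:

```python
import re

# Same idea as the pattern above, with a capture group around the digits.
MIN_FPS = re.compile(r"\(min:\s*(\d+)\)")


def min_fps(text):
    """Return the minimum fps as an int, or None if there is no '(min: N)'."""
    m = MIN_FPS.search(text)
    return int(m.group(1)) if m else None


print(min_fps("104 (min: 88) fps"))  # 88
```

In the scraping loop, min_fps(aa.text) would then stand in for mn.search(aa.text), with None marking spans that carry no minimum value.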