2017-01-18 110 views
3

I am trying to use Python's BeautifulSoup to parse this HTML

<td height="16" class="listtable_1"><a href="http://steamcommunity.com/profiles/76561198134729239" target="_blank">76561198134729239</a></td> 

and extract the 76561198134729239, but I can't figure out how to do it. My attempt:

import requests 
from lxml import html 
from bs4 import BeautifulSoup 
r = requests.get("http://ppm.rep.tf/index.php?p=banlist&page=154") 
content = r.content 
soup = BeautifulSoup(content, "html.parser") 
element = soup.find("td", 
{ 
    "class":"listtable_1", 
    "target":"_blank" 
}) 
print(element.text) 

Answers

5

There are many such entries in that HTML. To get all of them, you could use the following:

import requests
from bs4 import BeautifulSoup

r = requests.get("http://ppm.rep.tf/index.php?p=banlist&page=154")
soup = BeautifulSoup(r.content, "html.parser")

# Visit every matching cell, then every link inside it that opens in a new tab.
for td in soup.find_all("td", class_="listtable_1"):
    for a in td.find_all("a", href=True, target="_blank"):
        print(a.text)

This returns:

76561198143466239 
76561198094114508 
76561198053422590 
76561198066478249 
76561198107353289 
76561198043513442 
76561198128253254 
76561198134729239 
76561198003749039 
76561198091968935 
76561198071376804 
76561198068375438 
76561198039625269 
76561198135115106 
76561198096243060 
76561198067255227 
76561198036439360 
76561198026089333 
76561198126749681 
76561198008927797 
76561198091421170 
76561198122328638 
76561198104586244 
76561198056032796 
76561198059683068 
76561197995961306 
76561198102013044 
3

"target":"_blank" is an attribute of the anchor tag a inside the td tag. It is not an attribute of the td tag itself.

You can get it like this:

from bs4 import BeautifulSoup 

html=""" 
<td height="16" class="listtable_1"> 
    <a href="http://steamcommunity.com/profiles/76561198134729239" target="_blank"> 
     76561198134729239 
    </a> 
</td>""" 

soup = BeautifulSoup(html, 'html.parser') 

# .strip() drops the newlines and indentation that surround the link text in this sample.
print(soup.find('td', {'class': "listtable_1"}).find('a', {"target":"_blank"}).text.strip())

Output:

76561198134729239 
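
To make the difference concrete, here is a minimal sketch that runs the question's single find() call and the chained lookup against the same markup. The sample string is the td from the question; the variable names (sample_html, combined, cell, link, steam_id) are only illustrative.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question.
sample_html = (
    '<td height="16" class="listtable_1">'
    '<a href="http://steamcommunity.com/profiles/76561198134729239" '
    'target="_blank">76561198134729239</a></td>'
)
soup = BeautifulSoup(sample_html, "html.parser")

# Both attributes are required on the same <td>, so nothing matches.
combined = soup.find("td", {"class": "listtable_1", "target": "_blank"})
print(combined)  # None

# Chained lookup: match the <td> by class, then the <a> inside it by target.
cell = soup.find("td", {"class": "listtable_1"})
link = cell.find("a", {"target": "_blank"}) if cell is not None else None
steam_id = link.get_text(strip=True) if link is not None else None
print(steam_id)
```

Because the combined lookup returns None, calling .text on its result raises an AttributeError, which is most likely why the attempt in the question fails.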
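2

"class":"listtable_1" belongs to the td tag and target="_blank" belongs to the a tag, so you should not use them together in one lookup.

You should use the text Steam Community as an anchor to find the number that follows it.

Alternatively, use the URL. It contains the information you need and is easy to locate, so you can find the link and split its URL on /: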

import re

# Match every link whose href mentions steamcommunity, then keep the last path segment.
for a in soup.find_all('a', href=re.compile(r'steamcommunity')):
    num = a['href'].split('/')[-1]
    print(num)
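
A self-contained sketch of the URL-splitting idea, runnable without fetching the page. The helper name steam_ids_from_links and the second sample link (with a trailing slash) are assumptions added for illustration. Only the first link comes from the question.

```python
import re
from bs4 import BeautifulSoup

# First link is from the question; the second is a hypothetical variant
# with a trailing slash, added to show why rstrip("/") can be useful.
sample_html = """
<table><tr>
<td class="listtable_1"><a href="http://steamcommunity.com/profiles/76561198134729239" target="_blank">76561198134729239</a></td>
<td class="listtable_1"><a href="http://steamcommunity.com/profiles/76561198143466239/" target="_blank">profile</a></td>
</tr></table>
"""


def steam_ids_from_links(soup):
    """Return the last path segment of every steamcommunity link, if numeric."""
    ids = []
    for a in soup.find_all("a", href=re.compile(r"steamcommunity")):
        # Drop a trailing slash first, otherwise split("/")[-1] would be "".
        last = a["href"].rstrip("/").split("/")[-1]
        if last.isdigit():
            ids.append(last)
    return ids


soup = BeautifulSoup(sample_html, "html.parser")
ids = steam_ids_from_links(soup)
print(ids)
```

Reading the ID from the href rather than the link text means the result does not depend on what text the page displays for the link.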

Code for the Steam Community approach:

import requests
from bs4 import BeautifulSoup

r = requests.get("http://ppm.rep.tf/index.php?p=banlist&page=154")
soup = BeautifulSoup(r.content, "html.parser")

# Find each label cell, then read the cell that follows it in the same row.
for td in soup.find_all('td', string="Steam Community"):
    num = td.find_next_sibling('td').text
    print(num)

Output:

76561198143466239 
76561198094114508 
76561198053422590 
76561198066478249 
76561198107353289 
76561198043513442 
76561198128253254 
76561198134729239 
76561198003749039 
76561198091968935 
76561198071376804 
76561198068375438 
76561198039625269 
76561198135115106 
76561198096243060 
76561198067255227 
76561198036439360 
76561198026089333 
76561198126749681 
76561198008927797 
76561198091421170 
76561198122328638 
76561198104586244 
76561198056032796 
76561198059683068 
76561197995961306 
76561198102013044 
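
The sibling lookup can be tried offline with a small sketch. The row layout below (a label cell containing Steam Community followed directly by the value cell) is an assumption about the page, inferred from the answer's code. The real markup may differ.

```python
from bs4 import BeautifulSoup

# Assumed row layout: a label cell followed by the value cell.
sample_html = """
<table>
<tr><td class="listtable_1">Steam Community</td><td height="16" class="listtable_1"><a href="http://steamcommunity.com/profiles/76561198134729239" target="_blank">76561198134729239</a></td></tr>
</table>
"""
soup = BeautifulSoup(sample_html, "html.parser")

ids = []
for label in soup.find_all("td", string="Steam Community"):
    # The value lives in the next <td> of the same row.
    value = label.find_next_sibling("td")
    if value is not None:
        ids.append(value.get_text(strip=True))
print(ids)
```

Checking for None before reading the text keeps the loop from failing on a label cell that has no following cell.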
3

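As others have mentioned, you are trying to check attributes of different elements in a single find() call. Instead, you can chain find() calls as MYGz suggested, or use a single CSS selector: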

soup.select_one("td.listtable_1 a[target=_blank]").get_text() 

If you need to locate multiple elements this way, use select():

for elm in soup.select("td.listtable_1 a[target=_blank]"): 
    print(elm.get_text())
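
A minimal offline sketch of both selector calls, assuming a small sample built from the question's td plus one hypothetical cell whose link has no target attribute. The names selector, first_id and all_ids are illustrative.

```python
from bs4 import BeautifulSoup

# First cell is from the question; the second is a hypothetical cell whose
# link lacks target="_blank" and should therefore be skipped.
sample_html = (
    '<table><tr>'
    '<td height="16" class="listtable_1">'
    '<a href="http://steamcommunity.com/profiles/76561198134729239" '
    'target="_blank">76561198134729239</a></td>'
    '<td class="listtable_1"><a href="#">no target</a></td>'
    '</tr></table>'
)
soup = BeautifulSoup(sample_html, "html.parser")

# td with class listtable_1, then any descendant <a> with target="_blank".
selector = 'td.listtable_1 a[target="_blank"]'

# select_one() returns the first match or None, so guard before reading text.
first = soup.select_one(selector)
first_id = first.get_text(strip=True) if first is not None else None

# select() returns a list of every match (possibly empty).
all_ids = [a.get_text(strip=True) for a in soup.select(selector)]

print(first_id)
print(all_ids)
```

The selector expresses the parent and child conditions in one string, which is the same constraint the chained find() calls apply in two steps.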