How can I filter href links by class or div id value when processing a web page with BeautifulSoup in Python?

I have HTML with several different classes and div ids, as follows:

<div id="meat">    
    <div id="tag_nav" class="font2 pad2"> 
Comics: 
<a id="tag_nav_random" href="/random">Random</a> 
<a id="tag_nav_popular" href="/tag/popular">Most Popular</a> 
<a href="/comics">All</a> 
<a href="/tag/cats">Cats</a> 
<a href="/tag/grammar">Grammar</a> 
<a href="/tag/food">Food</a> 
<a href="/tag/animals">Animals</a> 
<a href="/tag/tech">Tech</a>

<li> 
     <div class="bg_comic"> 
     <a href="/comics/mantis_shrimp"><img src="http://s3.amazonaws.com/theoatmeal-img/thumbnails/mantis_shrimp.png" alt="Why the mantis shrimp is my new favorite animal" class="border0" /></a> 
     </div> 
     <div class="category_and_view"> 
    </li>

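I want to extract only the links on my HTML page that belong to the class bg_comic and ignore links in other tags that may belong to different classes. I tried the following, but it does not work: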

links=soup.find_all("a",class_="bg_comic") 
for tag in links: 
    link=tag.get('href',None)

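In the example above, the value of the link variable should be /comics/mantis_shrimp and nothing else. But my code prints nothing.

What am I doing wrong? How can we filter links by class or div id when processing a web page with BeautifulSoup?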

Source

2014-03-02 Ajim Bagwan

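There is no a tag with the class bg_comic in the HTML; the class is on a div tag that contains the a tag.

Modifying your code as follows will solve the problem.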

links = soup.find_all("div", class_="bg_comic")  # a -> div: the class is on the div
for tag in links:
    link = tag.a.get('href', None)  # tag.get -> tag.a.get: read href from the a inside the div
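
As a self-contained illustration of the fix above, here is a minimal sketch. It assumes a trimmed copy of the question's markup with the missing closing tags added, the built-in html.parser backend, and a hypothetical helper name, links_in_divs_with_class, that is not part of BeautifulSoup. It also skips divs that contain no a tag, a case in which tag.a would be None.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the question's markup (assumption: closing tags added).
HTML = """
<div id="meat">
  <div id="tag_nav" class="font2 pad2">
    <a id="tag_nav_random" href="/random">Random</a>
    <a href="/tag/cats">Cats</a>
  </div>
  <ul>
    <li>
      <div class="bg_comic">
        <a href="/comics/mantis_shrimp"><img src="mantis_shrimp.png" alt="" /></a>
      </div>
    </li>
  </ul>
</div>
"""


def links_in_divs_with_class(html, css_class):
    """Return the href of every <a> nested in a <div> carrying css_class."""
    soup = BeautifulSoup(html, "html.parser")
    hrefs = []
    for div in soup.find_all("div", class_=css_class):
        # href=True keeps only <a> tags that actually have an href attribute.
        for a in div.find_all("a", href=True):
            hrefs.append(a["href"])
    return hrefs


print(links_in_divs_with_class(HTML, "bg_comic"))  # ['/comics/mantis_shrimp']
```

The original attempt, soup.find_all("a", class_="bg_comic"), returns an empty list on this markup because no a tag carries that class.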

Alternatively, you can use a CSS selector:

links = soup.select("div.bg_comic a") 
for tag in links: 
    link = tag.get('href', None)
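
The question also asks about filtering by div id. A rough sketch of how the same select call might cover both cases follows. It assumes a trimmed copy of the question's markup, the html.parser backend, and a hypothetical helper named hrefs_for; the selectors div.bg_comic and div#tag_nav are ordinary CSS class and id selectors.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the question's markup (assumption: closing tags added).
HTML = """
<div id="meat">
  <div id="tag_nav" class="font2 pad2">
    <a id="tag_nav_random" href="/random">Random</a>
    <a href="/tag/cats">Cats</a>
  </div>
  <div class="bg_comic">
    <a href="/comics/mantis_shrimp"><img src="mantis_shrimp.png" alt="" /></a>
  </div>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")


def hrefs_for(selector):
    """Return href values of all <a> tags matched by a CSS selector."""
    return [a["href"] for a in soup.select(selector) if a.has_attr("href")]


by_class = hrefs_for("div.bg_comic a")  # links inside <div class="bg_comic">
by_id = hrefs_for("div#tag_nav a")      # links inside <div id="tag_nav">

print(by_class)  # ['/comics/mantis_shrimp']
print(by_id)     # ['/random', '/tag/cats']
```

The descendant combinator (the space before a) matches links at any depth inside the div, so this works even when the a tag is not a direct child.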

Source

2014-03-02 12:13:50 falsetru

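Thank you very much. It works both ways. I am new to BeautifulSoup and tried looking for more examples of find_all and the other soup methods, but could not find any. Could you share some links where I can learn more about web extraction and processing, pattern matching, and so on?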

@AjimBagwan, I suggest you read the [Python tutorial](http://docs.python.org/tutorial). After that, take a look at [What Now?](http://docs.python.org/tutorial/whatnow.html). Books? I don't know of any. - falsetru

@AjimBagwan, for BeautifulSoup, see the [BeautifulSoup documentation](http://www.crummy.com/software/BeautifulSoup/bs4/doc). - falsetru
