
Python: How do I extract URLs from an HTML page using BeautifulSoup? I have multiple divs in an HTML page, like:

<div class="article-additional-info"> 
A peculiar situation arose in the Supreme Court on Tuesday when two lawyers claimed to be the representative of one of the six accused in the December 16 gangrape case who has sought shifting of t... 
<a class="more" href="http://www.thehindu.com/news/national/gangrape-case-two-lawyers-claim-to-be-engaged-by-accused/article4332680.ece"> 
<span class="arrows">»</span> 
</a> 
</div> 

<div class="article-additional-info"> 
Power consumers in the city will have to brace for spending more on their monthly bills as all three power distribution companies – the Anil Ambani-owned BRPL and BYPL and the Tatas-owned Tata Powe... 
<a class="more" href="http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece"> 
<a class="commentsCount" href="http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece#comments"> 
</div> 

I need to get the <a href=> values from all of the divs with class article-additional-info. I am new to BeautifulSoup.

So the URLs I need are:

"http://www.thehindu.com/news/national/gangrape-case-two-lawyers-claim-to-be-engaged-by-accused/article4332680.ece" 
"http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece" 

What is the best way to achieve this?

Answers


Per your criteria, this returns three URLs (not two) - did you mean to filter out the third?

The basic idea is to iterate through the HTML, pull out only the elements with your class, and then iterate through all of the links inside each of them, pulling out the actual hrefs:

In [1]: from bs4 import BeautifulSoup 

In [2]: html = # your HTML 

In [3]: soup = BeautifulSoup(html, 'html.parser') 

In [4]: for item in soup.find_all(attrs={'class': 'article-additional-info'}): 
   ...:     for link in item.find_all('a'): 
   ...:         print(link.get('href')) 
   ...: 
http://www.thehindu.com/news/national/gangrape-case-two-lawyers-claim-to-be-engaged-by-accused/article4332680.ece 
http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece 
http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece#comments 

This restricts the search to elements tagged with the article-additional-info class, finds all anchor (<a>) tags inside them, and grabs their corresponding href links.
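
If you only want the two article URLs and not the comments link, one way (a sketch, not part of the original answer; the example.com markup below is a stand-in for the question's HTML) is to take only the anchors carrying the more class:

from bs4 import BeautifulSoup

# Stand-in for the question's markup; the real page would be fetched separately.
html = '''
<div class="article-additional-info">
  <a class="more" href="http://example.com/article-1.ece"></a>
</div>
<div class="article-additional-info">
  <a class="more" href="http://example.com/article-2.ece"></a>
  <a class="commentsCount" href="http://example.com/article-2.ece#comments"></a>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
for item in soup.find_all('div', class_='article-additional-info'):
    # Only the "more" anchor carries the article URL; this skips the
    # commentsCount anchor that produced the third result above.
    link = item.find('a', class_='more')
    if link is not None:
        print(link.get('href'))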

The same idea, written with the class_ keyword argument instead of an attrs dict:

from bs4 import BeautifulSoup as BS 

html = # Your HTML 
soup = BS(html, 'html.parser') 
for text in soup.find_all('div', class_='article-additional-info'): 
    for links in text.find_all('a'): 
        print(links.get('href')) 

It prints:

http://www.thehindu.com/news/national/gangrape-case-two-lawyers-claim-to-be-engaged-by-accused/article4332680.ece  
http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece  
http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece#comments 
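
The same extraction can also be written with a CSS selector; bs4's select() accepts standard CSS syntax. A minimal self-contained sketch (the markup here is again a stand-in for the question's):

from bs4 import BeautifulSoup

html = '<div class="article-additional-info"><a href="http://example.com/a.ece"></a></div>'
soup = BeautifulSoup(html, 'html.parser')

# One selector does both steps: divs with the class, then anchors inside them.
for a in soup.select('div.article-additional-info a'):
    print(a.get('href'))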

After working through the documentation I did it the following way. Thanks everyone for your answers; I appreciate them.

>>> import urllib2 
>>> from bs4 import BeautifulSoup 
>>> f = urllib2.urlopen('http://www.thehindu.com/news/cities/delhi/?union=citynews') 
>>> soup = BeautifulSoup(f, 'html.parser') 
>>> for link in soup.select('.article-additional-info'): 
...     print(link.find('a').attrs['href']) 
... 
http://www.thehindu.com/news/cities/Delhi/airport-metro-express-is-back/article4335059.ece 
http://www.thehindu.com/news/cities/Delhi/91-more-illegal-colonies-to-be-regularised/article4335069.ece 
http://www.thehindu.com/news/national/gangrape-case-two-lawyers-claim-to-be-engaged-by-accused/article4332680.ece 
http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece 
http://www.thehindu.com/news/cities/Delhi/nurses-women-groups-demand-safety-audit-of-workplaces/article4331470.ece 
http://www.thehindu.com/news/cities/Delhi/delhi-bpl-families-to-get-12-subsidised-lpg-cylinders/article4328990.ece 
http://www.thehindu.com/news/cities/Delhi/shias-condemn-violence-against-religious-minorities/article4328276.ece 
http://www.thehindu.com/news/cities/Delhi/new-archbishop-of-delhi-takes-over/article4328284.ece 
http://www.thehindu.com/news/cities/Delhi/delhi-metro-to-construct-subway-without-disrupting-traffic/article4328290.ece 
http://www.thehindu.com/life-and-style/Food/going-for-the-kill-in-patparganj/article.ece 
http://www.thehindu.com/news/cities/Delhi/fire-at-janpath-bhavan/article4335068.ece 
http://www.thehindu.com/news/cities/Delhi/fiveyearold-girl-killed-as-school-van-overturns/article4335065.ece 
http://www.thehindu.com/news/cities/Delhi/real-life-stories-of-real-women/article4331483.ece 
http://www.thehindu.com/news/cities/Delhi/women-councillors-allege-harassment-by-male-councillors-of-rival-parties/article4331471.ece 
http://www.thehindu.com/news/cities/Delhi/airport-metro-resumes-today/article4331467.ece 
http://www.thehindu.com/news/national/hearing-today-on-plea-to-shift-trial/article4328415.ece 
http://www.thehindu.com/news/cities/Delhi/protestors-demand-change-in-attitude-of-men-towards-women/article4328277.ece 
http://www.thehindu.com/news/cities/Delhi/bjp-promises-5-lakh-houses-for-poor-on-interestfree-loans/article4328280.ece 
http://www.thehindu.com/life-and-style/metroplus/papad-bidi-and-a-dacoit/article4323219.ece 
http://www.thehindu.com/life-and-style/Food/gharana-of-food-not-just-music/article4323212.ece 
>>> 
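
Note that urllib2 exists only on Python 2; on Python 3 the same fetch-and-select approach would look like this (a sketch, assuming the page still serves the same markup):

from urllib.request import urlopen  # urllib2 was folded into urllib.request
from bs4 import BeautifulSoup

with urlopen('http://www.thehindu.com/news/cities/delhi/?union=citynews') as f:
    soup = BeautifulSoup(f.read(), 'html.parser')

for block in soup.select('.article-additional-info'):
    a = block.find('a')
    if a is not None:  # guard against blocks with no anchor
        print(a['href'])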
Please do not link to your own site again; that counts as [spam](http://stackoverflow.com/help/promotion) on [so]. –
