2013-03-02 87 views

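I'm trying to write some Python code to collect music chart data from official websites, but I ran into trouble collecting the Billboard data. I chose BeautifulSoup to handle the HTML, and a strange problem shows up when I try to parse the page with it.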

My environment: python-2.7, beautifulsoup-3.2.0

First, I parse the HTML:

>>> import BeautifulSoup, urllib2, re 
>>> html = urllib2.urlopen('http://www.billboard.com/charts/hot-100?page=1').read() 
>>> soup = BeautifulSoup.BeautifulSoup(html) 

Then I try to collect the data I want, for example the artist name.

HTML:

<div class="listing chart_listing"> 

<article id="node-1491420" class="song_review no_category chart_albumTrack_detail no_divider"> 
    <header> 
    <span class="chart_position position-down">11</span> 
      <h1>Ho Hey</h1> 
     <p class="chart_info"> 
     <a href="/artist/418560/lumineers">The Lumineers</a>   <br> 
     The Lumineers   </p> 

The artist name is The Lumineers.

>>> print str(soup.find("div", {"class" : re.compile(r'\bchart_listing')})\ 
... .find("p", {"class":"chart_info"}).a.string) 
Traceback (most recent call last): 
    File "<stdin>", line 1, in <module> 
AttributeError: 'NoneType' object has no attribute 'find' 

NoneType! It seems it can't find the data I want. Maybe my rule is wrong, so I tried to find some basic tags.

>>> print str(soup.find("div")) 
None 
>>> print str(soup.find("a")) 
None 
>>> print str(soup.find("title")) 
<title>The Hot 100 : Page 2 | Billboard</title> 
>>> print str(soup) 
......entire HTML..... 

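I'm confused: why can't it find basic tags like div? They are definitely there. What is wrong with my code? When I parse other charts the same way, there is no problem.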

Answer


This seems to be a BeautifulSoup 3 problem. If you prettify() the output:

from BeautifulSoup import BeautifulSoup as soup3 
import urllib2, re 

html = urllib2.urlopen('http://www.billboard.com/charts/hot-100?page=1').read() 
soup = soup3(html) 
print soup.prettify() 

you can see this at the end of the output:

 <script type="text/javascript" src="//assets.pinterest.com/js/pinit.js"></script> 
</body> 
</html> 
    </script> 
</head> 
</html> 

There are two closing </html> tags. It looks like BeautifulSoup 3 gets confused by the JavaScript in this data.
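One way to check whether the extra closing tags are already in the page as served, or only appear after parsing, is to count them in the raw `html` string and again in `soup.prettify()`. The following is a minimal sketch; the helper name `count_closing_tags` is hypothetical and not part of BeautifulSoup.

```python
import re

def count_closing_tags(markup, tag):
    """Count closing tags such as </html> in raw markup, ignoring case.

    Hypothetical helper for diagnosis only; it does plain text matching
    and does not parse the document.
    """
    pattern = re.compile(r'</\s*' + re.escape(tag) + r'\s*>', re.IGNORECASE)
    return len(pattern.findall(markup))
```

If `count_closing_tags(html, "html")` is lower than `count_closing_tags(soup.prettify(), "html")`, the duplicates were most likely introduced by the parser rather than by the site.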

If you use bs4 instead:

from bs4 import BeautifulSoup as soup4 
import urllib2, re 

html = urllib2.urlopen('http://www.billboard.com/charts/hot-100?page=1').read() 
soup = soup4(html) 
print str(soup.find("div", {"class" : re.compile(r'\bchart_listing')}).find("p", {"class":"chart_info"}).a.string) 

you get 'The Lumineers' as output.
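The AttributeError in the question comes from calling .find on the None returned by an earlier find. As a sketch of a more forgiving lookup, the hypothetical helper `find_path` below stops at the first step that finds nothing. It only assumes that each node has a BeautifulSoup-style `find(name, attrs)` method, which holds for both BeautifulSoup 3 and bs4.

```python
def find_path(root, steps):
    """Follow a chain of find() calls, returning None if any step fails.

    `steps` is a list of (tag_name, attrs) pairs. `root` is any object
    with a find(name, attrs) method, such as a soup or a tag.
    """
    node = root
    for name, attrs in steps:
        if node is None:
            return None
        node = node.find(name, attrs)
    return node
```

With the soup from above, `find_path(soup, [("div", {"class": re.compile(r'\bchart_listing')}), ("p", {"class": "chart_info"})])` would return the paragraph tag or None, so the caller can test the result before touching `.a.string`.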

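If you can't switch to bs4, I suggest writing the html variable to a file out.txt, changing the script to read from in.txt, then copying the output file to the input file and deleting blocks from it until the parse works: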

from BeautifulSoup import BeautifulSoup as soup3 
import re 

html = open('in.txt').read() 
soup = soup3(html) 
print str(soup.find("div", {"class" : re.compile(r'\bchart_listing')}).find("p", {"class":"chart_info"}).a.string) 
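Since the JavaScript is the suspect, one block to try deleting first is every script element. The sketch below does that with a regular expression instead of hand-editing in.txt. The helper name `drop_script_blocks` is hypothetical, and the regex is an assumption that is good enough for this kind of experiment: it would cut a block short if the script body itself contained the text `</script>`.

```python
import re

# Matches an opening <script ...> tag, its body, and the first closing tag.
SCRIPT_BLOCK = re.compile(r'<script\b[^>]*>.*?</script\s*>',
                          re.IGNORECASE | re.DOTALL)

def drop_script_blocks(markup):
    """Return markup with every <script>...</script> block removed."""
    return SCRIPT_BLOCK.sub('', markup)
```

Passing `drop_script_blocks(html)` to the parser instead of `html` shows whether the scripts alone are enough to explain the failure.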

My first guess was to delete <head> ... </head>, and that works wonders.

After that, you can fix it programmatically:

from BeautifulSoup import BeautifulSoup as soup3 
import urllib2, re 

htmlorg = urllib2.urlopen('http://www.billboard.com/charts/hot-100?page=1').read() 
head_start = htmlorg.index('<head') 
head_end = htmlorg.rindex('</head>') 
head_end = htmlorg.index('>', head_end) 
html = htmlorg[:head_start] + htmlorg[head_end+1:] 
soup = soup3(html) 
print str(soup.find("div", {"class" : re.compile(r'\bchart_listing')}).find("p", {"class":"chart_info"}).a.string)
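
The index and rindex calls above raise ValueError if the page has no head section, and '<head' also matches '<header'. As a slightly more defensive sketch of the same idea, the hypothetical helper `strip_head` below returns the markup unchanged when no head section is found. It is an illustration under those assumptions, not a general HTML cleaner.

```python
import re

HEAD_OPEN = re.compile(r'<head[\s>]', re.IGNORECASE)   # does not match <header>
HEAD_CLOSE = re.compile(r'</head\s*>', re.IGNORECASE)  # does not match </header>

def strip_head(markup):
    """Remove everything from the first <head> to the last </head>.

    Returns the markup unchanged if either tag is missing or if the
    last closing tag comes before the first opening tag.
    """
    opening = HEAD_OPEN.search(markup)
    closings = list(HEAD_CLOSE.finditer(markup))
    if opening is None or not closings:
        return markup
    last_close = closings[-1]
    if last_close.start() < opening.start():
        return markup
    return markup[:opening.start()] + markup[last_close.end():]
```

The result of `strip_head(htmlorg)` can then be passed to the parser in place of the hand-sliced `html` string.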