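A strange problem when trying to parse HTML with BeautifulSoup

I am writing some Python code to collect music chart data from official websites, but I ran into trouble when collecting Billboard's data. I chose BeautifulSoup to handle the HTML.

My environment: Python 2.7, BeautifulSoup 3.2.0

First, I parse the HTML: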

>>> import BeautifulSoup, urllib2, re 
>>> html = urllib2.urlopen('http://www.billboard.com/charts/hot-100?page=1').read() 
>>> soup = BeautifulSoup.BeautifulSoup(html)

Then I try to collect the data I want, for example the artist name.

HTML:

<div class="listing chart_listing"> 

<article id="node-1491420" class="song_review no_category chart_albumTrack_detail no_divider"> 
    <header> 
    <span class="chart_position position-down">11</span> 
      <h1>Ho Hey</h1> 
     <p class="chart_info"> 
     <a href="/artist/418560/lumineers">The Lumineers</a>   <br> 
     The Lumineers   </p>

The artist name is The Lumineers.

>>> print str(soup.find("div", {"class" : re.compile(r'\bchart_listing')})\ 
... .find("p", {"class":"chart_info"}).a.string) 
Traceback (most recent call last): 
    File "<stdin>", line 1, in <module> 
AttributeError: 'NoneType' object has no attribute 'find'

NoneType! It seems it cannot grab the data I want. Maybe my rule is wrong, so I tried to grab some basic tags instead.

>>> print str(soup.find("div")) 
None 
>>> print str(soup.find("a")) 
None 
>>> print str(soup.find("title")) 
<title>The Hot 100 : Page 2 | Billboard</title> 
>>> print str(soup) 
......entire HTML.....

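I am confused: why can't it find basic tags like div? They are definitely there. What is wrong with my code? I have no problem when I parse other charts this way.

Source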

2013-03-02 Jash Yin

This seems to be a BeautifulSoup 3 problem. If you prettify() the output:

from BeautifulSoup import BeautifulSoup as soup3 
import urllib2, re 

html = urllib2.urlopen('http://www.billboard.com/charts/hot-100?page=1').read() 
soup = soup3(html) 
print soup.prettify()

you can see this at the end of the output:

 <script type="text/javascript" src="//assets.pinterest.com/js/pinit.js"></script> 
</body> 
</html> 
    </script> 
</head> 
</html>

There are two closing </html> tags. It looks like BeautifulSoup 3 is confused by the JavaScript in this data.
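One quick way to confirm this kind of damage without reading the whole dump is to count closing tags in the text. The sketch below uses only the standard library. `count_closing_tags` is a hypothetical helper name, not part of BeautifulSoup, and the sample string is just the tail of the prettified output quoted above; in practice you would pass it the string returned by `soup.prettify()`.

```python
import re

def count_closing_tags(text, tag):
    """Count case-insensitive closing tags such as </html> in text."""
    pattern = re.compile(r"</\s*" + re.escape(tag) + r"\s*>", re.IGNORECASE)
    return len(pattern.findall(text))

# Tail of the prettified output quoted above, used here as sample input.
sample = """<script type="text/javascript" src="//assets.pinterest.com/js/pinit.js"></script>
</body>
</html>
    </script>
</head>
</html>"""

print(count_closing_tags(sample, "html"))  # 2
print(count_closing_tags(sample, "head"))  # 1
```

A count above 1 for `html` or `body` is a hint that the parser reassembled the document incorrectly.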

If you use bs4 instead:

from bs4 import BeautifulSoup as soup4 
import urllib2, re 

html = urllib2.urlopen('http://www.billboard.com/charts/hot-100?page=1').read() 
soup = soup4(html) 
print str(soup.find("div", {"class" : re.compile(r'\bchart_listing')}).find("p", {"class":"chart_info"}).a.string)

you get 'The Lumineers' as output.

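If you cannot switch to bs4, I suggest writing the html variable to a file out.txt, then changing the script to read from in.txt. Copy the output file to the input file and cut away chunks until the parse succeeds.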

from BeautifulSoup import BeautifulSoup as soup3 
import re 

html = open('in.txt').read() 
soup = soup3(html) 
print str(soup.find("div", {"class" : re.compile(r'\bchart_listing')}).find("p", {"class":"chart_info"}).a.string)

My first guess was to remove <head> ... </head>, and that works wonders.
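The manual cut-and-retry loop can also be roughed out in code. The sketch below is only an illustration, not part of the original answer: `find_culprit` and `fake_parses_ok` are hypothetical names, the sample markup is made up, and the predicate is a stand-in so the example runs without BeautifulSoup. With BeautifulSoup 3 the predicate might instead check that `soup.find("div")` is not `None` after parsing the candidate text.

```python
import re

def find_culprit(html, parses_ok, pattern=r'<script\b.*?</script>'):
    """Return the first block whose removal makes parses_ok(text) true, else None.

    parses_ok is a caller-supplied predicate (an assumption of this sketch).
    """
    for match in re.finditer(pattern, html, re.DOTALL | re.IGNORECASE):
        # Cut this one block out and test what is left.
        candidate = html[:match.start()] + html[match.end():]
        if parses_ok(candidate):
            return match.group(0)
    return None

# Made-up page: the second script contains markup inside a string literal.
sample = ('<html><head><script>var a = 1;</script>'
          '<script>document.write("</html>");</script></head>'
          '<body><div>chart</div></body></html>')

def fake_parses_ok(text):
    # Stand-in for a real parse check: pretend the parser breaks on document.write.
    return 'document.write' not in text

print(find_culprit(sample, fake_parses_ok))
# <script>document.write("</html>");</script>
```

Changing `pattern` lets the same loop try other chunks, such as the whole `<head>` section.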

After that, you can solve it programmatically:

from BeautifulSoup import BeautifulSoup as soup3 
import urllib2, re 

htmlorg = urllib2.urlopen('http://www.billboard.com/charts/hot-100?page=1').read() 
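# Locate the first '<head' and the last '</head>' in the raw page.
head_start = htmlorg.index('<head')
head_end = htmlorg.rindex('</head>')
# Move to the '>' that closes that final </head> tag.
head_end = htmlorg.index('>', head_end)
# Drop the whole head section before parsing.
html = htmlorg[:head_start] + htmlorg[head_end+1:]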
soup = soup3(html) 
print str(soup.find("div", {"class" : re.compile(r'\bchart_listing')}).find("p", {"class":"chart_info"}).a.string)
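The same slicing can be wrapped in a small function that also copes with pages that have no `<head>` at all, where `index` and `rindex` would raise `ValueError`. This is only a sketch: `strip_head` is a hypothetical name, the sample markup is made up, and the regular expression is an added precaution so that `<header>` is not mistaken for `<head>`.

```python
import re

def strip_head(html):
    """Remove everything from the first <head> tag to the last </head> tag.

    If either marker is missing, the input is returned unchanged.
    """
    start_match = re.search(r'<head[\s>]', html)
    end = html.rfind('</head>')
    if start_match is None or end < start_match.start():
        return html
    start = start_match.start()
    # Skip past the '>' that closes the final </head> tag.
    end = html.index('>', end) + 1
    return html[:start] + html[end:]

# Made-up sample: a script string that contains an extra </head>.
sample = ('<html><head><script>var s = "</head>";</script></head>'
          '<body><div class="listing chart_listing"></div></body></html>')

print(strip_head(sample))
# <html><body><div class="listing chart_listing"></div></body></html>
```

The result can then be passed to the BeautifulSoup 3 constructor in place of the raw page.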

Source

2013-03-03 08:35:09 Anthon
