BeautifulSoup is not reading the document correctly

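I'm trying to scrape NBA player statistics so that I can run some machine learning on them, and I found these "printable player files" that have a bunch of stats laid out nice and neat. Unfortunately, when I try to use BeautifulSoup to parse the HTML, it doesn't work at all. For example: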

from bs4 import BeautifulSoup 
import codecs 
import urllib2 

url = 'http://www.nba.com/playerfile/ray_allen/printable_player_files.html' 
html = urllib2.urlopen(url).read() 
soup = BeautifulSoup(html) 

with open('ray_allen.txt', 'w') as f: 
    f.write(soup.prettify()) 
    f.close()

gives me a file that looks like this:

<html> 
<head> 
    <!--no description was found--> 
    <!--no title was found--> 
    <!--no keywords found--> 
    <!--not article--> 
    <script> 
    var site = "nba"; 
var page = "player"; 
    </script> 
    <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/> 
    <script language="Javascript"> 
    &lt;!-- 
var flashinstalled = 0; 
var flashversion = 0; 
MSDetect = "false"; 
if (navigator.plugins &amp;&amp; navigator.plugins.length) { 
    x = navigator.plugins["Shockwave Flash"]; 
    if (x) { 
     flashinstalle d  =  2 ; 

      i f  ( x . d e s c r i p t i o n )  { 

       y  =  x . d e s c r i p t i o n ; 

       f l a s h v e r s i o n  =  y . c h a r A t ( y . i n d e x O f ( ' . ' ) - 1 ) ; 

      } 

     }  e l s e 

      f l a s h i n s t a l l e d  =  1 ; 

     i f  ( n a v i g a t o r . p l u g i n s [ " S h o c k w a v e  F l a s h  2 . 0 " ] )  { 

      f l a s h i n s t a l l e d  =  2 ; 

      f l a s h v e r s i o n  =  2 ; 

     } 
[...]

It continues with the same spaced-out formatting for another 3000+ lines (the [...] was added by me):

[...] 
    &lt; / b o d y &gt; 

    &lt; / h t m l &gt; 
    </script> 
</head> 
</html>

I also tried 'http://www.basketball-reference.com/players/a/allenra02.html' instead, and that one gave me this error:

Traceback (most recent call last):
  File "test.py", line 9, in <module>
    f.write(soup.prettify())
UnicodeEncodeError: 'ascii' codec can't encode character u'\xb7' in position 6167: ordinal not in range(128)

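Maybe I should use something else to parse the HTML? Or is one of these problems easy to fix? Everything I've read here seems to suggest that using BeautifulSoup should make this easier, not harder!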

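Edit: the line:

print soup.prettify()

works in the terminal for the second page, so something goes wrong when it tries to write to the file. That part is not a BeautifulSoup problem.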
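
Since printing works but writing to the file fails, one likely explanation is that the unicode string returned by prettify() is being pushed through the default 'ascii' codec on its way into the file. The following is a minimal sketch under that assumption, not a confirmed diagnosis. The helper name write_text_utf8 and the sample string are hypothetical stand-ins (the sample only needs to contain U+00B7, the character named in the traceback), and io.open is used because it accepts an encoding argument on both Python 2.6+ and Python 3.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def write_text_utf8(path, text):
    """Write a unicode string to `path`, encoding it as UTF-8."""
    # Opening the file with an explicit encoding means the text is never
    # implicitly encoded with the default 'ascii' codec.
    with io.open(path, 'w', encoding='utf-8') as f:
        f.write(text)


# Hypothetical stand-in for soup.prettify(): a unicode string that
# contains U+00B7, the character from the UnicodeEncodeError above.
sample = u'<p>Ray Allen \xb7 stats</p>\n'

out_path = os.path.join(tempfile.mkdtemp(), 'ray_allen.txt')
write_text_utf8(out_path, sample)

# Read the file back with the same encoding to confirm the round trip.
with io.open(out_path, 'r', encoding='utf-8') as f:
    round_tripped = f.read()
```

In the script from the question, the same idea would mean opening 'ray_allen.txt' through io.open (or codecs.open, which is already imported there) with encoding='utf-8' before calling f.write(soup.prettify()).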

2012-07-06 mavix

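What version of Python are you running? – Falmarri 2012-07-06 00:58:12

What version of BeautifulSoup? I know the most recent one has had some issues. – Trickfire 2012-07-06 00:59:28

Wait a minute. What makes you think this isn't working? What does the rest of the HTML file look like? That is how the HTML page starts when I view the source. – Falmarri 2012-07-06 01:02:39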

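This looks like a bug in BeautifulSoup 4 to me.

I tried it with BeautifulSoup 3 (as packaged in Ubuntu) by changing from bs4 import BeautifulSoup to from BeautifulSoup import BeautifulSoup, and it worked as expected. When I used v4 (without changing your code), I reproduced your problem. The bug appears to be in the parser rather than in prettify, since printing the soup object shows the same problem.

Please report it as a bug at https://bugs.launchpad.net/beautifulsoup/. In the meantime, use version 3.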

2012-07-06 02:08:21

This exhibits the same symptoms as bug 972466, which was fixed in 4.0.3. I recommend upgrading to the latest version of Beautiful Soup 4.
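
To see whether an installed copy already includes that fix, the version string can be compared against 4.0.3. This is only a rough sketch: parse_version and has_fix are hypothetical helper names introduced here, the comparison looks only at the leading digits of each dotted component, and the one library attribute it relies on is bs4.__version__.

```python
import re


def parse_version(version):
    """Turn a dotted version string such as '4.0.3' into a tuple of ints."""
    parts = []
    for piece in version.split('.'):
        # Keep only the leading digits, so a piece like '0b10' counts as 0.
        match = re.match(r'\d+', piece)
        parts.append(int(match.group()) if match else 0)
    return tuple(parts)


def has_fix(version, fixed_in='4.0.3'):
    """Return True if `version` is at least the release named in the answer."""
    return parse_version(version) >= parse_version(fixed_in)


try:
    import bs4
    installed = bs4.__version__
except ImportError:
    # Beautiful Soup 4 is not available in this environment.
    installed = None

if installed is None:
    print('Beautiful Soup 4 is not installed')
elif has_fix(installed):
    print('bs4 %s is at or above 4.0.3' % installed)
else:
    print('bs4 %s is older than 4.0.3; consider upgrading' % installed)
```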

2012-07-06 02:48:19
