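I'm currently using Beautiful Soup to parse a web page. I've heard it's very well known and good, but it doesn't seem to work properly. Is BeautifulSoup unable to parse this page?
Here is what I did: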
import urllib2
from bs4 import BeautifulSoup
page = urllib2.urlopen("http://www.cnn.com/2012/10/14/us/skydiver-record-attempt/index.html?hpt=hp_t1")
soup = BeautifulSoup(page)
print soup.prettify()
I think this is pretty straightforward: I open the web page and pass it to Beautiful Soup. But here is what I got:
Warning (from warnings module):
File "C:\Python27\lib\site-packages\bs4\builder\_htmlparser.py", line 149
"Python's built-in HTMLParser cannot parse the given document. This is not a bug in Beautiful Soup. The best solution is to install an external parser (lxml or html5lib), and use Beautiful Soup with that parser. See http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser for help."))
...
HTMLParseError: bad end tag: u'</"+"script>', at line 634, column 94
I would expect the CNN website to be well built, so I'm not sure what is going on. Does anyone have an idea?
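For what it's worth, the warning itself suggests installing an external parser (lxml or html5lib) and using Beautiful Soup with it. Below is a minimal sketch of that idea, assuming Python 3 and that bs4 is installed. The helper names `pick_parser` and `parse` are made up for illustration and are not part of Beautiful Soup; the only Beautiful Soup call used is `BeautifulSoup(markup, parser_name)`.

```python
import importlib.util


def pick_parser(preferred=("lxml", "html5lib")):
    """Return the first installed external parser name, else the built-in one."""
    for name in preferred:
        # find_spec returns None when the package cannot be imported
        if importlib.util.find_spec(name) is not None:
            return name
    # Fall back to the parser from the standard library
    return "html.parser"


def parse(markup):
    """Parse markup with an explicitly chosen parser (hypothetical helper)."""
    # Imported lazily so pick_parser() is usable even without bs4 installed
    from bs4 import BeautifulSoup

    # Naming the parser explicitly avoids relying on whatever default is picked
    return BeautifulSoup(markup, pick_parser())
```

With something like this, `soup = parse(page.read())` would take the place of `BeautifulSoup(page)` in the code above. Whether lxml or html5lib copes with this particular page is something you would have to try.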
I don't have BS4 installed on my Python 2.7 installation, but this works without problems on 3.2 and 3.3. – poke
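The `</"+"script>` in the error message looks like inline JavaScript that builds a closing tag by string concatenation, and the traceback shows the Python 2.7 `HTMLParser` rejecting it as a bad end tag. The sketch below, assuming Python 3 and only the standard library, feeds a made-up snippet with the same construct to `html.parser.HTMLParser` to see how a newer parser treats it. `TagCollector`, `SNIPPET`, and `collect_start_tags` are hypothetical names, and the snippet is not taken from the CNN page.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record the names of the start tags the parser reports."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)


# Made-up markup: the inline script splits its closing tag as "</"+"script>"
SNIPPET = '<script>document.write("<script></"+"script>");</script><p>hello</p>'


def collect_start_tags(markup):
    """Feed markup to the stdlib parser and return the start tags it saw."""
    parser = TagCollector()
    parser.feed(markup)
    parser.close()
    return parser.start_tags


if __name__ == "__main__":
    print(collect_start_tags(SNIPPET))
```

On a current Python 3 this should run without raising and report only the outer `script` tag and the `p` tag, because the script body is treated as plain text. That would be consistent with the page parsing fine on 3.2 and 3.3.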