2010-07-07 35 views
9

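malformed start tag error - Python, BeautifulSoup, and Sipie - Ubuntu 10.04

I just installed python, mplayer, beautifulsoup and sipie on my Ubuntu 10.04 machine to run Sirius. I followed some seemingly straightforward documentation but ran into problems. I'm not familiar with Python, so this may be out of my league.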

I was able to get everything installed, but running sipie then gives this:

/usr/bin/Sipie/Sipie/Config.py:12: DeprecationWarning: the md5 module is deprecated; use hashlib instead
  import md5
Traceback (most recent call last):
  File "/usr/bin/Sipie/sipie.py", line 22, in <module>
    Sipie.cliPlayer()
  File "/usr/bin/Sipie/Sipie/cliPlayer.py", line 74, in cliPlayer
    completer = Completer(sipie.getStreams())
  File "/usr/bin/Sipie/Sipie/Factory.py", line 374, in getStreams
    streams = self.tryGetStreams()
  File "/usr/bin/Sipie/Sipie/Factory.py", line 298, in tryGetStreams
    soup = BeautifulSoup(data)
  File "/usr/local/lib/python2.6/dist-packages/BeautifulSoup-3.1.0.1-py2.6.egg/BeautifulSoup.py", line 1499, in __init__
    BeautifulStoneSoup.__init__(self, *args, **kwargs)
  File "/usr/local/lib/python2.6/dist-packages/BeautifulSoup-3.1.0.1-py2.6.egg/BeautifulSoup.py", line 1230, in __init__
    self._feed(isHTML=isHTML)
  File "/usr/local/lib/python2.6/dist-packages/BeautifulSoup-3.1.0.1-py2.6.egg/BeautifulSoup.py", line 1263, in _feed
    self.builder.feed(markup)
  File "/usr/lib/python2.6/HTMLParser.py", line 108, in feed
    self.goahead(0)
  File "/usr/lib/python2.6/HTMLParser.py", line 148, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib/python2.6/HTMLParser.py", line 226, in parse_starttag
    endpos = self.check_for_whole_start_tag(i)
  File "/usr/lib/python2.6/HTMLParser.py", line 301, in check_for_whole_start_tag
    self.error("malformed start tag")
  File "/usr/lib/python2.6/HTMLParser.py", line 115, in error
    raise HTMLParseError(message, self.getpos())
HTMLParser.HTMLParseError: malformed start tag, at line 100, column 3

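I looked through these files and line numbers, but since I'm not familiar with Python, they didn't make much sense to me. Any suggestions on what to do next?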

Answers

8

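The problem you're hitting is very common, and it is specifically about malformed HTML. In my case, an HTML element had double-quoted the value of one of its attributes. I actually ran into this problem today and came across your post while working on it. I was finally able to parse the HTML by running it through html5lib before handing it off to BeautifulSoup 4.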
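If you want to check whether your page has the same defect, a rough scan like the one below may help. It is only a sketch: the helper name `find_double_quoted_attributes` and the regular expression are my own assumptions, not part of BeautifulSoup or html5lib, and the pattern only catches the simple `attr=""value""` case.

```python
import re

# Heuristic pattern (an assumption, not a full HTML grammar): an attribute
# name, "=", then a value wrapped in doubled double quotes, e.g. href=""x"".
DOUBLE_QUOTED_ATTR = re.compile(r'[\w-]+\s*=\s*""[^"\s>]+""')


def find_double_quoted_attributes(markup):
    """Return (line_number, matched_text) pairs for suspicious attributes."""
    hits = []
    for number, line in enumerate(markup.splitlines(), start=1):
        for match in DOUBLE_QUOTED_ATTR.finditer(line):
            hits.append((number, match.group(0)))
    return hits


# Made-up sample markup, for illustration only.
sample = '<p>ok</p>\n<a href=""http://example.com"">x</a>\n<img src="a.png">'
for number, text in find_double_quoted_attributes(sample):
    print("line %d: %s" % (number, text))
```

Normally quoted attributes such as `src="a.png"` and empty ones such as `alt=""` are not reported, so any hit is worth a look by hand.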

First, before tackling the problem, you need to install:

sudo easy_install bs4 
sudo apt-get install python-html5lib 

Then run this example code:

from bs4 import BeautifulSoup 
import html5lib 
from html5lib import sanitizer 
from html5lib import treebuilders 
import urllib 

url = 'http://the-url-to-scrape' 
fp = urllib.urlopen(url) 

# Create an html5lib parser. Not sure if the sanitizer is required. 
parser = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder("beautifulsoup"), tokenizer=sanitizer.HTMLSanitizer) 
# Load the page's HTML into html5lib (fp is the handle opened above) 
html5lib_object = parser.parse(fp) 
# In theory we shouldn't need to convert this to a string before passing to BS. Didn't work passing directly to BS for me however. 
html_string = str(html5lib_object) 

# Load the string into BeautifulSoup for parsing. 
soup = BeautifulSoup(html_string) 

for content in soup.findAll('div'): 
    print content 

If you have any questions about this code or need more detailed guidance, let me know. :)

+2

I get 'ValueError: Unrecognised treebuilder "beautifulsoup"' (Python 2.7.5, beautifulsoup 4.3.2, html5lib 0.999) – 2014-03-16 16:20:43

-2

Look at line 100, column 3 of the "data" referred to in file "/usr/bin/Sipie/Sipie/Factory.py", line 298.
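One way to do that, sketched below, is to get hold of the markup as a string (for instance by temporarily saving `data` to a file just before it is passed to BeautifulSoup) and print the lines around the reported position. The helper `show_parse_position` is hypothetical, and it assumes the position follows the HTMLParser.getpos() convention of a 1-based line and a 0-based column.

```python
def show_parse_position(markup, line, column, context=1):
    """Return the lines around a parser-reported (line, column) position.

    `line` is 1-based and `column` is 0-based, as assumed above.
    """
    lines = markup.splitlines()
    first = max(line - 1 - context, 0)
    last = min(line + context, len(lines))
    out = []
    for index in range(first, last):
        out.append("%4d: %s" % (index + 1, lines[index]))
        if index + 1 == line:
            # The "%4d: " prefix is 6 characters wide, so shift the caret.
            out.append(" " * (6 + column) + "^")
    return "\n".join(out)


# Made-up markup standing in for the real `data`.
example = '<html>\n<body>\n  <a href=""x"">link</a>\n</body>\n</html>'
print(show_parse_position(example, 3, 2))
```

For the error in the question you would call it with `line=100, column=3` on the real markup.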

+0

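I see what you mean, but I'm having a hard time finding that data... still searching. I'm still not familiar with how all these programs work together... any other tips? – nicorellius 2010-07-08 14:56:20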

2

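Newer versions of BeautifulSoup use HTMLParser rather than SGMLParser (because SGMLParser was removed from the Python 3.0 standard library). As a result, BeautifulSoup can no longer properly handle many malformed HTML documents, which is what I believe you are running into here.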

One way to solve the problem is probably to uninstall BeautifulSoup and install an older version (which will still work on Ubuntu 10.04 LTS with Python 2.6):

sudo apt-get remove python-beautifulsoup 
sudo easy_install -U "BeautifulSoup==3.0.7a" 

Be aware that this temporary workaround will no longer work with Python 3.0 (which may become the default in future Ubuntu releases).

15

Assuming you are using BeautifulSoup4, I found something relevant in the official documentation: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser

If you’re using a version of Python 2 earlier than 2.7.3, or a version of Python 3 earlier than 3.2.2, it’s essential that you install lxml or html5lib–Python’s built-in HTML parser is just not very good in older versions.

I tried this and it worked well, just like what @Joshua described:

soup = BeautifulSoup(r.text, 'html5lib') 
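If you are not sure which parsers are installed on a given machine, a small check like this can choose one before calling BeautifulSoup. It is a sketch only: `pick_parser` is a hypothetical helper, and the preference order (html5lib, then lxml, then the built-in parser) is my assumption based on the documentation quoted above.

```python
import importlib.util


def pick_parser():
    """Return the name of the preferred HTML parser that is installed."""
    # Third-party parsers first; each name is both the module to look for
    # and the parser name that BeautifulSoup accepts.
    for name in ("html5lib", "lxml"):
        if importlib.util.find_spec(name) is not None:
            return name
    # Python's built-in parser is always available as a last resort.
    return "html.parser"


print(pick_parser())
# Usage with bs4 (assumes beautifulsoup4 is installed):
#   soup = BeautifulSoup(r.text, pick_parser())
```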
+1

+1, nice find! – 2012-09-26 14:47:05

+0

Is the "r" in the code above an HTML object from the requests library? Either way, this great one-liner also works like a charm with the pycurl library. +1 – FredTheWebGuy 2013-07-17 06:46:47

+1

@Dreadful_Code: r = requests.get(url) – dannyroa 2013-09-17 18:06:15

2

Command line:

$ pip install beautifulsoup4 
$ pip install html5lib 

Python 3:

from bs4 import BeautifulSoup 
from urllib.request import urlopen 

url = 'http://www.example.com' 
page = urlopen(url) 
soup = BeautifulSoup(page.read(), 'html5lib') 
links = soup.findAll('a') 

for link in links: 
    # .get() returns None instead of raising KeyError when an <a> has no href 
    print(link.string, link.get('href')) 
+0

@Ryan Allen I'm also getting the malformed start tag message, but I need to parse an HTML file saved to disk rather than an opened URL. Is there a way to do that? – ShaunO 2017-06-30 19:59:02

+0

You just open the file instead of using urlopen: 'page = open('your/file/path/')' – 2017-07-05 17:49:42
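Spelled out a little more, that could look like the sketch below. The function names are hypothetical, UTF-8 is an assumed encoding, and `parse_html_file` assumes beautifulsoup4 and html5lib are installed as in the answer above.

```python
def read_html_file(path, encoding="utf-8"):
    """Return the text of an HTML file saved on disk."""
    # errors="replace" keeps going on stray bytes instead of raising.
    with open(path, encoding=encoding, errors="replace") as handle:
        return handle.read()


def parse_html_file(path):
    """Parse a saved HTML file with BeautifulSoup and html5lib."""
    # Imported here so read_html_file() works even without bs4 installed.
    from bs4 import BeautifulSoup
    return BeautifulSoup(read_html_file(path), "html5lib")
```

Opening the file in a `with` block closes it as soon as the text has been read, so nothing is left open if parsing fails later.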