malformed start tag error - Python, BeautifulSoup, and Sipie - Ubuntu 10.04

I just installed python, mplayer, beautifulsoup and sipie on my Ubuntu 10.04 machine to run Sirius. I followed some docs that seemed straightforward, but ran into problems. I'm not that familiar with Python, so this may be out of my league.

I got everything installed, but running sipie then gives this:

/usr/bin/Sipie/Sipie/Config.py:12: DeprecationWarning: the md5 module is deprecated; use hashlib instead
  import md5
Traceback (most recent call last):
  File "/usr/bin/Sipie/sipie.py", line 22, in <module>
    Sipie.cliPlayer()
  File "/usr/bin/Sipie/Sipie/cliPlayer.py", line 74, in cliPlayer
    completer = Completer(sipie.getStreams())
  File "/usr/bin/Sipie/Sipie/Factory.py", line 374, in getStreams
    streams = self.tryGetStreams()
  File "/usr/bin/Sipie/Sipie/Factory.py", line 298, in tryGetStreams
    soup = BeautifulSoup(data)
  File "/usr/local/lib/python2.6/dist-packages/BeautifulSoup-3.1.0.1-py2.6.egg/BeautifulSoup.py", line 1499, in __init__
    BeautifulStoneSoup.__init__(self, *args, **kwargs)
  File "/usr/local/lib/python2.6/dist-packages/BeautifulSoup-3.1.0.1-py2.6.egg/BeautifulSoup.py", line 1230, in __init__
    self._feed(isHTML=isHTML)
  File "/usr/local/lib/python2.6/dist-packages/BeautifulSoup-3.1.0.1-py2.6.egg/BeautifulSoup.py", line 1263, in _feed
    self.builder.feed(markup)
  File "/usr/lib/python2.6/HTMLParser.py", line 108, in feed
    self.goahead(0)
  File "/usr/lib/python2.6/HTMLParser.py", line 148, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib/python2.6/HTMLParser.py", line 226, in parse_starttag
    endpos = self.check_for_whole_start_tag(i)
  File "/usr/lib/python2.6/HTMLParser.py", line 301, in check_for_whole_start_tag
    self.error("malformed start tag")
  File "/usr/lib/python2.6/HTMLParser.py", line 115, in error
    raise HTMLParseError(message, self.getpos())
HTMLParser.HTMLParseError: malformed start tag, at line 100, column 3

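I looked through these files and line numbers, but since I'm not familiar with Python, they didn't make much sense to me. Any suggestions on what to do next?

Source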

2010-07-07 nicorellius

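The problem you're running into is pretty common, and it comes down to malformed HTML. In my case, an HTML element had double-quoted an attribute's value. I actually hit this issue today and came across your post while working on it. I was finally able to resolve it by parsing the HTML with html5lib before handing it off to BeautifulSoup 4.

First, to work around this issue, you will need to: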

sudo easy_install bs4 
sudo apt-get install python-html5lib

Then, run this example code:

from bs4 import BeautifulSoup 
import html5lib 
from html5lib import sanitizer 
from html5lib import treebuilders 
import urllib 

url = 'http://the-url-to-scrape' 
fp = urllib.urlopen(url) 

# Create an html5lib parser. Not sure if the sanitizer is required. 
parser = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder("beautifulsoup"), tokenizer=sanitizer.HTMLSanitizer) 
# Load the fetched HTML into html5lib (the open handle is named fp above)
html5lib_object = parser.parse(fp)
# In theory we shouldn't need to convert this to a string before passing to BS. Didn't work passing directly to BS for me however. 
html_string = str(html5lib_object) 

# Load the string into BeautifulSoup for parsing. 
soup = BeautifulSoup(html_string) 

for content in soup.findAll('div'): 
    print content

If you have any questions about this code or need more detailed guidance, let me know. :)

Source

2012-02-10 18:22:04

I get 'ValueError: Unrecognised treebuilder "beautifulsoup"' (Python 2.7.5, beautifulsoup 4.3.2, html5lib 0.999) – 2014-03-16 16:20:43

-2

Look at line 100, column 3 of the "data" referred to in File "/usr/bin/Sipie/Sipie/Factory.py", line 298.
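If it helps, here is a rough sketch of how one might look at that spot once the `data` string has been dumped somewhere (for example by printing it just before the `BeautifulSoup(data)` call). The helper name `show_context` and the sample markup are made up for illustration; only the line and column numbers come from the error message. I'm assuming the column in the message is 1-based.

```python
def show_context(markup, lineno, column, radius=2):
    """Return the lines around a parser-reported position as one string.

    lineno and column are taken as 1-based, the way they are printed in
    the error message (an assumption; shift by one if your parser counts
    columns from zero).
    """
    lines = markup.splitlines()
    first = max(lineno - 1 - radius, 0)
    last = min(lineno + radius, len(lines))
    out = []
    for idx in range(first, last):
        marker = ">>" if idx == lineno - 1 else "  "
        out.append("%s %4d: %s" % (marker, idx + 1, lines[idx]))
        if idx == lineno - 1:
            # The prefix above is 9 characters wide, so offset the caret by that.
            out.append(" " * (9 + column - 1) + "^")
    return "\n".join(out)


# Hypothetical stand-in for the real page; the actual input is whatever
# string reaches BeautifulSoup(data) in Factory.py.
sample = '<html>\n<body>\n<a href=""x"">link</a>\n</body>\n</html>'
print(show_context(sample, 3, 3, radius=1))
```

For the error in the question you would call it as `show_context(data, 100, 3)` and look at the tag under the caret.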

Source

2010-07-07 21:23:27

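I see what you mean, but I'm having a hard time finding that data... still searching. I'm still not familiar with how all these programs work together... any other hints? – nicorellius 2010-07-08 14:56:20

Newer versions of BeautifulSoup (such as the 3.1.0.1 in your traceback) use HTMLParser rather than SGMLParser, because SGMLParser was removed from the Python 3.0 standard library. As a result, BeautifulSoup can no longer correctly handle many malformed HTML documents, which is what I believe you are running into here.

One way to solve the problem is probably to uninstall BeautifulSoup and install an older version (which will still work on Ubuntu 10.04 LTS with Python 2.6):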

sudo apt-get remove python-beautifulsoup 
sudo easy_install -U "BeautifulSoup==3.0.7a"

Be aware that this temporary workaround will not work with Python 3.0 (which may become the default in future Ubuntu releases).
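On the HTMLParser point above: the module at the bottom of the traceback is the Python 2.6 standard-library parser, which gave up with HTMLParseError on input like this. As far as I know, its Python 3 successor (`html.parser`) dropped that exception in 3.5 and tries to recover instead. A minimal sketch to check this on your own interpreter, using made-up markup and a throwaway `TagCollector` class that is not part of any library:

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record the name of every start tag the parser reports."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


# Hypothetical markup with a doubled quote around an attribute value.
broken = '<p class=""intro"">Hello</p><div>still parsed</div>'

collector = TagCollector()
collector.feed(broken)
collector.close()
print(collector.tags)
```

If the list includes the `div` that follows the broken tag, the parser kept going past the malformed attribute rather than raising.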

Source

2010-08-29 04:09:49

Assuming you are using BeautifulSoup4, I found something relevant in the official documentation: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser

If you’re using a version of Python 2 earlier than 2.7.3, or a version of Python 3 earlier than 3.2.2, it’s essential that you install lxml or html5lib–Python’s built-in HTML parser is just not very good in older versions.

I tried this and it worked well, much like what @Joshua suggested:

soup = BeautifulSoup(r.text, 'html5lib')
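To check whether the quoted advice applies to your interpreter, the version cut-offs from the documentation can be written down as a small test. This is only a sketch of the rule as quoted above; the function name `needs_external_parser` is mine, and (2, 6, 5) is an assumed stand-in for the Python 2.6 shown in the traceback.

```python
import sys


def needs_external_parser(version=None):
    """True if the quoted docs say to install lxml or html5lib for this Python."""
    v = tuple(version if version is not None else sys.version_info)[:3]
    if v[0] == 2:
        return v < (2, 7, 3)
    if v[0] == 3:
        return v < (3, 2, 2)
    return False


# Assumed patch level for the Python 2.6 seen in the traceback.
print(needs_external_parser((2, 6, 5)))
# The interpreter running this script.
print(needs_external_parser())
```

Passing 'html5lib' explicitly, as in the line above, sidesteps the question because it never falls back to the built-in parser.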

Source

2012-04-30 03:11:10 Drake

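+1, nice find! – 2012-09-26 14:47:05

Is the "r" in the code above the HTML object from the requests library? Either way, this great one-liner also works like a charm with the pycurl library. +1 – FredTheWebGuy 2013-07-17 06:46:47

@Dreadful_Code: r = requests.get(url) – dannyroa 2013-09-17 18:06:15

Command line: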

$ pip install beautifulsoup4 
$ pip install html5lib

Python 3:

from bs4 import BeautifulSoup 
from urllib.request import urlopen 

url = 'http://www.example.com' 
page = urlopen(url) 
soup = BeautifulSoup(page.read(), 'html5lib') 
links = soup.find_all('a')

for link in links:
    # .get() returns None for anchors without an href instead of raising KeyError
    print(link.string, link.get('href'))

Source

2014-03-16 16:52:57

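@Ryan Allen I'm also getting the malformed start tag message, but I need to parse HTML files saved to disk rather than an opened URL. Is there a way to do that? – ShaunO 2017-06-30 19:59:02

You just open the file instead of using urlopen: page = open('your/file/path/') – 2017-07-05 17:49:42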
