BeautifulSoup: malformed start tag?

I am trying to convert a WordPress XML export to Octopress; the migration relies in part on BeautifulSoup.

When I run exitwp, I get the following output:

writing......................................................Traceback (most recent call last):
  File "exitwp.py", line 293, in <module>
    write_jekyll(data, target_format)
  File "exitwp.py", line 284, in write_jekyll
    out.write(html2fmt(i['body'], target_format))
  File "exitwp.py", line 45, in html2fmt
    return html2text(html, '')
  File "/Users/kevinquillen/Documents/workspace/exitwp2/html2text.py", line 700, in html2text
    return optwrap(html2text_file(html, None, baseurl))
  File "/Users/kevinquillen/Documents/workspace/exitwp2/html2text.py", line 695, in html2text_file
    h.feed(html)
  File "/Users/kevinquillen/Documents/workspace/exitwp2/html2text.py", line 285, in feed
    HTMLParser.HTMLParser.feed(self, data)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/HTMLParser.py", line 108, in feed
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/HTMLParser.py", line 148, in goahead
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/HTMLParser.py", line 229, in parse_starttag
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/HTMLParser.py", line 304, in check_for_whole_start_tag
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/HTMLParser.py", line 115, in error
HTMLParser.HTMLParseError: malformed start tag, at line 1, column 64

I have tried BeautifulSoup 3.2.0 and 3.0.7a without much luck.

I also tried exporting different date ranges of posts, but I still get the same error on line 1; only the column number changes.

The only cause I can think of is that some older posts contain AdSense code. Beyond that, how can I easily track down which content it is choking on?

This is Python 2.7 on OS X 10.7.

Edit: it also happens with a dump of pages that contain no bad markup (only 2 pages).

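Update: it seems not to like anchor tags. The tags look like the one below, which is a very basic content link. If I remove them, it compiles without errors. Why doesn't it like this HTML?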

<a href="http://www.google.com" target="_blank">Google</a>


2012-01-03 Kevin

Can you add an example of the XML that doesn't work for you? – jcollado 2012-01-03 08:29:16

Modify the code like this (in html2text.py):

try:
    HTMLParser.HTMLParser.feed(self, data)
except HTMLParser.HTMLParseError:
    # Show the exact input the parser choked on, then re-raise the error.
    print 'malformed data: %r' % data
    raise

I think you will see that 'data' contains something odd. If not, please add the data to your question.
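Once the offending data has been printed, the position from the error message ("at line 1, column 64") can be used to zoom in on the exact spot. The following is only a minimal sketch: parse_position and context_around are hypothetical helper names, not part of html2text or exitwp, and it assumes that the line and column in the message are both 1-based and refer to the string that was passed to feed (a single post body), not to the XML file.

```python
import re


def parse_position(message):
    """Extract (line, column) from a message like '..., at line 1, column 64'."""
    match = re.search(r"at line (\d+), column (\d+)", message)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def context_around(text, line, column, width=30):
    """Return up to `width` characters on each side of the reported position.

    Assumes `line` and `column` are 1-based, as printed in the error message.
    """
    target = text.split("\n")[line - 1]
    index = column - 1
    start = max(0, index - width)
    return target[start:index + width]
```

In the except block above, something like context_around(data, *parse_position(str(e))) would then show only the neighbourhood of the failing tag instead of the whole post, provided the exception is bound to a name such as e.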


2012-01-03 08:34:22 guettli

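It seems not to like anchor tags. Tags like <a href="http://www.google.com" target="_blank">Google</a> are very basic. If I remove them, it compiles correctly. Why doesn't it like this HTML? – Kevin 2012-01-03 13:18:03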

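Of course BeautifulSoup can parse anchor tags; I use it often. Something else is most likely broken. Please post your data. Do the attribute values contain newlines or "<" characters? – guettli 2012-01-03 14:29:42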
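The question about attribute values can be checked mechanically. The sketch below is a rough heuristic only: suspicious_attributes is a hypothetical helper, and the regular expression simply looks for quoted name="value" pairs anywhere in the text, so it may also report matches outside of real tags. It is meant for narrowing down a post body, not as an HTML validator.

```python
import re

# Matches name="value" or name='value'; DOTALL lets a value span several lines.
ATTRIBUTE = re.compile(r"""([\w:-]+)\s*=\s*(["'])(.*?)\2""", re.DOTALL)


def suspicious_attributes(html):
    """Return (name, value) pairs whose value contains a newline or '<'."""
    return [
        (m.group(1), m.group(3))
        for m in ATTRIBUTE.finditer(html)
        if "\n" in m.group(3) or "<" in m.group(3)
    ]
```

For the anchor tag quoted in the question this returns an empty list, which is consistent with the suspicion that something other than the tag itself is broken.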

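Oddly, I put them back into the XML file and ran it again, and this time there was no error. I will take a fresh dump from the same WordPress site and try again. – Kevin 2012-01-03 14:55:51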
