BeautifulStoneSoup - how to unescape and add closing tags

I'm editing the original post to clarify it, and hopefully I've boiled it down to something more manageable. I have an XML string that looks like this:

<foo id="foo"> 
    <row> 
     &lt;img alt="jules.png" src="http://localhost/jules.png" height="1024" width="764"&gt; 
    </row> 
    <row> 
     &lt;img alt="hairfire.png" src="http://localhost/hairfire.png" height="225" width="225"&gt; 
    </row> 
</foo>

So I do something like this:

xml = BeautifulStoneSoup(someXml, selfClosingTags=['img'], convertEntities=BeautifulSoup.HTML_ENTITIES)

and the result is something like:

<foo id="foo"> 
    <row> 
     <img alt="jules.png" src="http://localhost/jules.png" height="1024" width="764"> 
    </row> 
    <row> 
     <img alt="hairfire.png" src="http://localhost/hairfire.png" height="225" width="225"> 
    </row> 
</foo>

Notice that none of the img tags has a closing tag. I'm not sure whether that is my problem, but it might be. When I try:

images = xml.findAll('img')

it returns an empty list. Any idea why BeautifulStoneSoup can't find my images in this XML fragment?


2011-09-23 Greg

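The reason you aren't finding the img tags is that BeautifulSoup treats them as part of the text of the "row" tags. Converting the entities only changes the strings; it doesn't change the underlying structure of the document. The following isn't a great solution (it parses the document twice), but it worked when I tested it on your sample XML. The idea is to convert the text into bad XML and then have Beautiful Soup clean it up again.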

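from BeautifulSoup import BeautifulSoup  # Beautiful Soup 3 (Python 2)

# Inner parse unescapes the entities; outer parse re-reads the result as markup.
soup = BeautifulSoup(BeautifulSoup(text, convertEntities=BeautifulSoup.HTML_ENTITIES).prettify())
print soup.findAll('img')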
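
For anyone who wants to try the same two-pass idea on Python 3 without Beautiful Soup 3, here is a minimal sketch using only the standard library. It is a swapped-in technique, not the code above: xml.etree.ElementTree does the first parse (where the escaped img markup is still just text inside each row), and html.parser.HTMLParser does the second parse on that text. SAMPLE_XML is the snippet from the question; ImgCollector and find_images are hypothetical names made up for this illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# The XML from the question: the <img> markup is entity-escaped inside each <row>.
SAMPLE_XML = """<foo id="foo">
    <row>
     &lt;img alt="jules.png" src="http://localhost/jules.png" height="1024" width="764"&gt;
    </row>
    <row>
     &lt;img alt="hairfire.png" src="http://localhost/hairfire.png" height="225" width="225"&gt;
    </row>
</foo>"""


class ImgCollector(HTMLParser):
    """Hypothetical helper: records the attributes of every <img> start tag."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.images.append(dict(attrs))


def find_images(xml_text):
    """Parse the outer XML, then re-parse each row's text as markup."""
    root = ET.fromstring(xml_text)
    collector = ImgCollector()
    for row in root.iter("row"):
        # The XML parser has already unescaped &lt; and &gt; in row.text,
        # so it now reads '<img alt=... >' and can be parsed a second time.
        collector.feed(row.text or "")
    collector.close()
    return collector.images


# After the first parse alone there are no <img> elements, only text.
print(ET.fromstring(SAMPLE_XML).findall(".//img"))

# After the second parse the images show up.
images = find_images(SAMPLE_XML)
print([img["src"] for img in images])
```

The first print mirrors the empty result from the question, since the escaped markup never became part of the document structure. Because HTMLParser is lenient, the missing closing tag on img does not matter for the second pass.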


2011-10-03 15:28:56 Doran
