2014-03-05 38 views
1

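I'm trying to add '&nbsp;' to a BeautifulSoup tag. BS converts tag.string to \&amp;nbsp; instead of &nbsp;. It must be some encoding issue, but I can't figure it out. How do I insert a non-breaking space (&nbsp;) into a BeautifulSoup tag?

Please note: ignore the leading '\' character. I had to add it so that Stack Overflow would format my question correctly.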

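from bs4 import BeautifulSoup  # import the class; the bs4 module itself is not callable

html = "<td><span></span></td>"
soup = BeautifulSoup(html, "html.parser")
tag = soup.find("td")
tag.string = "&nbsp;"  # replaces the contents of <td> with this literal text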

The current output is html = "\&amp;nbsp;"

Any ideas?

+0

How are you printing the output? – shaktimaan

Answers

0

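By default, BeautifulSoup uses the "minimal" output formatter, which escapes bare ampersands and angle brackets on output. That is why the literal string "&nbsp;" comes out as "&amp;nbsp;".

The solution is to set the output formatter to None. Quoting from the BS source (the PageElement docstring):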

# There are five possible values for the "formatter" argument passed in 
# to methods like encode() and prettify(): 
# 
# "html" - All Unicode characters with corresponding HTML entities 
# are converted to those entities on output. 
# "minimal" - Bare ampersands and angle brackets are converted to 
# XML entities: &amp; &lt; &gt; 
# None - The null formatter. Unicode characters are never 
# converted to entities. This is not recommended, but it's 
# faster than "minimal". 

Example:

from bs4 import BeautifulSoup


html = "<td><span></span></td>"
soup = BeautifulSoup(html, 'html.parser')
tag = soup.find("span")
tag.string = '&nbsp;'

# formatter=None disables entity substitution, so "&" is written as-is
print(soup.prettify(formatter=None))

Prints:

<td>
 <span>
  &nbsp;
 </span>
</td>
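
To make the difference between the formatters concrete, here is a minimal sketch that serializes the same tag twice. It assumes bs4 is installed and uses `Tag.decode()` rather than `prettify()` so the output stays on one line; the variable names `escaped` and `raw` are only illustrative.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<td><span></span></td>", "html.parser")
span = soup.find("span")
span.string = "&nbsp;"  # literal text: an ampersand followed by "nbsp;"

# Default "minimal" formatter: the bare "&" is escaped to "&amp;"
escaped = span.decode(formatter="minimal")

# Null formatter: the text is written out exactly as stored
raw = span.decode(formatter=None)

print(escaped)  # <span>&amp;nbsp;</span>
print(raw)      # <span>&nbsp;</span>
```

Keep in mind that formatter=None also leaves "<", ">" and any other "&" in the document unescaped, so the output can be invalid HTML if the text contains those characters.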

Hope that helps.

+0

Perfect! I found the answer as well. Thanks for your answer! –

-1

You need to add the Unicode non-breaking space character, which can be written in Python as "\xa0":

from bs4 import BeautifulSoup

soup = BeautifulSoup("", "html5lib")  # html5lib adds html, head and body tags by default
soup.body.string = "\xa0"  # Unicode non-breaking space (U+00A0)
print(soup.encode("ascii"))  # show the final HTML in ASCII encoding

Result:

b'<html><head></head><body>&#160;</body></html>'
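
If the named entity is wanted instead of the numeric &#160;, one option is to combine the "\xa0" approach with the "html" formatter described in the docstring quoted earlier in this thread. The following is only a sketch, assuming bs4 with the built-in html.parser; the variable name `named` is illustrative.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<td><span></span></td>", "html.parser")
span = soup.find("span")
span.string = "\xa0"  # store the real non-breaking space character, not entity text

# The "html" formatter converts characters that have named HTML entities
named = span.decode(formatter="html")

print(named)  # <span>&nbsp;</span>
```

This keeps the tree itself free of entity text: the character is stored as U+00A0, and the entity only appears when the document is serialized.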