Replace all smart quotes in Beautiful Soup
I have an HTML document, and I want to replace all of the smart quotes with regular quotes. I tried this:
for text_element in html.findAll():
    content = text_element.string
    if content:
        new_content = content \
            .replace(u"\u2018", "'") \
            .replace(u"\u2019", "'") \
            .replace(u"\u201c", '"') \
            .replace(u"\u201d", '"') \
            .replace("e", "x")
        text_element.string.replaceWith(new_content)
(The e/x replacement is only there so I can easily see whether the code is working.)
But this is my output:
<p>
This amount of investment is producing results: total final consumption in IEA countries is estimated to be
<strong>
60% lowxr
</strong>
today because of energy efficiency improvements over the last four decades. This has had the effect of
<strong>
avoiding morx xnxrgy consumption than thx total final consumption of thx Europxan Union in 2011
</strong>
.
</p>
It seems that BS only reaches the innermost child tags, but I need to get all of the text in the entire page.
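A likely explanation for the output above, offered as a sketch rather than a definitive answer: in Beautiful Soup, `.string` is `None` for any tag that has more than one child, so the loop only rewrites tags that wrap a single piece of text (here the `<strong>` elements) and skips the `<p>`. One way around that is to walk the text nodes themselves instead of the tags. The example assumes bs4 on Python 3 with the built-in `html.parser`; `straighten_quotes`, `QUOTE_TABLE`, and the sample markup are illustrative names that do not appear in the original post.

```python
from bs4 import BeautifulSoup

# Map each curly quote to its plain ASCII counterpart.
QUOTE_TABLE = str.maketrans({
    "\u2018": "'",
    "\u2019": "'",
    "\u201c": '"',
    "\u201d": '"',
})


def straighten_quotes(soup):
    """Replace smart quotes in every text node of ``soup``, in place."""
    # find_all(string=True) returns a list of all text nodes up front,
    # so the tree can be modified safely while looping over it.
    for node in soup.find_all(string=True):
        fixed = node.translate(QUOTE_TABLE)
        if fixed != node:
            # Rebuild the node with its own class so comments stay comments.
            node.replace_with(type(node)(fixed))
    return soup


html = BeautifulSoup(
    "<p>\u201cTotal\u201d consumption is <strong>60% lower</strong> today, "
    "it\u2019s said.</p>",
    "html.parser",
)
straighten_quotes(html)
print(html)
```

Because every text node is visited, text that sits directly inside the `<p>` is handled the same way as text inside `<strong>`.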
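What happens if you call new_content = str(html).replace(u"\u2018", "'").replace(...)? – jinksPadlock
The problem isn't the replace part. It is working correctly on the child elements; it just isn't hitting the parents. – thumbtackthief
That's why I'm wondering what would happen if you just called it on the entire soup. Or maybe I'm missing something? – jinksPadlock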
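The suggestion in these comments, replacing on the serialized document rather than node by node, might look roughly like the sketch below. It assumes Python 3, where `str(html)` yields the markup as text; `straighten_markup` and `page` are hypothetical names, and `page` stands in for the result of `str(html)`.

```python
# Map each curly quote to its plain ASCII counterpart.
QUOTE_TABLE = str.maketrans({
    "\u2018": "'",
    "\u2019": "'",
    "\u201c": '"',
    "\u201d": '"',
})


def straighten_markup(markup):
    """Return ``markup`` with smart quotes replaced everywhere in the string."""
    return markup.translate(QUOTE_TABLE)


# Stand-in for str(html): the whole document as one string.
page = (
    "<p>\u2018Energy\u2019 use is "
    "<strong>\u201c60% lower\u201d</strong> today.</p>"
)
print(straighten_markup(page))
```

This touches the whole string, including attribute values, so a curly quote inside a double-quoted attribute would become a plain `"` and could break the markup. If the parsed tree is still needed afterwards, the cleaned string would have to be parsed again.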