Replacing all smart quotes in Beautiful Soup

I have an HTML document, and I want to replace all the smart quotes in it with regular quotes. I tried this:

for text_element in html.findAll():
    content = text_element.string
    if content:
        new_content = content \
            .replace(u"\u2018", "'") \
            .replace(u"\u2019", "'") \
            .replace(u"\u201c", '"') \
            .replace(u"\u201d", '"') \
            .replace("e", "x")
        text_element.string.replaceWith(new_content)

(The e/x replacement is only there so I can easily see whether things are working or not.)

But this is my output:

<p> 
This amount of investment is producing results: total final consumption in IEA countries is estimated to be 
    <strong> 
     60% lowxr 
    </strong> 
today because of energy efficiency improvements over the last four decades. This has had the effect of 
    <strong> 
     avoiding morx xnxrgy consumption than thx total final consumption of thx Europxan Union in 2011 
    </strong> 
. 
</p>

It seems BS is only reaching the innermost child tags, but I need to get all of the text in the whole page.


2017-02-24 thumbtackthief

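What happens if you call `new_content = str(html).replace(u"\u2018", "'").replace(...)`? – jinksPadlock

The problem isn't with the replacement part: it works correctly on the child elements, it just doesn't hit the parents. – thumbtackthief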

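That's why I'm wondering what would happen if you just called it on the whole soup. Or maybe I'm missing something? – jinksPadlock

This works, but maybe there is a cleaner way: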

for text_element in html.findAll():
    for child in text_element.contents:
        if child:
            content = child.string
            if content:
                new_content = remove_smart_quotes(content)
                child.string.replaceWith(new_content)


2017-02-24 18:09:43 thumbtackthief

Instead of selecting and filtering all the elements/tags, you can select the text nodes directly by passing True for the string argument:

for text_node in soup.find_all(string=True): 
    # do something with each text node

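As the documentation states, the string argument is new in version 4.4.0, which means that, depending on your version, you may need to use the text argument instead: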

for text_node in soup.find_all(text=True): 
    # do something with each text node
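To see why selecting text nodes directly helps, here is a minimal sketch. The sample markup is hypothetical, and it assumes Python 3 with a Beautiful Soup version that accepts string=True (4.4.0 or later) and the built-in html.parser. A tag's .string is only set when the tag has exactly one child, so a loop over tags that reads .string skips any text sitting next to other tags.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: a <p> that mixes text with a <strong> child.
sample = "<p>\u201cOne\u201d <strong>\u2018two\u2019</strong> three</p>"
soup = BeautifulSoup(sample, "html.parser")

# <p> has several children, so .string is None and its own text is skipped.
p_string = soup.p.string
# <strong> has exactly one child (a text node), so .string returns that text.
strong_string = soup.strong.string

# find_all(string=True) returns every text node, whatever its nesting depth.
text_nodes = [str(node) for node in soup.find_all(string=True)]

print(p_string)
print(strong_string)
print(text_nodes)
```

This matches the output in the question: only the text inside the strong tags was changed, because those were the only tags with a single text child.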

Here is the relevant code for replacing the values:

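from bs4 import BeautifulSoup

def remove_smart_quotes(text):
    # Replace curly single and double quotes with their ASCII equivalents.
    return text.replace(u"\u2018", "'") \
        .replace(u"\u2019", "'") \
        .replace(u"\u201c", '"') \
        .replace(u"\u201d", '"')

# `html` holds the markup of the document to clean.
soup = BeautifulSoup(html, 'lxml')

# find_all returns a list, so replacing nodes while looping over it is safe.
for text_node in soup.find_all(string=True):
    text_node.replace_with(remove_smart_quotes(text_node))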
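As a possible alternative to chaining replace calls, the same mapping can be written as a translation table. This is only a sketch, assuming Python 3 (str.maketrans is not available on Python 2 strings); the SMART_QUOTES name is made up for illustration.

```python
# Hypothetical name: maps each curly quote to its straight ASCII equivalent.
SMART_QUOTES = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})

def remove_smart_quotes(text):
    # translate() makes a single pass over the string using the table above.
    return text.translate(SMART_QUOTES)

print(remove_smart_quotes("\u201cIt\u2019s fine\u201d"))
```

Because Beautiful Soup text nodes are str subclasses, this version should work as a drop-in for the function used in the loop over text nodes.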

As a side note, the Beautiful Soup documentation actually has a section on smart quotes.
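For reference, that documentation section is about UnicodeDammit, which can convert Microsoft smart quotes while decoding bytes. The sketch below follows the example given there. As far as I can tell, it applies to Windows-1252 encoded bytes at decode time, so it may not help if your document is already decoded to Unicode with \u2018-style characters.

```python
from bs4 import UnicodeDammit

# Bytes containing Windows-1252 smart quotes (\x93, \x94, \x92).
markup = b"<p>I just \x93love\x94 Microsoft Word\x92s smart quotes</p>"

# smart_quotes_to="ascii" asks for straight ASCII quotes in the decoded text.
dammit = UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="ascii")
converted = dammit.unicode_markup

print(converted)
```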


2017-02-24 18:25:21
