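python - Struggling with string encoding and accented quotes/apostrophes

I have a simple RSS feed script that takes the content of each article, runs it through some simple processing, and then saves it in a database.

The problem is that running the text through the following code removes all the accented apostrophes and quotes from the text.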

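import unicodedata
from bs4 import BeautifulSoup  # assumed import; the post omits it, and the asker later says they use BS4

# this is just an example string, I use feed_parser to download the feeds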
string = """&#160; <p>This is a sentence. This is a sentence. I'm a programmer. I&#8217;m a programmer, however I don&#8217;t graphic design.</p>""" 

text = BeautifulSoup(string) 
# does some simple soup processing 

string = text.renderContents() 
string = string.decode('utf-8', 'ignore') 
string = string.replace('<html>','') 
string = string.replace('</html>','') 
string = string.replace('<body>','') 
string = string.replace('</body>','') 
string = unicodedata.normalize('NFKD', string).encode('utf-8', 'ignore') 
print "".join([x for x in string if ord(x)<128])

This results in:

> <p> </p><p>This is a sentence. This is a sentence. I'm a programmer. Im a programmer, however I dont graphic design.</p>

All of the HTML entity quotes/apostrophes are stripped out. How can I fix this?

Source

2013-02-25 Joe

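You really want to use the 'feedparser' library (https://code.google.com/p/feedparser/) when processing RSS feeds. It will clean up most of the feed items for you, without you having to replace tags manually (which can be done in different ways). – 2013-02-25 21:20:19

The thing stripping them out is BeautifulSoup. This may be by design. Can you explain what you are trying to accomplish by using it? – entropy 2013-02-25 21:24:34

The example code doesn't work for me, because 'BeautifulSoup' doesn't expand the '&#NNNN;' escapes. – wRAR 2013-02-25 21:27:05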

The following code works for me. You probably missed the convertEntities argument of the BeautifulSoup constructor:

from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 import (assumed; the answer does not show its imports)

string = """&#160; <p>This is a sentence. This is a sentence. I'm a programmer. I&#8217;m a programmer, however I don&#8217;t graphic design.</p>"""

text = BeautifulSoup(string, convertEntities=BeautifulSoup.HTML_ENTITIES) # See the convertEntities argument
# does some simple soup processing 

string = text.renderContents() 
string = string.decode('utf-8') 
string = string.replace('<html>','') 
string = string.replace('</html>','') 
string = string.replace('<body>','') 
string = string.replace('</body>','') 
# I don't know why you are doing this
#string = unicodedata.normalize('NFKD', string).encode('utf-8', 'ignore') 
print string
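
On the commented-out normalization step: NFKD has no ASCII decomposition for U+2019 (the character behind &#8217;), so a later "keep only ord(x) < 128" filter simply drops it. If ASCII output really is the goal, one option is to map typographic punctuation to ASCII before filtering. The following is only a sketch under assumptions: it is written for Python 3 (the code above is Python 2), it uses the standard library's html.unescape instead of BeautifulSoup, and the PUNCT_TO_ASCII table and to_ascii helper are hypothetical names that do not appear in the original post.

```python
import html
import unicodedata

# Hypothetical mapping (not from the original post): typographic punctuation -> ASCII.
PUNCT_TO_ASCII = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark / apostrophe
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    "\u00a0": " ",   # no-break space (&#160;)
})


def to_ascii(text):
    """Decode HTML entities, then reduce the text to ASCII while keeping quotes."""
    text = html.unescape(text)                  # "&#8217;" becomes U+2019
    text = text.translate(PUNCT_TO_ASCII)       # U+2019 becomes "'"
    text = unicodedata.normalize("NFKD", text)  # split accented letters into base + mark
    return text.encode("ascii", "ignore").decode("ascii")


sample = "I&#8217;m a programmer, however I don&#8217;t graphic design."

# The approach from the question: normalize, then drop everything non-ASCII.
naive = "".join(
    ch for ch in unicodedata.normalize("NFKD", html.unescape(sample)) if ord(ch) < 128
)

print(naive)             # Im a programmer, however I dont graphic design.
print(to_ascii(sample))  # I'm a programmer, however I don't graphic design.
```

If the database can store Unicode, skipping the ASCII filter altogether and keeping the decoded text is likely the simpler choice.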

Source

2013-02-25 21:26:12 Xion345

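I ran your suggestion, and it turns out convertEntities is actually deprecated (I'm using BS4). After some more searching I found output formatters and another solution that might work: http://stackoverflow.com/questions/11856011/beautifulsoup-has-no-attribute-html-entities – Joe 2013-02-25 21:58:11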
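
For readers on BS4, the output formatters mentioned in this comment can be sketched roughly as follows. This is an illustration and not the code from the linked question; it assumes Python 3 with the third-party beautifulsoup4 package installed, and a shortened version of the sample string. BS4 decodes entities to Unicode characters while parsing, and the "html" formatter turns them back into named entities on output.

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

markup = "<p>I&#8217;m a programmer, however I don&#8217;t graphic design.</p>"

# html.parser is backed by the standard library and does not wrap
# the fragment in <html>/<body> tags, so no tag stripping is needed.
soup = BeautifulSoup(markup, "html.parser")

# Entities are already decoded: the text contains U+2019, not "&#8217;".
text = soup.get_text()

# The "html" formatter re-encodes such characters as named entities.
as_entities = soup.decode(formatter="html")

print(text)
print(as_entities)
```

Which form to store depends on what reads the database later: the decoded Unicode text is usually easier to work with, while the entity form stays pure ASCII.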
