2013-02-25 95 views
0

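python - Struggling with string encoding and accented quotes/apostrophes

I have a simple RSS feed script that takes the content of each article, runs it through some simple processing, and then saves it to a database.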

The problem is that running the text through the following removes all accented apostrophes and quotes from it.

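import unicodedata
from bs4 import BeautifulSoup  # assumed BS4 import, based on the asker's later comment

# this is just an example string, I use feed_parser to download the feeds 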
string = """&#160; <p>This is a sentence. This is a sentence. I'm a programmer. I&#8217;m a programmer, however I don&#8217;t graphic design.</p>""" 

text = BeautifulSoup(string) 
# does some simple soup processing 

string = text.renderContents() 
string = string.decode('utf-8', 'ignore') 
string = string.replace('<html>','') 
string = string.replace('</html>','') 
string = string.replace('<body>','') 
string = string.replace('</body>','') 
string = unicodedata.normalize('NFKD', string).encode('utf-8', 'ignore') 
print "".join([x for x in string if ord(x)<128]) 

This results in:

> <p> </p><p>This is a sentence. This is a sentence. I'm a programmer. Im a programmer, however I dont graphic design.</p> 

All of the quotes/apostrophes written as HTML entities get stripped out. How can I fix this?
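
One plausible explanation, sketched below in Python 3 with only the standard library: `&#8217;` decodes to U+2019 (right single quotation mark), NFKD normalization has no ASCII decomposition for that character, and the final `ord(x)<128` filter then drops it. Here `html.unescape` is only a stand-in for whatever entity decoding the parser performs, and the variable names are illustrative.

```python
import html
import unicodedata

raw = "I&#8217;m a programmer, however I don&#8217;t graphic design."

# Stand-in for the parser's entity decoding: &#8217; becomes U+2019.
decoded = html.unescape(raw)

# NFKD defines no decomposition for U+2019, so it passes through unchanged.
normalized = unicodedata.normalize("NFKD", decoded)

# Keeping only code points below 128 silently removes the apostrophes.
ascii_only = "".join(ch for ch in normalized if ord(ch) < 128)

print(repr(normalized))
print(ascii_only)
```

In the Python 2 code above the filter runs over UTF-8 bytes rather than characters, but the effect is the same, since every byte of an encoded U+2019 is 128 or higher.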

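You really want to use the 'feedparser' library (https://code.google.com/p/feedparser/) when handling RSS feeds. It will clean up most of the feed items for you, without your having to replace tags manually (which can be done in different ways). – 2013-02-25 21:20:19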

The thing stripping them out is BeautifulSoup. That is probably by design. Can you explain what you are trying to accomplish by using it? – entropy 2013-02-25 21:24:34

The example code doesn't work for me, because 'BeautifulSoup' doesn't expand the '&#NNNN;' escapes. – wRAR 2013-02-25 21:27:05

Answers

1

The code below works for me; you probably missed the convertEntities argument to the BeautifulSoup constructor:

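from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 import; convertEntities is a BS3 argument

string = """&#160; <p>This is a sentence. This is a sentence. I'm a programmer. I&#8217;m a programmer, however I don&#8217;t graphic design.</p>""" 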

text = BeautifulSoup(string, convertEntities=BeautifulSoup.HTML_ENTITIES) # See the convertEntities argument 
# does some simple soup processing 

string = text.renderContents() 
string = string.decode('utf-8') 
string = string.replace('<html>','') 
string = string.replace('</html>','') 
string = string.replace('<body>','') 
string = string.replace('</body>','') 
# I don't know why you are doing this 
#string = unicodedata.normalize('NFKD', string).encode('utf-8', 'ignore') 
print string 

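I ran your suggestion, and it turns out convertEntities is actually deprecated (I'm using BS4). After some more searching I found output formatters and another solution that might work: http://stackoverflow.com/questions/11856011/beautifulsoup-has-no-attribute-html-entities – Joe 2013-02-25 21:58:11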
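
If the goal is plain ASCII output that still keeps the apostrophes, one possible approach (not taken from any answer in this thread) is to map typographic punctuation to ASCII equivalents before discarding the remaining non-ASCII characters. The sketch below is Python 3 and standard-library only; `TYPOGRAPHIC_TO_ASCII` and `to_ascii` are hypothetical names, the mapping is deliberately incomplete, and `html.unescape` again stands in for the parser's entity decoding.

```python
import html
import unicodedata

# Hypothetical mapping; extend it with whatever punctuation the feeds contain.
TYPOGRAPHIC_TO_ASCII = {
    0x2018: "'",   # left single quotation mark
    0x2019: "'",   # right single quotation mark (typographic apostrophe)
    0x201C: '"',   # left double quotation mark
    0x201D: '"',   # right double quotation mark
    0x2013: "-",   # en dash
    0x2014: "-",   # em dash
}


def to_ascii(text):
    """Decode HTML entities, then reduce the text to ASCII without losing quotes."""
    text = html.unescape(text)                   # &#8217; becomes U+2019
    text = text.translate(TYPOGRAPHIC_TO_ASCII)  # U+2019 becomes '
    text = unicodedata.normalize("NFKD", text)   # split accented letters into base + mark
    return text.encode("ascii", "ignore").decode("ascii")  # drop what is still non-ASCII


print(to_ascii("I&#8217;m a programmer, however I don&#8217;t graphic design."))
```

Doing the translation before the normalization step matters here, because NFKD leaves the curly quotes untouched and the final ASCII encoding would otherwise discard them.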