Need to convert all text to plain text/ASCII (I think?)

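I am trying to scrape a story from the website I work for: you put in a URL, and the script posts the story to the various news partners we have. The problem is that special characters seem to give it hiccups. I am trying to replace the offending strings by hand, but that isn't working particularly well.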

Is there a way to force the output to be completely normal text that should work anywhere? Like, with no special characters?

My current code is:

from __future__ import division 
#from __future__ import unicode_literals 
from __future__ import print_function 
import spynner 
from mechanize import Browser 
import SendKeys 
from BeautifulSoup import BeautifulSoup 

br = Browser() 
url = "http://www.benzinga.com/trading-ideas/long-ideas/11/07/1815251/bargain-hunting-for-mid-caps-five-stocks-worth-taking-a-look-" 
page = br.open(url) 
html = page.read() 
soup = BeautifulSoup(html) 

artcontent = soup.find('div', {'class': 'article-content'}) 

title = artcontent.find('h1', {'id': 'title'}) 

title = title.string 

try: 
    title = title.replace("&#039;", "'") 
except: 
    pass 

authorname = artcontent.find('div', {'class': 'node full'}) 
authorname = authorname.find('div', {'class': 'article-submitted'}) 
authorname = authorname.find('div', {'class': 'info'}) 
authorname = authorname.find('a') 
authorname = authorname.string 

story = artcontent.find('div', {'class': 'node full'}) 
story = story.find('div', {'class': 'content clear-block'}) 
story = story.findAll('p', {'class': None}) 

#story = [str(x).replace("<p>","\n\n").replace("</p>","") for x in story] 

story = [str(x) for x in story] 

storyunified = ''.join(story) 

#try: 
# storyunified = storyunified.strip("\n") 
#except: 
# pass 
#try: 
# storyunified = storyunified.strip("\n") 
#except: 
# pass 

#print(storyunified) 

try: 
    storyunified = storyunified.replace("Â", "") 
except: 
    pass 

try: 
    storyunified = storyunified.replace("â€", "\'") 
except: 
    pass 

try: 
    storyunified = storyunified.replace('“', '\"') 
except: 
    pass 

try: 
    storyunified = storyunified.replace('"', '\"') 
except: 
    pass 

try: 
    storyunified = storyunified.replace('”', '\"') 
except: 
    pass 

try: 
    storyunified = storyunified.replace("âﾀ", "") 
except: 
    pass 

try: 
    storyunified = storyunified.replace("â€", "") 
except: 
    pass

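As you can see, I am trying to get rid of them manually, and it doesn't always seem to work.

I then post the result using Spynner, but I don't think that code matters here. I'm posting to a Forbes blog.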

2011-08-08 Andrew Alexander

Please take a look at this article and see whether you are already familiar with the principles it discusses: http://www.joelonsoftware.com/articles/Unicode.html

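My hunch is that your news partners are all able to accept text beyond what plain ASCII can encode. You just need to make sure your application handles character strings and byte strings correctly, and everything should work naturally.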
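
To make the "handle character strings and byte strings correctly" advice concrete, here is a minimal sketch in Python 3. It assumes the page is served as UTF-8 (the `Â` and `â€` debris in the question hints at that, but I have not verified it). `to_text` and `to_wire` are hypothetical helper names, and `raw` is a stand-in for whatever `page.read()` returns.

```python
import html


def to_text(raw_bytes, encoding="utf-8"):
    """Decode raw page bytes to text and resolve HTML entities such as &#039;."""
    text = raw_bytes.decode(encoding)  # bytes -> str, once, at the input boundary
    return html.unescape(text)         # "&#039;" becomes "'"


def to_wire(text, encoding="utf-8"):
    """Encode text back to bytes, once, at the output boundary."""
    return text.encode(encoding)


# Stand-in for page.read(): UTF-8 bytes containing an entity and curly quotes.
raw = "It&#039;s a \u201cbargain\u201d".encode("utf-8")

text = to_text(raw)      # work with str everywhere in between
payload = to_wire(text)  # hand bytes to the partner
```

With this shape there is nothing left to `replace()` by hand: the curly quotes survive the round trip intact, and the entity is resolved by the standard library.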

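In Python 2.x, 'this text' is a byte string and u'this text' is a character string. In Python 3.x, 'this text' is a character string and b'this text' is a byte string. Byte strings have a .decode(encoding) method, and character strings have an .encode(encoding) method.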
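
As an illustration of `.encode()` and `.decode()`, the Python 3 sketch below shows one way the `â€`-style debris from the question can arise: UTF-8 bytes decoded with the wrong codec. That the question's text went through exactly this path (UTF-8 read as cp1252) is my assumption, not something the post confirms.

```python
curly = "\u2019"                    # RIGHT SINGLE QUOTATION MARK (a "smart" apostrophe)
utf8_bytes = curly.encode("utf-8")  # character string -> byte string (3 bytes)

# Decoding the same bytes with the wrong codec yields three junk characters.
wrong = utf8_bytes.decode("cp1252")
right = utf8_bytes.decode("utf-8")

# A non-breaking space goes wrong the same way and produces a stray "Â".
nbsp_wrong = "\u00a0".encode("utf-8").decode("cp1252")

# When the wrong codec is known, the damage can be undone by reversing the step.
repaired = wrong.encode("cp1252").decode("utf-8")

print(repr(wrong), repr(right), repr(nbsp_wrong), repr(repaired))
```

The point is that the junk characters are a symptom of one wrong decode, so fixing the decode is more reliable than stripping each symptom afterwards.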

Good luck!

2011-08-08 15:46:54 wberry

I was wrestling with character encodings in Python just the other day.

Try this:

import unicodedata 

storyunified = unicodedata.normalize('NFKD', storyunified).encode('ascii','ignore').decode("ascii")

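One thing to note is that this removes the offending characters rather than replacing them. To change that behavior you can change ignore to replace, but I have not tested that at all.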
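
For comparison, here is a small Python 3 sketch of the two error handlers side by side. `to_ascii` is a hypothetical wrapper around the one-liner above, and the sample string is made up for illustration.

```python
import unicodedata


def to_ascii(text, errors="ignore"):
    """Decompose characters (NFKD), then force the result into ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", errors).decode("ascii")


sample = "caf\u00e9 \u2019quoted\u2019"  # e-acute plus two curly apostrophes

dropped = to_ascii(sample)            # 'ignore': unencodable characters vanish
marked = to_ascii(sample, "replace")  # 'replace': each one becomes '?'

print(dropped)
print(marked)
```

Note that with `replace` the combining accent split off by NFKD also turns into a `?`, so accented letters gain a stray question mark. If that matters, `ignore` is probably the better fit for this approach.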

2011-08-08 15:01:48
