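I'm trying to scrape a story from the website I work for, given its URL, and then publish it to the various news partners we have. The problem is that special characters seem to be giving it hiccups. I'm trying string replacement, but it doesn't work particularly well. I need to convert all of the text to plain text/ASCII (I think?).
Is there a way to force the output to be completely plain text that should work anywhere? Like, no special characters?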
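To make concrete what I mean by "plain text", here is a minimal sketch of the behavior I'm hoping for. It assumes the text has already been decoded to a Unicode string. The `PUNCTUATION` table and the `to_ascii` helper are names I made up for illustration, and I'm not sure this is the right approach:

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals  # keeps the sketch valid on Python 2 as well

import unicodedata

# Hypothetical mapping of common typographic characters to ASCII stand-ins.
PUNCTUATION = {
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark / apostrophe
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2013": "-",    # en dash
    "\u2014": "-",    # em dash
    "\u2026": "...",  # horizontal ellipsis
    "\u00a0": " ",    # non-breaking space
}


def to_ascii(text):
    """Return an ASCII-only approximation of a Unicode string."""
    for fancy, plain in PUNCTUATION.items():
        text = text.replace(fancy, plain)
    # NFKD splits accented letters into a base letter plus combining marks.
    text = unicodedata.normalize("NFKD", text)
    # Drop whatever still has no ASCII equivalent (including the combining marks).
    return text.encode("ascii", "ignore").decode("ascii")


print(to_ascii("It\u2019s a \u201ccaf\u00e9\u201d"))  # prints: It's a "cafe"
```

Anything outside the table that has no ASCII decomposition is silently dropped, which may or may not be acceptable for the partners' feeds.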
My current code is:
from __future__ import division
#from __future__ import unicode_literals
from __future__ import print_function
import spynner
from mechanize import Browser
import SendKeys
from BeautifulSoup import BeautifulSoup
br = Browser()
url = "http://www.benzinga.com/trading-ideas/long-ideas/11/07/1815251/bargain-hunting-for-mid-caps-five-stocks-worth-taking-a-look-"
page = br.open(url)
html = page.read()
soup = BeautifulSoup(html)
artcontent = soup.find('div', {'class': 'article-content'})
title = artcontent.find('h1', {'id': 'title'})
title = title.string
try:
    title = title.replace("'", "'")
except:
    pass
authorname = artcontent.find('div', {'class': 'node full'})
authorname = authorname.find('div', {'class': 'article-submitted'})
authorname = authorname.find('div', {'class': 'info'})
authorname = authorname.find('a')
authorname = authorname.string
story = artcontent.find('div', {'class': 'node full'})
story = story.find('div', {'class': 'content clear-block'})
story = story.findAll('p', {'class': None})
#story = [str(x).replace("<p>","\n\n").replace("</p>","") for x in story]
story = [str(x) for x in story]
storyunified = ''.join(story)
#try:
# storyunified = storyunified.strip("\n")
#except:
# pass
#try:
# storyunified = storyunified.strip("\n")
#except:
# pass
#print(storyunified)
try:
    storyunified = storyunified.replace("Â", "")
except:
    pass
try:
    storyunified = storyunified.replace("â€", "\'")
except:
    pass
try:
    storyunified = storyunified.replace('「', '\"')
except:
    pass
try:
    storyunified = storyunified.replace('"', '\"')
except:
    pass
try:
    storyunified = storyunified.replace('」', '\"')
except:
    pass
try:
    storyunified = storyunified.replace("âタ", "")
except:
    pass
try:
    storyunified = storyunified.replace("â€", "")
except:
    pass
As you can see, I'm trying to get rid of them manually, and it doesn't always seem to work.
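My guess, which I have not confirmed, is that sequences like "Â" and "â€" are UTF-8 bytes being read as Windows-1252 somewhere along the way. The sketch below reproduces that effect on a sample string. The cp1252 codec is an assumption on my part, and `mojibake` and `repair` are hypothetical helper names:

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals  # keeps the sketch valid on Python 2 as well


def mojibake(text):
    """Encode text as UTF-8, then (wrongly) decode the bytes as Windows-1252."""
    return text.encode("utf-8").decode("cp1252")


def repair(garbled):
    """Undo mojibake() by reversing the wrong decode, then decoding as UTF-8."""
    return garbled.encode("cp1252").decode("utf-8")


sample = "It\u2019s"            # contains a right single quotation mark
garbled = mojibake(sample)      # the quote becomes the three characters U+00E2 U+20AC U+2122
nbsp = mojibake("\u00a0")       # a non-breaking space becomes U+00C2 followed by U+00A0

print(repr(garbled))
print(repair(garbled) == sample)  # prints: True
```

If that guess is right, the stray characters would be a symptom of decoding with the wrong codec rather than something to strip out one replacement at a time.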
I then post it using Spynner, but I don't think that code is essential here. I'm posting to a Forbes blog.