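How to convert unicode text to normal text

I am learning Beautiful Soup in Python. How do I convert unicode text to normal text?

I am trying to parse a simple web page containing a list of books.

For example: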

<a href="https://www.nostarch.com/carhacking">The Car Hacker’s Handbook</a>

I am using the following code.

import requests, bs4 
res = requests.get('http://nostarch.com') 
res.raise_for_status() 
nSoup = bs4.BeautifulSoup(res.text,"html.parser") 
elems = nSoup.select('.product-body a') 

#elems[0] gives 
<a href="https://www.nostarch.com/carhacking">The Car Hacker\u2019s Handbook</a>

and

#elems[0].getText() gives 
u'The Car Hacker\u2019s Handbook'

But I want it as the proper text, as given by:

s = elems[0].getText() 
print s 
>>>The Car Hacker’s Handbook

How can I modify my code so that it outputs "The Car Hacker’s Handbook" instead of "u'The Car Hacker\u2019s Handbook'"?

Please help.

Source

2016-04-14 CS_noob

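There is nothing wrong with the result you are getting. It is a unicode string with a fancy character. – Selcuk

Thanks, @Selcuk. But how do I take the string "u'The Car Hacker\u2019s Handbook'" and store it in a file or database? Will it be saved properly? I mean, I tried 'f.write(elems[0].getText())' and got a UnicodeEncodeError. –

Thanks, @Selcuk. I got it. I use 'elems[0].getText().encode('utf-8')' to save to a file or database. –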
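The exchange above can be illustrated with a minimal sketch. It assumes the title from the question is hard-coded rather than scraped, so `s` is only a stand-in for `elems[0].getText()`. It shows that the string holds one non-ASCII character (U+2019), that encoding it as ASCII fails (which is what Python 2 attempts implicitly when a unicode string is written to a plain file), and that encoding it as UTF-8 succeeds.

```python
import unicodedata

# Hard-coded stand-in for elems[0].getText() (assumption: same text as in the question).
s = u'The Car Hacker\u2019s Handbook'

# The "fancy character" sits at index 14, right after "Hacker".
fancy = s[14]
char_name = unicodedata.name(fancy)

# ASCII has no slot for U+2019, so this encode raises UnicodeEncodeError.
try:
    s.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# UTF-8 can represent every code point; U+2019 becomes three bytes.
utf8_bytes = s.encode('utf-8')
```

The string itself is fine; only the choice of encoding at write time decides whether saving it works.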

Have you tried using the encode method?

elems[0].getText().encode('utf-8')

More information about Unicode and Python can be found at https://docs.python.org/2/howto/unicode.html
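As a rough sketch of how the encoded text could be saved, the snippet below writes the title to a file in two ways and reads it back. The hard-coded `title`, the temporary directory, and the file name `titles.txt` are assumptions made for illustration; they are not part of the original code. `io.open` is used because it accepts an `encoding` argument on both Python 2 and Python 3.

```python
import io
import os
import tempfile

# Hard-coded stand-in for elems[0].getText() (assumption).
title = u'The Car Hacker\u2019s Handbook'

# Hypothetical output location, used only for this example.
path = os.path.join(tempfile.mkdtemp(), 'titles.txt')

# Option 1: encode by hand and write the resulting bytes.
with open(path, 'wb') as f:
    f.write(title.encode('utf-8'))

# Option 2: let io.open do the encoding and write the unicode string directly.
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(title)

# Reading with the same encoding gives back the original unicode string.
with io.open(path, 'r', encoding='utf-8') as f:
    round_trip = f.read()
```

Both options leave the same bytes on disk; the second keeps the encoding decision in one place, at the file boundary.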

Also, to find out whether your string really is UTF-8 encoded, you can use chardet and run the following:

>>> import chardet 
>>> chardet.detect(elems[0].getText()) 
{'confidence': 0.5, 'encoding': 'utf-8'}

Source

2016-04-14 13:07:55 mschuh

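Thanks. I tried 'elems[0].getText().encode('utf-8')'. It works. The Python terminal prints it as "Car Hacker\xe2\x80\x99s Handbook", but when it is written to a file, the file contents show "The Car Hacker’s Handbook". –

Cool. I just edited the answer for correctness. – mschuh

@madhusudan_k Welcome to SO. If you feel this answer solved what you were looking for, please don't forget to accept it by clicking the check mark below the vote count. – Blaszard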

-2

You can try

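import unicodedata

def normText(unicodeText):
    # Decompose characters (NFKD), then drop anything that is not ASCII.
    return unicodedata.normalize('NFKD', unicodeText).encode('ascii', 'ignore')

This converts the unicode text to plain ASCII text, which you can write to a file.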

Source

2016-04-14 14:29:11

It also removes the apostrophe, so the book title becomes "The Car Hackers Handbook". – BlackJack
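The effect described in this comment can be reproduced with a short sketch, again assuming a hard-coded title. U+2019 has no NFKD decomposition, so `encode('ascii', 'ignore')` simply discards it. One possible workaround, shown here only as an illustration, is to map typographic quotes to ASCII quotes first; the `PUNCTUATION` table and the `to_ascii` helper are hypothetical names, not part of the answer.

```python
import unicodedata

# Hard-coded stand-in for elems[0].getText() (assumption).
title = u'The Car Hacker\u2019s Handbook'

def norm_text(unicode_text):
    # Same approach as the answer: NFKD, then drop non-ASCII characters.
    return unicodedata.normalize('NFKD', unicode_text).encode('ascii', 'ignore')

# Hypothetical mapping from typographic quotes to their ASCII look-alikes.
PUNCTUATION = {
    0x2018: u"'",  # left single quotation mark
    0x2019: u"'",  # right single quotation mark
    0x201C: u'"',  # left double quotation mark
    0x201D: u'"',  # right double quotation mark
}

def to_ascii(unicode_text):
    # Replace known punctuation first, then drop whatever is still non-ASCII.
    return unicode_text.translate(PUNCTUATION).encode('ascii', 'ignore')

lossy = norm_text(title)
kept = to_ascii(title)
```

If the file or database can store UTF-8, encoding to UTF-8 avoids this loss altogether; an ASCII fallback like this is only worth considering when the target really cannot hold non-ASCII text.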
