是否有從utf8到latin-1和utf8中標準化非重音字母的映射?是否存在從utf8到latin-1的現有映射? Python
我已經越來越錯誤,如:
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u010d' in position 4: ordinal not in range(256)
而且我解決這些錯誤中的每一個通過執行以下代碼手動。有沒有更好的辦法做到這一點?:
def prehunpos(sentence):
sentence = sentence.replace(u'\u2018',"'") # left single quote mark
sentence = sentence.replace(u'\u2019',"'") # right single quote mark
sentence = sentence.replace(u'\u201C','"') # left double quote mark
sentence = sentence.replace(u'\u201D','"') # right double quote mark
sentence = sentence.replace(u'\u2010',"-") # hyphen
sentence = sentence.replace(u'\u2011',"-") # non-break hyphen
sentence = sentence.replace(u'\u2012',"-") # figure dash
sentence = sentence.replace(u'\u2013',"-") # dash
sentence = sentence.replace(u'\u2014',"-") # some sorta dash
sentence = sentence.replace(u'\u2015',"-") # long dash
sentence = sentence.replace(u'\u2017',"_") # double underscore
sentence = sentence.replace(u'\u2014',"-") # some sorta dash
sentence = sentence.replace(u'\u2016',"|") # long dash
sentence = sentence.replace(u'\u2024',"...") # ...
sentence = sentence.replace(u'\u2025',"...") # ...
sentence = sentence.replace(u'\u2026',"...") # ...
sentence = sentence.replace("\xce\x9d\xce\x91\xce\xa4\xce\x9f",u'NATO') # NATO
sentence = sentence.replace(u'\u0391',"A") # Greek Capital Alpha
sentence = sentence.replace(u'\u0392',"B") # Greek Capital Beta
#sentence = sentence.replace(u'\u0393',"") # Greek Capital Gamma
#sentence = sentence.replace(u'\u0394',"") # Greek Capital Delta
sentence = sentence.replace(u'\u0395',"E") # Greek Capital Epsilon
sentence = sentence.replace(u'\u0396',"Z") # Greek Capital Zeta
sentence = sentence.replace(u'\u0397',"H") # Greek Capital Eta
#sentence = sentence.replace(u'\u0398',"") # Greek Capital Theta
sentence = sentence.replace(u'\u0399',"I") # Greek Capital Iota
sentence = sentence.replace(u'\u039a',"K") # Greek Capital Kappa
#sentence = sentence.replace(u'\u039b',"") # Greek Capital Lambda
sentence = sentence.replace(u'\u039c',"M") # Greek Capital Mu
sentence = sentence.replace(u'\u039d',"N") # Greek Capital Nu
#sentence = sentence.replace(u'\u039e',"") # Greek Capital Xi
sentence = sentence.replace(u'\u039f',"O") # Greek Capital Omicron
sentence = sentence.replace(u'\u03a1',"P") # Greek Capital Rho
#sentence = sentence.replace(u'\u03a3',"") # Greek Capital Sigma
sentence = sentence.replace(u'\u03a4',"T") # Greek Capital Tau
sentence = sentence.replace(u'\u03a5',"Y") # Greek Capital Upsilon
#ssentence = sentence.replace(u'\u03a6',"") # Greek Capital Phi
sentence = sentence.replace(u'\u03a7',"T") # Greek Capital Chi
#sentence = sentence.replace(u'\u03a8',"") # Greek Capital Psi
#sentence = sentence.replace(u'\u03a9',"") # Greek Capital Omega
sentence = sentence.replace(u'\u03b1',"a") # Greek small alpha
sentence = sentence.replace(u'\u03b2',"b") # Greek small beta
#sentence = sentence.replace(u'\u03b3',"") # Greek small gamma
#sentence = sentence.replace(u'\u03b4',"") # Greek small delta
sentence = sentence.replace(u'\u03b5',"e") # Greek small epsilon
#sentence = sentence.replace(u'\u03b6',"") # Greek small zeta
#sentence = sentence.replace(u'\u03b7',"") # Greek small eta
#sentence = sentence.replace(u'\u03b8',"") # Greek small thetha
sentence = sentence.replace(u'\u03b9',"i") # Greek small iota
sentence = sentence.replace(u'\u03ba',"k") # Greek small kappa
#sentence = sentence.replace(u'\u03bb',"") # Greek small lamda
sentence = sentence.replace(u'\u03bc',"u") # Greek small mu
sentence = sentence.replace(u'\u03bd',"v") # Greek small nu
#sentence = sentence.replace(u'\u03be',"") # Greek small xi
sentence = sentence.replace(u'\u03bf',"o") # Greek small omicron
#sentence = sentence.replace(u'\u03c0',"") # Greek small pi
sentence = sentence.replace(u'\u03c1',"p") # Greek small rho
sentence = sentence.replace(u'\u03c2',"c") # Greek small final sigma
#sentence = sentence.replace(u'\u03c3',"") # Greek small sigma
sentence = sentence.replace(u'\u03c4',"t") # Greek small tau
sentence = sentence.replace(u'\u03c5',"u") # Greek small upsilon
#sentence = sentence.replace(u'\u03c6',"") # Greek small phi
sentence = sentence.replace(u'\u03c7',"x") # Greek small chi
sentence = sentence.replace(u'\u03c8',"x") # Greek small psi
sentence = sentence.replace(u'\u03c9',"w") # Greek small omega
sentence = sentence.replace(u'\u0103',"a") # Latin a with breve
sentence = sentence.replace(u'\u0107',"c") # Latin c with acute
sentence = sentence.replace(u'\u010d',"c") # Latin c with caron
sentence = sentence.replace(u'\u0161',"s") # Lation s with caron
return sentence.strip()
可能的重複[如何用Python中的ascii字符替換unicode字符(給出的perl腳本)?](http://stackoverflow.com/questions/2700859/how-to-replace-unicode-characters-by-ascii -characters-in-python-perl-script-giv) –
http://code.google.com/p/a2bot/source/browse/trunk/lib/unaccent.py –
@MikeSamuel,它不能解決問題瘋狂的utf8標點不能正常化。 – alvas