How to get rid of some characters in a string? .replace() doesn't work

I need to get rid of the Polish characters in a string that I get from an XML file. I use .replace(), but in this case it doesn't work. Why? Code:

# -*- coding: utf-8 
from prestapyt import PrestaShopWebService 
from xml.etree import ElementTree 

prestashop = PrestaShopWebService('http://localhost/prestashop/api', 
           'key') 
prestashop.debug = True 

name = ElementTree.tostring(prestashop.search('products', options= 
{'display': '[name]', 'filter[id]': '[2]'}), encoding='cp852', 
method='text') 

print name 
print name.replace('ł', 'l')

Output:

However, when I try to replace a non-Polish character, it works fine.

print name 
print name.replace('a', 'o')

Result:

Naturalne mydło odświeżające 
Noturolne mydło odświeżojące

This also works fine:

name = "Naturalne mydło odświeżające" 
print name.replace('ł', 'l')

Any suggestions?

2017-09-16 vex

You need to normalize both strings to the same Unicode [normal form](https://en.m.wikipedia.org/wiki/Unicode_equivalence).

Possible duplicate of [Can somone explain how unicodedata.normalize(form, unistr) work with examples?](https://stackoverflow.com/questions/14682397/can-somone-explain-how-unicodedata-normalizeform-unistr-work-with-examples)
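
To illustrate what the normalization comment refers to, here is a minimal Python 3 sketch using the standard `unicodedata` module. The helper name `same_text` is made up for this example. It only helps once both values are already Unicode strings, and it may not be the cause here, since 'ł' has no decomposed form.

```python
import unicodedata

composed = "\u015b"      # 'ś' as a single precomposed code point
decomposed = "s\u0301"   # 's' followed by a combining acute accent

# The two spellings render identically but compare unequal as strings.
print(composed == decomposed)


def same_text(a, b, form="NFC"):
    """Compare two strings after normalizing both to the same form (hypothetical helper)."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


print(same_text(composed, decomposed))

# 'ł' (U+0142) has no canonical decomposition, so normalization leaves it unchanged.
print(unicodedata.normalize("NFD", "\u0142") == "\u0142")
```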

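You are mixing encodings in your byte strings. Here is a short working example that reproduces the problem. I assume you are running in a Windows console with the default cp852 encoding: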

#!python2 
# coding: utf-8 
from xml.etree import ElementTree as et 
name_element = et.Element('data') 
name_element.text = u'Naturalne mydło odświeżające' 
name = et.tostring(name_element,encoding='cp852', method='text') 
print name 
print name.replace('ł', 'l')

Output (no replacement made):

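The reason is that the name string is encoded in cp852, but the byte string constant 'ł' is encoded in the source file's encoding, utf-8, so the two byte sequences never match.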

print repr(name) 
print repr('ł')

Output:

'Naturalne myd\x88o od\x98wie\xbeaj\xa5ce' 
'\xc5\x82'
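
The same mismatch can be sketched in Python 3, where byte strings have to be produced explicitly with `.encode()`. This is only an illustration of the two reprs above, not part of the original script. The variable names are made up for the example.

```python
# Python 3 sketch of the encoding mismatch (variable names are illustrative).
text = "Naturalne mydło odświeżające"

name_cp852 = text.encode("cp852")    # the cp852 bytes shown by repr(name) above
needle_utf8 = "ł".encode("utf-8")    # what a 'ł' literal holds in a UTF-8 Python 2 source file

print(name_cp852)
print(needle_utf8)

# The UTF-8 byte pair never occurs in the cp852 data, so nothing is replaced.
unchanged = name_cp852.replace(needle_utf8, b"l")
print(unchanged == name_cp852)

# Encoding the needle the same way as the data makes the replacement match.
fixed = name_cp852.replace("ł".encode("cp852"), b"l")
print(fixed.decode("cp852"))
```

Matching the encodings by hand works, but it is fragile, because every literal has to be encoded to suit the data.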

The best solution is to use Unicode strings:

#!python2 
# coding: utf-8 
from xml.etree import ElementTree as et 
name_element = et.Element('data') 
name_element.text = u'Naturalne mydło odświeżające' 
name = et.tostring(name_element,encoding='cp852', method='text').decode('cp852') 
print name 
print name.replace(u'ł', u'l') 
print repr(name) 
print repr(u'ł')

Output (replacement made):

Naturalne mydło odświeżające 
Naturalne mydlo odświeżające 
u'Naturalne myd\u0142o od\u015bwie\u017caj\u0105ce' 
u'\u0142'

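Note that et.tostring in Python 3 has a Unicode option (encoding='unicode'), and string constants are Unicode by default. The repr() of a string is also more readable there because it shows non-ASCII characters directly, while ascii() gives the old escaped form. You'll also find that Python 3.6 will print Polish even on consoles that don't use a Polish code page, so maybe you won't need to replace the characters at all.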

#!python3 
# coding: utf-8 
from xml.etree import ElementTree as et 
name_element = et.Element('data') 
name_element.text = 'Naturalne mydło odświeżające' 
name = et.tostring(name_element,encoding='unicode', method='text') 
print(name) 
print(name.replace('ł','l')) 
print(repr(name),repr('ł')) 
print(ascii(name),ascii('ł'))

Output:

Naturalne mydło odświeżające 
Naturalne mydlo odświeżające 
'Naturalne mydło odświeżające' 'ł' 
'Naturalne myd\u0142o od\u015bwie\u017caj\u0105ce' '\u0142'

2017-09-16 22:42:15

Thank you very much! Encoding/decoding is still a bit tricky for me, so I guess I'll have to study the Unicode HOWTO. I'll also consider moving to Python 3.x. – vex

If I understand your problem correctly, you can use unidecode:

>>> from unidecode import unidecode 
>>> unidecode("Naturalne mydło odświeżające") 
'Naturalne mydlo odswiezajace'

You may need to decode your cp852-encoded string first, with name.decode('cp852').
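
If installing unidecode is not an option, a standard-library alternative (not unidecode itself) is to decode first and then apply a translation table. The sketch below is Python 3. The table `POLISH_TO_ASCII` and the helper `strip_polish` are hypothetical names introduced for illustration, and the table only covers the Polish letters.

```python
# Hypothetical mapping for illustration: Polish letters to ASCII look-alikes.
POLISH_TO_ASCII = str.maketrans(
    "ąćęłńóśźżĄĆĘŁŃÓŚŹŻ",
    "acelnoszzACELNOSZZ",
)


def strip_polish(raw, encoding="cp852"):
    """Decode bytes with their actual encoding, then map Polish letters to ASCII."""
    text = raw.decode(encoding) if isinstance(raw, bytes) else raw
    return text.translate(POLISH_TO_ASCII)


raw_name = b"Naturalne myd\x88o od\x98wie\xbeaj\xa5ce"  # cp852 bytes
print(strip_polish(raw_name))
```

Unlike unidecode, this leaves any character outside the table untouched, which may or may not be what you want.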

2017-09-16 21:39:35

Thanks! I've implemented your suggestion and now everything works fine. – vex
