Using the .replace() method on a string read from an HTML file containing Unicode

I want to read an .html file as raw text and replace instances of a substring that contains Unicode characters with another substring. Suppose the file mm03.html contains just one line of text:

<span style='font-size:14.0pt'>«test»</span>

I want to read mm03.html, treat its raw text as a single string, and call replace on it so that the output looks like this:

<span style='font-size:14.0pt'>TEST</span>

For my first attempt, I wrote the following code...

# -*- coding: utf-8 -*- 
import codecs 
htmlBase = codecs.open("mm03.html",'r') 
htmlFill = htmlBase.read() 
print htmlFill 
htmlFill = htmlFill.replace("«test»","TEST") 
print htmlFill 
htmlBase.close()

...and expected it to print the original line listed above first, followed by the second line. Instead, it printed the first line twice.

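OK. So this is probably a Unicode decoding problem, right? Maybe, but when I modify the code according to the Unicode-related advice scattered across this site, different shades of the same problem persist. Moreover, the desired behavior can be achieved by explicitly defining htmlBase as...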

htmlBase = """<span style='font-size:14.0pt'>«test»</span>"""

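...which leads me to believe there is something I don't know about how Python reads HTML files. I have tried opening mm03.html in 'w' mode, but that doesn't seem to work and tends to destroy the original file. It wouldn't make much sense for a string read from a read-only file to be read-only itself, but I could be wrong.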

Here are a few of the script/output pairs I have worked through.

Adding the unquoted character 'u' before the string I want to replace

# -*- coding: utf-8 -*- 
import codecs 
htmlBase = codecs.open("mm03.html",'r') 
htmlFill = htmlBase.read() 
print htmlFill 
htmlFill = htmlFill.replace(u"«test»","TEST") 
print htmlFill 
htmlBase.close()

Output:

<span style='font-size:14.0pt'>½test╗</span> 
Traceback (most recent call last): 
    File "test2.py", line 6, in <module> 
    htmlFill = htmlFill.replace(u"┬½test┬╗","TEST") 
UnicodeDecodeError: 'ascii' codec can't decode byte 0xab in position 31: ordinal not in range(128)

Applying .decode('utf-8') to the string returned by .read()

# -*- coding: utf-8 -*- 
import codecs 
htmlBase = codecs.open("mm03.html",'r') 
htmlFill = htmlBase.read().decode('utf-8') 
print htmlFill 
htmlFill = htmlFill.replace(u"«test»","TEST") 
print htmlFill 
htmlBase.close()

Output:

Traceback (most recent call last): 
    File "test2.py", line 4, in <module> 
    htmlFill = htmlBase.read().decode('utf-8') 
    File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode 
    return codecs.utf_8_decode(input, errors, True) 
UnicodeDecodeError: 'utf8' codec can't decode byte 0xab in position 31: invalid start byte

Applying .encode('utf-8') to the string returned by .read()

# -*- coding: utf-8 -*- 
import codecs 
htmlBase = codecs.open("mm03.html",'r') 
htmlFill = htmlBase.read().encode('utf-8') 
print htmlFill 
htmlFill = htmlFill.replace(u"«test»","TEST") 
print htmlFill 
htmlBase.close()

Output:

Traceback (most recent call last): 
    File "test2.py", line 4, in <module> 
    htmlFill = htmlBase.read().encode('utf-8') 
UnicodeDecodeError: 'ascii' codec can't decode byte 0xab in position 31: ordinal not in range(128)

Applying .decode('utf-8') to the string returned by .read(), without the 'u' prefix on the target substring

# -*- coding: utf-8 -*- 
import codecs 
htmlBase = codecs.open("mm03.html",'r') 
htmlFill = htmlBase.read().decode('utf-8') 
print htmlFill 
htmlFill = htmlFill.replace("«test»","TEST") 
print htmlFill 
htmlBase.close()

Output:

Traceback (most recent call last): 
    File "test2.py", line 4, in <module> 
    htmlFill = htmlBase.read().decode('utf-8') 
    File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode 
    return codecs.utf_8_decode(input, errors, True) 
UnicodeDecodeError: 'utf8' codec can't decode byte 0xab in position 31: invalid start byte

Applying .encode('utf-8') to the string returned by .read(), without the 'u' prefix on the target string

# -*- coding: utf-8 -*- 
import codecs 
htmlBase = codecs.open("mm03.html",'r') 
htmlFill = htmlBase.read().encode('utf-8') 
print htmlFill 
htmlFill = htmlFill.replace("«test»","TEST") 
print htmlFill 
htmlBase.close()

Output:

Traceback (most recent call last): 
    File "test2.py", line 4, in <module> 
    htmlFill = htmlBase.read().encode('utf-8') 
UnicodeDecodeError: 'ascii' codec can't decode byte 0xab in position 31: ordinal not in range(128)
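Every traceback above points at byte 0xab at position 31. The sketch below is one way to inspect what that byte implies. It assumes, and this is only a guess that the thread does not confirm, that mm03.html was saved in a single-byte Windows encoding such as cp1252 while the script itself is saved as UTF-8. The variable names are illustrative.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals

# The sample line as text (a unicode string).
line = "<span style='font-size:14.0pt'>«test»</span>"

# Assumption: the file on disk holds cp1252 bytes, not UTF-8 bytes.
as_cp1252 = line.encode("cp1252")
as_utf8 = line.encode("utf-8")

# In cp1252 the opening guillemet is the single byte 0xab. Its index
# matches the position quoted in the tracebacks.
position = as_cp1252.index(b"\xab")
print(position)

# In UTF-8 the same character is the two bytes 0xc2 0xab, so a search
# string taken from a UTF-8 source file never matches the cp1252 bytes,
# and replace() silently changes nothing.
needle_utf8 = "«test»".encode("utf-8")
found_in_cp1252 = needle_utf8 in as_cp1252
found_in_utf8 = needle_utf8 in as_utf8
print(found_in_cp1252, found_in_utf8)

# Decoding the cp1252 bytes as UTF-8 fails, because 0xab on its own is
# not a valid UTF-8 start byte.
try:
    as_cp1252.decode("utf-8")
    decode_error_at = None
except UnicodeDecodeError as exc:
    decode_error_at = exc.start
print(decode_error_at)
```

If that assumption holds, it would account for both the silent no-op in the first attempt and the "invalid start byte" errors in the .decode('utf-8') attempts.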

2016-09-29 Bibliophael

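You say the HTML file contains the line with '«test»', and then you use a `replace("«first»", "FIRST")` call, which doesn't work. _Of course not_, because the string '«first»' isn't in the file. If you use `replace("«test»", "TEST")`, it will work. – martineau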

I've fixed that now. I changed certain values from my actual case, but I guess I missed some. It still doesn't work, though. – Bibliophael

It works for me after the change I mentioned, with no errors. – martineau

You need to decode the string you want to replace before passing it to str.replace(). This works for me:

# -*- coding: utf-8 -*- 
import codecs 
htmlBase = codecs.open("mm03.html",'r') 
htmlFill = htmlBase.read() 
htmlFill = codecs.decode(htmlFill,'utf-8') 
substr = codecs.decode("«test»",'utf-8') 
htmlFill = htmlFill.replace(substr,"TEST") 
print htmlFill 
htmlBase.close()
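The same idea can be sketched with io.open, which decodes while reading, so both the file contents and the search string are unicode before replace() is called. This is a minimal sketch, not a definitive fix: replace_in_file is a hypothetical helper, the temporary demo file stands in for mm03.html, and the encoding argument is an assumption that must match how the real file was saved.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals

import io
import os
import tempfile


def replace_in_file(path, old, new, encoding):
    """Return the text of `path` with `old` replaced by `new`.

    Hypothetical helper: `encoding` must be the encoding the file was
    actually saved in (for example "utf-8" or "cp1252").
    """
    # io.open decodes the bytes as they are read, so `text` is unicode.
    with io.open(path, "r", encoding=encoding) as handle:
        text = handle.read()
    # `old` and `new` are unicode as well, so no implicit ASCII
    # conversion is attempted inside replace().
    return text.replace(old, new)


# Demo: write the sample line as UTF-8 into a temporary directory,
# then read it back and perform the replacement.
demo_path = os.path.join(tempfile.mkdtemp(), "mm03_demo.html")
with io.open(demo_path, "w", encoding="utf-8") as handle:
    handle.write("<span style='font-size:14.0pt'>«test»</span>")

result = replace_in_file(demo_path, "«test»", "TEST", "utf-8")
print(result)
```

If decoding as UTF-8 raises "invalid start byte", as in the tracebacks in the question, the file is probably not UTF-8, and passing a different encoding such as "cp1252" may be worth trying.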

2016-09-29 18:13:07 Ole
