2016-09-29 37 views
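Using the .replace() method on a string read from an HTML file containing Unicode

I want to read an .html file as raw text and replace instances of a substring that contains Unicode characters with another substring. Suppose the file mm03.html contains just one line of text: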

<span style='font-size:14.0pt'>«test»</span> 

I want to read mm03.html, treat its raw text as a string, and call replace so that the output looks like this:

<span style='font-size:14.0pt'>TEST</span> 

On my first attempt, I wrote the following code...

# -*- coding: utf-8 -*- 
import codecs 
htmlBase = codecs.open("mm03.html",'r') 
htmlFill = htmlBase.read() 
print htmlFill 
htmlFill = htmlFill.replace("«test»","TEST") 
print htmlFill 
htmlBase.close() 

...and expected it to print the original line listed above first, followed by the second line. Instead, it prints the first line twice.

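OK. So this is probably a Unicode decoding problem, right? Maybe, but when I modify the code according to the Unicode-related advice found all over this site, the problem persists in different shades. Moreover, the desired behavior can be achieved by explicitly defining htmlBase as...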

htmlBase = """<span style='font-size:14.0pt'>«test»</span>""" 

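...which leads me to believe there is something I don't know about how Python reads HTML files. I have tried opening mm03.html in 'w' mode, but that doesn't seem to work and tends to clobber the original file. It wouldn't make much sense for a string read from a read-only file to be read-only itself, but I could be wrong.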
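One way to see why a hard-coded literal can behave differently from the file contents is to compare bytes. The following is a minimal sketch, not a diagnosis of the real file: it assumes mm03.html was saved as cp1252 (the question does not state the file's encoding) while the script itself is UTF-8, as its coding line declares. The names `needle`, `file_bytes`, `found_utf8`, `found_cp1252` and `offset` are illustrative only.

```python
# -*- coding: utf-8 -*-
# Sketch: the same text has different byte sequences in different encodings.
# Assumption: the file on disk is cp1252; the script source is UTF-8.
needle = u"\u00abtest\u00bb"  # the text «test»

as_utf8 = needle.encode("utf-8")     # bytes a UTF-8 source literal holds in Python 2
as_cp1252 = needle.encode("cp1252")  # bytes a cp1252 file would hold on disk

# Stand-in for the raw contents of the file under the cp1252 assumption.
file_bytes = u"<span style='font-size:14.0pt'>\u00abtest\u00bb</span>".encode("cp1252")

found_utf8 = as_utf8 in file_bytes      # the UTF-8 byte pattern is not in the file
found_cp1252 = as_cp1252 in file_bytes  # the cp1252 byte pattern is
offset = file_bytes.find(b"\xab")       # index of the first 0xab byte
```

Under these assumptions a byte-for-byte replace with the UTF-8 literal finds nothing to replace, so the line would be printed unchanged both times.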

Here are a few script/output pairs I have been chewing on.

  1. Adding an unquoted 'u' before the string I want replaced

    # -*- coding: utf-8 -*- 
    import codecs 
    htmlBase = codecs.open("mm03.html",'r') 
    htmlFill = htmlBase.read() 
    print htmlFill 
    htmlFill = htmlFill.replace(u"«test»","TEST") 
    print htmlFill 
    htmlBase.close() 
    

    Output:

    <span style='font-size:14.0pt'>½test╗</span> 
    Traceback (most recent call last): 
        File "test2.py", line 6, in <module> 
        htmlFill = htmlFill.replace(u"«test»","TEST") 
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xab in position 31: ordinal not in range(128) 
    
  2. Applying .decode('utf-8') to the string returned by .read()

    # -*- coding: utf-8 -*- 
    import codecs 
    htmlBase = codecs.open("mm03.html",'r') 
    htmlFill = htmlBase.read().decode('utf-8') 
    print htmlFill 
    htmlFill = htmlFill.replace(u"«test»","TEST") 
    print htmlFill 
    htmlBase.close() 
    

    Output:

    Traceback (most recent call last): 
        File "test2.py", line 4, in <module> 
        htmlFill = htmlBase.read().decode('utf-8') 
        File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode 
        return codecs.utf_8_decode(input, errors, True) 
    UnicodeDecodeError: 'utf8' codec can't decode byte 0xab in position 31: invalid start byte 
    
  3. Applying .encode('utf-8') to the string returned by .read()

    # -*- coding: utf-8 -*- 
    import codecs 
    htmlBase = codecs.open("mm03.html",'r') 
    htmlFill = htmlBase.read().encode('utf-8') 
    print htmlFill 
    htmlFill = htmlFill.replace(u"«test»","TEST") 
    print htmlFill 
    htmlBase.close() 
    

    Output:

    Traceback (most recent call last): 
        File "test2.py", line 4, in <module> 
        htmlFill = htmlBase.read().encode('utf-8') 
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xab in position 31: ordinal not in range(128) 
    
  4. Applying .decode('utf-8') to the string returned by .read(), without the 'u' prefix on the target substring

    # -*- coding: utf-8 -*- 
    import codecs 
    htmlBase = codecs.open("mm03.html",'r') 
    htmlFill = htmlBase.read().decode('utf-8') 
    print htmlFill 
    htmlFill = htmlFill.replace("«test»","TEST") 
    print htmlFill 
    htmlBase.close() 
    

    Output:

    Traceback (most recent call last): 
        File "test2.py", line 4, in <module> 
        htmlFill = htmlBase.read().decode('utf-8') 
        File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode 
        return codecs.utf_8_decode(input, errors, True) 
    UnicodeDecodeError: 'utf8' codec can't decode byte 0xab in position 31: invalid start byte 
    
  5. Applying .encode('utf-8') to the string returned by .read(), without the 'u' prefix on the target string

    # -*- coding: utf-8 -*- 
    import codecs 
    htmlBase = codecs.open("mm03.html",'r') 
    htmlFill = htmlBase.read().encode('utf-8') 
    print htmlFill 
    htmlFill = htmlFill.replace("«test»","TEST") 
    print htmlFill 
    htmlBase.close() 
    

    Output:

    Traceback (most recent call last): 
        File "test2.py", line 4, in <module> 
        htmlFill = htmlBase.read().encode('utf-8') 
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xab in position 31: ordinal not in range(128) 
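
The byte 0xab reported in the tracebacks above can be checked against a few candidate encodings. In Python 2, mixing a byte string with a unicode string, or calling .encode() on a byte string, first decodes the bytes implicitly as ASCII, which is where the 'ascii' codec errors come from. The sketch below performs those decodes explicitly. It assumes the file's bytes are as written in `raw` (reconstructed from the tracebacks, not read from the real file), and `try_decode` is a hypothetical helper introduced for illustration.

```python
# -*- coding: utf-8 -*-
# Sketch: try decoding the same bytes with several encodings.
# Assumption: `raw` mirrors the file contents, with 0xab and 0xbb as single bytes.
raw = b"<span style='font-size:14.0pt'>\xabtest\xbb</span>"


def try_decode(data, encoding):
    """Return the decoded text, or a short failure note if decoding fails."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return "failed: %s" % exc.reason


# ascii and utf-8 reject a lone 0xab byte; cp1252 maps 0xab and 0xbb to the
# angle quotes; cp437 maps them to the characters seen in the console output
# of attempt 1, which suggests (but does not prove) a cp437 console.
results = {enc: try_decode(raw, enc) for enc in ("ascii", "utf-8", "cp1252", "cp437")}
```

If the real file behaves like `raw`, this points to a single-byte encoding such as cp1252 or Latin-1 rather than UTF-8, though only inspecting the actual file can confirm that.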
    
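You say the HTML file contains the line '«test»' and then use a 'replace("«first»", "FIRST")' call that doesn't work. _Of course not_, because the string '«first»' isn't in the file. If you use 'replace("«test»", "TEST")', it works. – martineau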

I have fixed that now. I changed certain values from my actual case, but I guess I missed a few. It still doesn't work, though. – Bibliophael

Works for me after the change I mentioned, with no errors. – martineau

Answers

You need to decode the string you want replaced before passing it to str.replace(). This works for me:

# -*- coding: utf-8 -*- 
import codecs 
htmlBase = codecs.open("mm03.html",'r') 
htmlFill = htmlBase.read() 
htmlFill = codecs.decode(htmlFill,'utf-8') 
substr = codecs.decode("«test»",'utf-8') 
htmlFill = htmlFill.replace(substr,"TEST") 
print htmlFill 
htmlBase.close() 
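
An alternative that avoids manual decode calls is to let the file object decode on read. This is a minimal sketch rather than the answer's method: it uses io.open, which accepts an encoding argument in both Python 2.7 and Python 3, and it assumes the file is cp1252 because of the 0xab byte in the question's tracebacks. If the file is really UTF-8, pass encoding="utf-8" instead. The function name `replace_in_file` and the temporary file are hypothetical, for illustration only.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def replace_in_file(path, old, new, encoding="cp1252"):
    """Read `path` as text in `encoding` and return it with `old` replaced by `new`."""
    with io.open(path, "r", encoding=encoding) as handle:
        text = handle.read()  # unicode text, decoded by the file object
    return text.replace(old, new)


# Demo on a temporary stand-in for mm03.html, written as cp1252 bytes.
fd, path = tempfile.mkstemp(suffix=".html")
with os.fdopen(fd, "wb") as handle:
    handle.write(u"<span style='font-size:14.0pt'>\u00abtest\u00bb</span>".encode("cp1252"))

result = replace_in_file(path, u"\u00abtest\u00bb", u"TEST")
os.remove(path)
```

Because both the file text and the search string are unicode here, no implicit ASCII decoding takes place during the replace.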