UnicodeDecodeError when processing file names

I am using Python 2.7.3 on Ubuntu 12 x64.

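A folder on my file system holds roughly 200,000 files. Some of the file names contain HTML-encoded and escaped characters, because the files were originally downloaded from a website. Here are some examples: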

Jamaica%2008%20114.jpg
thai_trip_%E8%B0%83%E6%95%B4%E5%A4%A7%E5%B0%8F%20RAY_5313.jpg

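I wrote a simple Python script that walks the folder and renames every file whose name contains encoded characters. The new file name is obtained by simply decoding the string that makes up the file name.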

The script works for most of the files, but on some of them Python chokes and spits out the following error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 11: ordinal not in range(128) 
Traceback (most recent call last): 
    File "./download.py", line 53, in downloadGalleries 
    numDownloaded = downloadGallery(opener, galleryLink) 
    File "./download.py", line 75, in downloadGallery 
    filePathPrefix = getFilePath(content) 
    File "./download.py", line 90, in getFilePath 
    return cleanupString(match.group(1).strip()) + '/' + cleanupString(match.group(2).strip()) 
    File "/home/abc/XYZ/common.py", line 22, in cleanupString 
    return HTMLParser.HTMLParser().unescape(string) 
    File "/usr/lib/python2.7/HTMLParser.py", line 472, in unescape 
    return re.sub(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));", replaceEntities, s) 
    File "/usr/lib/python2.7/re.py", line 151, in sub 
    return _compile(pattern, flags).sub(repl, string, count)

Here is the content of my cleanupString function:

def cleanupString(string): 
    string = urllib2.unquote(string) 

    return HTMLParser.HTMLParser().unescape(string)

Below is a snippet of the code that calls cleanupString (this is not the same code as in the traceback above, but it produces the same error):

rootFolder = sys.argv[1]
pattern = r'.*\.jpg\s*$|.*\.jpeg\s*$'
reobj = re.compile(pattern, re.IGNORECASE)
imgs = []

for root, dirs, files in os.walk(rootFolder):
    for filename in files:
        foundFile = os.path.join(root, filename)

        if reobj.match(foundFile):
            imgs.append(foundFile)

for img in imgs:
    print 'Checking file: ' + img
    newImg = cleanupString(img)  # Code blows up here for some files

Can anyone suggest a way to get around this error? I have already tried adding

# -*- coding: utf-8 -*-

to the top of the script, but it had no effect.

Thanks.

Source

2012-09-27 Justin Kredible

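Your file names are byte strings containing UTF-8 bytes that represent Unicode characters. The HTML parser normally works with unicode data rather than byte strings, particularly when it encounters an ampersand escape, so Python automatically tries to decode the value for you, but by default it uses ASCII for that decoding. This fails for UTF-8 data, because it contains bytes outside the ASCII range.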
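The failure can be reduced to the decoding step alone. The following is a minimal sketch; the byte string is an illustrative stand-in (not one of the asker's files), and the explicit `decode('ascii')` call mimics what Python 2 attempts implicitly when a byte string meets unicode data.

```python
# Illustrative byte string: the UTF-8 encoding of U+2665 (a heart)
# is the three bytes e2 99 a5, so it starts with 0xe2.
raw = b'heartsofl\xe2\x99\xa5ve'

try:
    raw.decode('ascii')        # what Python 2 tries behind the scenes
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc          # 0xe2 is outside range(128)

text = raw.decode('utf-8')     # explicit decoding with the right codec works

print(repr(ascii_error))
print(repr(text))
```

The ASCII attempt fails at the first byte above 127, while the UTF-8 decode turns the three heart bytes into a single character.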

You need to explicitly decode your string to a unicode object:

def cleanupString(string): 
    string = urllib2.unquote(string).decode('utf8') 

    return HTMLParser.HTMLParser().unescape(string)

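Your next problem will be that you now have unicode file names, but your file system needs some form of encoding to work with those names. You can check what that encoding is with sys.getfilesystemencoding(); use it to re-encode your file names: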

def cleanupString(string): 
    string = urllib2.unquote(string).decode('utf8') 

    return HTMLParser.HTMLParser().unescape(string).encode(sys.getfilesystemencoding())
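As a worked check of the steps above, here is a sketch that runs them on the second sample name from the question. The try/except import shim, the name `cleanup_string`, and the hard-coded `'utf-8'` target encoding are assumptions made so the example runs on both Python 2 and Python 3; on a real system `sys.getfilesystemencoding()` reports the encoding to use.

```python
try:  # Python 2, as used in the question
    from urllib2 import unquote as unquote_to_bytes
    from HTMLParser import HTMLParser
    unescape = HTMLParser().unescape
except ImportError:  # Python 3 equivalents (assumption for portability)
    from urllib.parse import unquote_to_bytes
    from html import unescape


def cleanup_string(name):
    """Percent-decode, decode as UTF-8, then resolve HTML entities."""
    raw = unquote_to_bytes(name)   # %XX escapes -> raw bytes
    text = raw.decode('utf-8')     # raw bytes -> unicode text
    return unescape(text)          # &amp; and friends -> characters


name = 'thai_trip_%E8%B0%83%E6%95%B4%E5%A4%A7%E5%B0%8F%20RAY_5313.jpg'
cleaned = cleanup_string(name)

# Assumption: the file system uses UTF-8; check sys.getfilesystemencoding().
on_disk = cleaned.encode('utf-8')

print(repr(cleaned))
print(repr(on_disk))
```

Decoding before calling unescape means the HTML parser only ever sees unicode text, so no implicit ASCII decoding takes place.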

You can read up on how Python deals with Unicode in the Unicode HOWTO.

Source

2012-09-27 16:46:41

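You need to be careful when working with file names on Linux. No character encoding is enforced, and even when one is configured (usually UTF-8) you can still get file names that do not conform to it. You need to treat them as raw byte strings, or at least not fall over when you get an invalid name. – spencercw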

Come to think of it, the file names he is decoding are not even necessarily UTF-8, which could make things interesting. – spencercw

@spencercw: The examples he gave are. The '\xe2' byte in his error message is another clue; it is a typical UTF-8 lead byte. –
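The point raised in the comments above, that a name on disk may not be valid UTF-8, can be handled by decoding defensively. This is only a sketch: the helper name `decode_name` and the sample byte strings are hypothetical, and falling back to replacement characters is one possible policy, not something the answer prescribes.

```python
def decode_name(raw, encoding='utf-8'):
    """Decode a raw file name; return (text, ok) and never raise."""
    try:
        return raw.decode(encoding), True
    except UnicodeDecodeError:
        # Keep going: undecodable bytes become U+FFFD so the caller can
        # log or skip the file instead of crashing mid-run.
        return raw.decode(encoding, 'replace'), False


good = decode_name(b'caf\xc3\xa9.jpg')  # valid UTF-8
bad = decode_name(b'caf\xe9.jpg')       # Latin-1 byte, invalid as UTF-8

print(repr(good))
print(repr(bad))
```

With the flag returned alongside the text, a renaming script can leave suspicious files untouched and report them rather than abort halfway through 200,000 files.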

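It looks like you are running into this issue. I would try reversing the order in which you call unescape and unquote, since unquote would be adding non-ASCII characters to your file names, although that may not fix the problem.

What is the actual file name it is choking on?

Source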

2012-09-27 16:46:00 spencercw

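Actually, thinking about it, that is very likely the underlying problem, which makes your answer correct *as well*. Just not as complete as mine. :-) –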

@spencercw: This is the file name the script chokes on: / Galleries/nath & a heartsofl♥ve /♥♥♥粉紅色♥♥♥/ DSC_0080.JPG –

@spencercw: I tried your suggestion, but it did not work; same error. –
