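Reading utf-8 characters from a gzip file in python

I am trying to read a gunzipped file (.gz) in python and am having some trouble.

I used the gzip module to read it, but the file is encoded as a utf-8 text file, so eventually it reads an invalid character and crashes.

Does anyone know how to read gzip files encoded as utf-8 files? I know that there is a codecs module that can help, but I can't understand how to use it.

Thanks!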

import string 
import gzip 
import codecs 

f = gzip.open('file.gz','r') 

engines = {} 
line = f.readline() 
while line: 
    parsed = string.split(line, u'\u0001') 

    #do some things... 

    line = f.readline() 
for en in engines: 
    print(en)

2009-12-10 Juan Besa

Can you post the code you have so far? – 2009-12-10 20:03:42

Could you convert the utf-8 file to ascii and then try to unzip it? Hmm.... – whatsisname 2009-12-10 20:06:06

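I don't see why this should be so hard.

What exactly are you doing? Please explain "eventually it reads an invalid character".

It should be as simple as: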

import gzip 
fp = gzip.open('foo.gz') 
contents = fp.read() # contents now has the uncompressed bytes of foo.gz 
fp.close() 
u_str = contents.decode('utf-8') # u_str is now a unicode string

EDITED

This answer works for Python 2. In Python 3, please see @SeppoEnarvi's answer at https://stackoverflow.com/a/19794943/610569 (it uses the rt mode for gzip.open).
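
For reference, the read-then-decode approach can be wrapped in a small helper. This is only a sketch, assuming Python 3: `read_gzip_text`, the temporary file and the sample text are made up for illustration and are not part of the original answer. It loads the whole file into memory before decoding.

```python
import gzip
import os
import tempfile


def read_gzip_text(path, encoding="utf-8"):
    """Read a whole gzip file and decode it in one step (hypothetical helper)."""
    with gzip.open(path, "rb") as fp:  # binary mode: read() returns bytes
        raw = fp.read()
    return raw.decode(encoding)  # bytes -> str


# Demo with a throwaway file; the path and sample text are made up.
sample = "caf\u00e9\u0001na\u00efve\n"
path = os.path.join(tempfile.mkdtemp(), "sample.gz")
with gzip.open(path, "wb") as fp:
    fp.write(sample.encode("utf-8"))

text = read_gzip_text(path)
print(text == sample)
```

Decoding only after the full read means a multi-byte character can never be cut in half, which is one reason this form is hard to get wrong.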

2009-12-10 20:11:27 sjbrown

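+1 ... this is the clearest and least complicated of the 3 answers so far. – 2009-12-10 22:49:23

Not necessarily the simplest, since you have to decode every line you read. With the getreader implementation this happens automatically, so every line is unicode – SecurityJoe 2012-01-05 20:37:04

Although this is a good solution, I have a feeling it will not scale well to large files. – 2016-11-09 15:59:20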
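
On the scaling concern: one way to avoid holding the whole file in memory is to decode it chunk by chunk with an incremental decoder. The following is a sketch under assumptions, not something from the thread: `iter_decoded_chunks`, the chunk size and the sample file are hypothetical names and values chosen for the demo.

```python
import codecs
import gzip
import os
import tempfile


def iter_decoded_chunks(path, encoding="utf-8", chunk_size=64 * 1024):
    """Yield decoded text piece by piece (hypothetical helper)."""
    decoder = codecs.getincrementaldecoder(encoding)()
    with gzip.open(path, "rb") as fp:
        while True:
            chunk = fp.read(chunk_size)
            if not chunk:
                break
            # The decoder buffers a multi-byte character that is split
            # across two chunks and emits it once it is complete.
            yield decoder.decode(chunk)
    # final=True raises if the data ends in the middle of a character.
    yield decoder.decode(b"", final=True)


# Demo: 2-byte characters read in 3-byte chunks, so chunks split characters.
sample = "\u00e9" * 1000
path = os.path.join(tempfile.mkdtemp(), "big.gz")
with gzip.open(path, "wb") as fp:
    fp.write(sample.encode("utf-8"))

joined = "".join(iter_decoded_chunks(path, chunk_size=3))
print(joined == sample)
```

Calling `bytes.decode` on each chunk separately would fail whenever a chunk boundary falls inside a character; the incremental decoder is what makes chunked reading safe.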

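Maybe

import codecs
import gzip

# fname is the path to the .gz file
zf = gzip.open(fname, 'rb')
reader = codecs.getreader("utf-8")
contents = reader(zf)  # wraps the binary stream and decodes as it reads
for line in contents:
    pass  # each line is already a unicode string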

2009-12-10 20:21:02

As a one-liner: for line in codecs.getreader('utf-8')(gzip.open(fname), errors='replace'), which also adds control over the error handling – SecurityJoe 2012-01-05 20:38:05
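
To illustrate what `errors='replace'` does with the `codecs.getreader` approach, here is a small sketch assuming Python 3. The file name and its contents are invented for the demo; the second line deliberately contains a byte that is never valid in UTF-8.

```python
import codecs
import gzip
import os
import tempfile

# Build a throwaway gzip file with one clean line and one broken line.
path = os.path.join(tempfile.mkdtemp(), "broken.gz")
with gzip.open(path, "wb") as fp:
    fp.write(b"good line\n")
    fp.write(b"bad \xff byte\n")  # 0xff is never valid in UTF-8

reader_factory = codecs.getreader("utf-8")
with gzip.open(path, "rb") as zf:
    # errors="replace" swaps undecodable bytes for U+FFFD instead of raising.
    reader = reader_factory(zf, errors="replace")
    lines = [line for line in reader]

print(lines)
```

With the default `errors='strict'` the same loop would raise `UnicodeDecodeError` on the second line, which matches the crash described in the question.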

In Pythonic form (2.5 or greater)

from __future__ import with_statement # for 2.5, does nothing in 2.6
from gzip import open as gzopen

with gzopen('foo.gz') as gzfile:
    for line in gzfile:
        print(line.decode('utf-8'))  # each line is bytes until decoded

2009-12-10 20:26:12

This is possible since Python 3.3:

import gzip 
gzip.open('file.gz', 'rt', encoding='utf-8')

Note that gzip.open() requires you to explicitly specify text mode ('t').
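
Applied to the loop from the question, text mode might look like the sketch below (Python 3.3 or later assumed). The sample rows and the temporary path are made up; only the `\u0001` separator comes from the question.

```python
import gzip
import os
import tempfile

# Write a small demo file; the contents are invented for illustration.
path = os.path.join(tempfile.mkdtemp(), "file.gz")
with gzip.open(path, "wt", encoding="utf-8") as fp:
    fp.write("caf\u00e9\u0001paris\n")
    fp.write("\u6771\u4eac\u0001tokyo\n")

rows = []
with gzip.open(path, "rt", encoding="utf-8") as fp:
    for line in fp:  # each line is already a str, no manual decode needed
        rows.append(line.rstrip("\n").split("\u0001"))

print(rows)
```

Because the file object yields `str`, the separator and the line are the same type, so the split never triggers an implicit conversion.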

2013-11-05 17:20:37

The above produced tons of decoding errors. I used this:

import gzip
import io
for line in io.TextIOWrapper(io.BufferedReader(gzip.open(filePath)), encoding='utf8', errors='ignore'):
    ...
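
Since `errors='ignore'` silently drops the offending bytes, it can help to look at where decoding fails before choosing an error handler. The sketch below assumes Python 3; `first_decode_error` and the demo file are hypothetical, and the demo deliberately writes a Latin-1 encoded line to trigger a failure.

```python
import gzip
import os
import tempfile


def first_decode_error(path, encoding="utf-8"):
    """Return (line_number, byte_offset_in_line, bad_bytes), or None if clean."""
    with gzip.open(path, "rb") as fp:
        for lineno, raw in enumerate(fp, start=1):
            try:
                raw.decode(encoding)
            except UnicodeDecodeError as exc:
                return lineno, exc.start, raw[exc.start:exc.end]
    return None


# Demo: the second line is Latin-1, not UTF-8 (made-up data).
path = os.path.join(tempfile.mkdtemp(), "mixed.gz")
with gzip.open(path, "wb") as fp:
    fp.write(b"ok\n")
    fp.write("caf\u00e9\n".encode("latin-1"))

result = first_decode_error(path)
print(result)
```

If many lines fail like this, the file may simply be in a different encoding, in which case passing the right `encoding` is preferable to ignoring errors.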

2014-08-10 20:13:14 Yurik
