2014-11-21

I want to parse each line of a database file to get it ready for import. The file has fixed-width lines, but the widths are in characters, not bytes. I wrote something based on Martineau's answer to "Unpacking fixed width unicode file lines with special characters in Python. UnicodeDecodeError", but I'm running into problems with special characters.

Sometimes they break the expected widths, and other times they raise a UnicodeDecodeError. I believe the decode error could be fixed, but can I keep using struct.unpack this way and still decode the special characters correctly? I think the problem is that they are encoded as multiple bytes, which throws off the expected field widths, which I understand are counted in bytes rather than characters.
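The mismatch is easy to demonstrate: a character such as ô is a single code point but two bytes in UTF-8, so every accented character shifts all byte-counted fields that follow it. A small illustration (not part of the original code):

```python
# One code point, two bytes: byte-based field widths drift after "ô".
text = "| rôn. 2x |"
print(len(text))                  # 11 characters
print(len(text.encode("utf-8")))  # 12 bytes
```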

import os, csv

def ParseLine(arquivo):
    import struct, string
    format = "1x 12s 1x 18s 1x 16s"
    expand = struct.Struct(format).unpack_from
    unpack = lambda line: tuple(s.decode() for s in expand(line.encode()))
    for line in arquivo:
        fields = unpack(line)
        yield [x.strip() for x in fields]

Caminho = r"C:\Sample"
os.chdir(Caminho)

with open("Sample data.txt", 'r') as arq:
    with open("Out" + ".csv", "w", newline='') as sai:
        Write = csv.writer(sai, delimiter=";", quoting=csv.QUOTE_MINIMAL).writerows
        for line in ParseLine(arq):
            Write([line])

Sample data:

|  field 1|  field 2  |  field 3 | 
| sreaodrsa | raesodaso t.thl o| .tdosadot. osa | 
| resaodra | rôn. 2x 17/220V | sreao.tttra v | 
| esarod sê | raesodaso t.thl o| .tdosadot. osa | 
| esarod sa í| raesodaso t.thl o| .tdosadot. osa | 

Actual output:

field 1;field 2;field 3 
sreaodrsa;raesodaso t.thl o;.tdosadot. osa 
resaodra;rôn. 2x 17/22;V | sreao.tttra 

In the output we see lines 1 and 2 come out as expected. Line 3 has the wrong widths, probably because of the multi-byte ô. Line 4 raises the following exception:

Traceback (most recent call last):
  File "C:\Sample\FindSample.py", line 18, in <module>
    for line in ParseLine(arq):
  File "C:\Sample\FindSample.py", line 9, in ParseLine
    fields = unpack(line)
  File "C:\Sample\FindSample.py", line 7, in <lambda>
    unpack = lambda line: tuple(s.decode() for s in expand(line.encode()))
  File "C:\Sample\FindSample.py", line 7, in <genexpr>
    unpack = lambda line: tuple(s.decode() for s in expand(line.encode()))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc2 in position 11: unexpected end of data

I need to perform specific operations on each field, so I can't just run re.sub over the whole file as I did before. I'd like to keep this code because it seems efficient and is on the verge of working, but if there is a more efficient way to parse, I'm open to trying it. I need to preserve the special characters.
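The UnicodeDecodeError itself can be reproduced in isolation: slicing the encoded bytes at a fixed byte offset can cut a two-byte character in half, and decoding the truncated slice fails exactly like the traceback above (a small repro, not the original code):

```python
# Byte-based slicing can cut a 2-byte character like "ê" in half; decoding
# the truncated slice reproduces the same class of error as the traceback.
data = "| esarod sê |".encode("utf-8")
field = data[1:11]        # stops one byte into the two-byte "ê"
try:
    print(field.decode("utf-8"))
except UnicodeDecodeError as exc:
    print(exc.reason)     # unexpected end of data
```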

Answer


Indeed, the struct approach falls down here, because it expects fields to be a fixed number of bytes wide, while your format uses a fixed number of code points. I would not use struct here. Your lines have already been decoded to Unicode values; just extract the data with slicing:

def ParseLine(arquivo):
    slices = [slice(1, 13), slice(14, 32), slice(33, 49)]
    for line in arquivo:
        yield [line[s].strip() for s in slices]

This deals entirely in characters of the already-decoded lines, not bytes. If you have field widths rather than indices, you can also generate the slice() objects:

def widths_to_slices(widths):
    pos = 0
    for width in widths:
        pos += 1  # delimiter
        yield slice(pos, pos + width)
        pos += width

def ParseLine(arquivo):
    widths = (12, 18, 16)
    for line in arquivo:
        yield [line[s].strip() for s in widths_to_slices(widths)]
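For the widths used here, the generator reproduces exactly the slices that were hard-coded above:

```python
def widths_to_slices(widths):
    pos = 0
    for width in widths:
        pos += 1  # skip the "|" delimiter
        yield slice(pos, pos + width)
        pos += width

print(list(widths_to_slices((12, 18, 16))))
# [slice(1, 13), slice(14, 32), slice(33, 49)]
```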

Demo:

>>> sample = '''\ 
... |  field 1|  field 2  |  field 3 | 
... | sreaodrsa | raesodaso t.thl o| .tdosadot. osa | 
... | resaodra | rôn. 2x 17/220V | sreao.tttra v | 
... | esarod sê | raesodaso t.thl o| .tdosadot. osa | 
... | esarod sa í| raesodaso t.thl o| .tdosadot. osa | 
... '''.splitlines() 
>>> def ParseLine(arquivo):
...     slices = [slice(1, 13), slice(14, 32), slice(33, 49)]
...     for line in arquivo:
...         yield [line[s].strip() for s in slices]
...
>>> for line in ParseLine(sample):
...     print(line)
...
['field 1', 'field 2', 'field 3'] 
['sreaodrsa', 'raesodaso t.thl o', '.tdosadot. osa'] 
['resaodra', 'rôn. 2x 17/220V', 'sreao.tttra v'] 
['esarod sê', 'raesodaso t.thl o', '.tdosadot. osa'] 
['esarod sa í', 'raesodaso t.thl o', '.tdosadot. osa'] 

I used timeit to compare the two approaches on a 150 MB file. The struct approach ran in 108 seconds, while slicing took 67 seconds. I had to make some adjustments to fit it into my code, which may make it faster still, but I'm now convinced slicing is the way to go. Thanks! – mvbentes 2014-11-22 17:36:11
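A minimal sketch of the kind of comparison the comment describes, using timeit on a single made-up ASCII line (the 150 MB file and the timings above are the commenter's; the sample line and repetition count here are assumptions):

```python
import struct
import timeit

# One well-formed fixed-width line: "|" + 12 + "|" + 18 + "|" + 16 chars + "|"
line = ("|" + "sreaodrsa".ljust(12) +
        "|" + "raesodaso t.thl o".ljust(18) +
        "|" + ".tdosadot. osa".ljust(16) + "|")

expand = struct.Struct("1x 12s 1x 18s 1x 16s").unpack_from
slices = [slice(1, 13), slice(14, 32), slice(33, 49)]

def via_struct(line=line):
    # encode to bytes, unpack fixed byte widths, decode each field back
    return [s.decode().strip() for s in expand(line.encode())]

def via_slices(line=line):
    # slice the already-decoded string directly by character positions
    return [line[s].strip() for s in slices]

assert via_struct() == via_slices()  # both parse the same fields (ASCII only)
print("struct:", timeit.timeit(via_struct, number=100_000))
print("slices:", timeit.timeit(via_slices, number=100_000))
```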
