2014-10-08 31 views
1

Python: Splitting a large text file with many headers

I have a large text file that looks like this:

lat lon altitude pressure 
3 lines data group bsas 
2.3 4.5 45.0 875 
5.6 6.5 46.2 676 
3.4 3.4 48.2 565 
6 lines data group sdad 
3.4 4.5 56.1 535 
5.6 6.5 46.2 676  
3.4 4.5 56.1 535 
2.3 4.5 45.0 875 
5.6 6.5 46.2 676 
3.4 3.4 48.2 565 
50 lines data group asdasd 
5.5 6.6 44.5 343 
... 
3.7 8.4 56.5 456 
... and so on 

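I want to split the whole text file into separate data groups, with each group stored in a 2D array. So far I have tried two ways of doing this.

The first way goes through the file line by line and collects the data as follows: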

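# Wave holds one data group; each attribute is a list of column values:
# lat, lon, altitude, pressure
class Wave(object):
    def __init__(self):
        self.lat, self.lon = [], []
        self.altitude, self.pressure = [], []

def read_waves(filename):  # function name is illustrative
    wave_list = []
    with open(filename, 'r') as f:
        next(f)  # skip the column header
        wave = None
        for line in f:
            if 'data' in line:
                # a group header: store the previous group, start a new one
                if wave is not None:
                    wave_list.append(wave)
                wave = Wave()
            elif wave is not None and line.strip():
                fields = line.split()
                wave.lat.append(fields[0])
                wave.lon.append(fields[1])
                wave.altitude.append(fields[2])
                wave.pressure.append(fields[3])
        if wave is not None:
            wave_list.append(wave)  # store the last group
    return wave_list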

The second way uses numpy's loadtxt:

import numpy as np
from io import StringIO

def read_waves(filename):  # function name is illustrative
    with open(filename, 'r') as f:
        txt = f.read()
    # split by "data" and drop the first element (the column header)
    raw_chunks = txt.split("data")[1:]
    # list to store the results
    wave_list = []
    # go through each chunk
    for rc in raw_chunks:
        # the first "\n" ends the remainder of the group header line
        first_id = rc.find("\n")
        # the last "\n" precedes the start of the next group header
        # (this assumes the file ends with a newline)
        last_id = rc.rfind("\n")
        # temporary chunk holding only the numeric rows
        temp_chunk = rc[first_id:last_id]
        # load the data using loadtxt
        data = np.loadtxt(StringIO(temp_chunk))
        wave = Wave()
        wave.lat = data.T[0]
        wave.lon = data.T[1]
        wave.altitude = data.T[2]
        wave.pressure = data.T[3]
        wave_list.append(wave)
    return wave_list

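However, both approaches are quite slow. I looked through the pandas documentation but could not find a way to skip headers that appear in the middle of a file. I also looked at examples from other questions: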

Splitting a file based on text in Python

Split the text file in python

How to split and parse a big text file in python in a memory-efficient way?

But none of them solved my problem. Is there a faster way to read this kind of text file? Thanks in advance.

+0

What do you want to split the data on? – 2014-10-08 20:25:54

+0

@Padraic The data shown above is an example. Or what do you mean? Sorry, I don't quite understand – 2014-10-08 20:45:04

+0

Yes, so you want to split wherever there is text? – 2014-10-08 20:45:51

Answers

1

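Search for lines that start with <number> lines data group <something>, store the group name (<something>) and the number of lines to read (<number>), and whenever a line matches, store the following n lines under that group, for example:

Given the following code: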

from itertools import islice
from collections import defaultdict
import re

data = defaultdict(list)
with open(filename) as fin:
    header = next(fin, '').split()  # column names from the first line
    for line in fin:
        # group header: "<number> lines ... <name>"
        m = re.match(r'(\d+) lines.*(\b\w+)$', line)
        if m:
            # read the next <number> lines into the list for this group
            data[m.group(2)].extend(islice(fin, int(m.group(1))))

Given the input:

lat lon altitude pressure 
3 lines data group bsas 
2.3 4.5 45.0 875 
5.6 6.5 46.2 676 
3.4 3.4 48.2 565 
6 lines data group sdad 
3.4 4.5 56.1 535 
5.6 6.5 46.2 676  
3.4 4.5 56.1 535 
2.3 4.5 45.0 875 
5.6 6.5 46.2 676 
3.4 3.4 48.2 565 

This gives you data as:

{'bsas': ['2.3 4.5 45.0 875\n', '5.6 6.5 46.2 676\n', '3.4 3.4 48.2 565\n'], 
'sdad': ['3.4 4.5 56.1 535\n', 
      '5.6 6.5 46.2 676 \n', 
      '3.4 4.5 56.1 535\n', 
      '2.3 4.5 45.0 875\n', 
      '5.6 6.5 46.2 676\n', 
      '3.4 3.4 48.2 565\n']} 
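Since the question asks for each group as a 2D array, the raw strings collected above still need to be converted to numbers. The following is only a minimal sketch of one way to do that with plain lists; the helper name parse_rows is hypothetical, and the sample rows are copied from the bsas group shown above.

```python
def parse_rows(lines):
    # Hypothetical helper: turn whitespace-separated text rows into
    # a list of lists of floats, skipping blank lines.
    return [[float(value) for value in line.split()]
            for line in lines if line.strip()]

# Rows copied from the 'bsas' group in the output above
bsas_lines = ['2.3 4.5 45.0 875\n', '5.6 6.5 46.2 676\n', '3.4 3.4 48.2 565\n']
bsas = parse_rows(bsas_lines)

# Transpose the rows to get one list per column
lat, lon, altitude, pressure = (list(column) for column in zip(*bsas))
```

If numpy is already in use, passing the result of parse_rows to np.array should give the equivalent 2D array.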

And, per your comment, if the group name is not significant, then:

data = []
with open(filename) as fin:
    header = next(fin, '').split()
    for line in fin:
        m = re.match(r'(\d+) lines.*(\b\w+)$', line)
        if m:
            data.append(list(islice(fin, int(m.group(1)))))
+0

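Actually, the "lines data group" lines contain an undefined amount of whitespace before the number. How can I get around this? Should I use 're.search' instead of 're.match', or should I use 'line.lstrip()'? Which one is faster? – 2014-10-09 11:19:16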

+0

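@Jon Clements: Oh, I found that adding '\s*' is probably the best way to ignore those spaces. Now, if I want to store these data groups in a list or array instead of a dictionary, how should I do that? The reason is that the string after "data group" is arbitrary and may be the same on several lines, so I don't want it to be the dictionary key. Thanks – 2014-10-09 11:34:31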

+0

@hoangtran So if you have '1 data group one\nblah\n1 data group one\nblah2\n' then you want two lists, rather than the single list containing ['blah', 'blah2'] that this approach gives you? – 2014-10-09 13:12:48
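Following up on the comments above, here is a minimal sketch that combines the two points raised there: a leading '\s*' in the pattern to tolerate whitespace before the line count, and a list of (name, rows) pairs so that groups with the same name stay separate. The function name read_groups, the exact header wording in the pattern, and the in-memory sample are assumptions made for illustration.

```python
import io
import re
from itertools import islice

# Leading \s* tolerates any indentation before the line count.
# The header wording is assumed to match the sample in the question.
HEADER = re.compile(r'\s*(\d+) lines data group (\S+)')

def read_groups(fin):
    # Return a list of (name, rows) pairs, one per group header, in file order.
    groups = []
    next(fin, None)  # skip the column header line
    for line in fin:
        m = HEADER.match(line)
        if m:
            count, name = int(m.group(1)), m.group(2)
            # consume exactly `count` data lines from the same iterator
            rows = [row.split() for row in islice(fin, count)]
            groups.append((name, rows))
    return groups

# Illustrative input: two groups share the name "one", the first is indented
sample = io.StringIO(
    "lat lon altitude pressure\n"
    "   1 lines data group one\n"
    "2.3 4.5 45.0 875\n"
    "1 lines data group one\n"
    "5.6 6.5 46.2 676\n"
)
groups = read_groups(sample)
```

With this layout, groups holds two separate entries even though both headers carry the same name, and the same function can be called on a file opened with open(filename).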