Reading columns from a text file that has no clear delimiter

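I have this text file: http://henke.lbl.gov/tmp/xray6286.dat
from which I want to pull out the Energy and Transmission columns.

Unfortunately, it has no clear delimiter: the words are separated by runs of spaces.

Running something like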

import csv

with open('xray6286.dat', 'U') as data:
    reader = csv.reader(data, delimiter=' ')
    for line in reader:
        print line

produces output like this:

['', 'Cu', 'Density=8.96', 'Thickness=100.', 'microns'] 
['', 'Photon', 'Energy', '(eV),', 'Transmission'] 
['', '', '', '', '5000.0', '', '', '', '', '', '0.52272E-07'] 
['', '', '', '', '5250.0', '', '', '', '', '', '0.42227E-06'] 
['', '', '', '', '5500.0', '', '', '', '', '', '0.24383E-05']

I can brute-force it into giving me the values I want with the following code:

import csv

energy = []
transmission = []

with open('xray6286.dat', 'U') as data:
    reader = csv.reader(data, delimiter='\n')
    for line in reader:
        # Skip the two header lines.
        if reader.line_num > 2:
            cleaned_line = []
            # Splitting on a single space yields empty strings; keep the rest.
            for word in line[0].split(' '):
                if word:
                    cleaned_line.append(word)
            energy.append(cleaned_line[0])
            transmission.append(cleaned_line[1])

But I was wondering whether anyone knows a more... conventional way to accomplish this?

2015-01-09 Ben

The regex split method can separate the data points on any amount of whitespace.

import re 

for word in re.split(r'\s+', line): 
    print word
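
One caveat with this approach: the data lines in this file begin with spaces, so splitting on \s+ yields an empty field at the start (and at the end, if the newline is still attached) unless the line is stripped first. A minimal sketch in Python 3, using a sample line reconstructed from the output shown in the question; the variable names are illustrative only:

```python
import re

# Sample data line as it appears in the file: leading spaces, then two columns.
line = "    5000.0      0.52272E-07\n"

# Splitting the raw line keeps an empty string for the leading
# whitespace and another for the trailing newline.
raw_fields = re.split(r"\s+", line)

# Stripping the line first leaves only the two values.
fields = re.split(r"\s+", line.strip())

print(raw_fields)  # ['', '5000.0', '0.52272E-07', '']
print(fields)      # ['5000.0', '0.52272E-07']
```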

2015-03-15 19:54:04 Ben

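Whenever you can avoid using a regex, you should: regular expressions are expensive to evaluate. – alfasin

@alfasin OK, that's good to know. How much does precompiling the regex help? 'spaces = re.compile(r'\s+'); spaces.split(line);'? – Ben

Compiling isn't the time-consuming part; it's the parsing that matters. – alfasin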
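
For this particular task the regex can be avoided altogether: str.split() called with no arguments splits on runs of whitespace and ignores leading and trailing whitespace. The following is a minimal Python 3 sketch comparing it with the precompiled pattern from the comment above. The sample line and the helper names split_plain and split_regex are illustrative, and the timeit calls are only there so readers can measure on their own machine, not to claim any particular result:

```python
import re
import timeit

# Sample data line reconstructed from the output in the question.
line = "    5000.0      0.52272E-07\n"
spaces = re.compile(r"\s+")


def split_plain(text):
    # No-argument split: collapses whitespace runs and drops empty fields.
    return text.split()


def split_regex(text):
    # Precompiled regex; strip first so no empty fields appear at the ends.
    return spaces.split(text.strip())


if __name__ == "__main__":
    print(split_plain(line))
    print(split_regex(line))
    # Print rough timings for each approach (values depend on the machine).
    for fn in (split_plain, split_regex):
        print(fn.__name__, timeit.timeit(lambda: fn(line), number=100000))
```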

Using if word: is perfectly fine. Another option is to filter out the empty strings by replacing:

for word in line[0].split(' '):

with:

for word in filter(bool, line[0].split(' ')):
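
A small sketch of what that replacement does, written for Python 3, where filter returns a lazy iterator rather than a list. The sample line is reconstructed from the output shown in the question:

```python
# Sample data line: four leading spaces, six spaces between the columns.
line = "    5000.0      0.52272E-07"

# Splitting on a single space leaves an empty string for every extra space.
pieces = line.split(' ')

# filter(bool, ...) keeps only the truthy (non-empty) strings.
# In Python 3 the result is an iterator, so wrap it in list() to index it.
cleaned_line = list(filter(bool, pieces))

print(pieces)
print(cleaned_line)  # ['5000.0', '0.52272E-07']
```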

2015-01-10 00:01:31 alfasin

You can then store the results in a data structure, iterate over it, and remove the empty entries. @alfasin came up with the best idea, which is to use filter.
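
Putting the pieces from this thread together, a minimal Python 3 sketch might look like the following. The function name read_columns and its n_header parameter are hypothetical, the sample text is reconstructed from the output shown in the question, and converting the values to float is an added assumption about how they will be used:

```python
import io

# Stand-in for the real file, rebuilt from the output in the question.
SAMPLE = (
    " Cu Density=8.96 Thickness=100. microns\n"
    " Photon Energy (eV), Transmission\n"
    "    5000.0      0.52272E-07\n"
    "    5250.0      0.42227E-06\n"
    "    5500.0      0.24383E-05\n"
)


def read_columns(lines, n_header=2):
    """Return (energy, transmission) as two lists of floats."""
    energy, transmission = [], []
    for line_num, line in enumerate(lines, start=1):
        if line_num <= n_header:
            continue  # skip the header lines
        # Drop the empty strings produced by repeated spaces.
        words = list(filter(bool, line.strip().split(' ')))
        if len(words) < 2:
            continue  # ignore blank or incomplete lines
        energy.append(float(words[0]))
        transmission.append(float(words[1]))
    return energy, transmission


energy, transmission = read_columns(io.StringIO(SAMPLE))
print(energy)
print(transmission)
```

For the real file, an open file object can be passed in place of the io.StringIO wrapper, since both yield one line per iteration.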

2015-01-10 00:09:07 LukeP
