2017-07-31 30 views
2

How do I parse arrays across different datasets?

I have the following datasets:

File 1:

<Molecular Orbital Primitive Coefficients> 
<MO Number> 
1 
</MO Number> 
4.224609607748e+00 4.085857782359e+00 1.273383604708e+00 -6.802974691818e-03 
9.099528133406e-03 6.867550219273e-03 5.859231188647e-03 3.684441849425e-03 
5.836775773317e-04 -2.316776085880e-16 -1.456850991492e-16 -2.307897076406e-17 
4.140895678156e-03 2.603906355541e-03 4.125025757803e-04 -1.739011495381e-03 
-1.681896173898e-03 -5.241735641835e-04 -1.739011375813e-03 -1.681896058258e-03 
-5.241735281434e-04 
<MO Number> 
2 
</MO Number> 
-9.785273892788e-01 -9.463889258321e-01 -2.949481372149e-01 -1.974411643609e-01 
2.640935048539e-01 1.993153249903e-01 2.392564397119e-01 1.504508715968e-01 
2.383394930083e-02 8.865383702284e-16 5.574791243465e-16 8.831407252698e-17 
1.690897356483e-01 1.063281646128e-01 1.684417017817e-02 4.608108515392e-02 
4.456761845182e-02 1.388977974599e-02 4.608108208174e-02 4.456761548054e-02 
1.388977881997e-02 
</Molecular Orbital Primitive Coefficients> 

File 2:

<Molecular Orbital Primitive Coefficients> 
<MO Number> 
1 
</MO Number> 
3.299451113326e-02 6.087754902119e-02 9.880244651376e-02 1.066781206974e-01 
6.773109582562e-02 1.104778461514e-02 -2.156994392623e-02 3.071021124268e-17 
1.072251279194e-16 -1.396334606969e-02 -2.002731618626e-16 -9.993341885751e-17 
<MO Number> 
2 
</MO Number> 
-2.009498358678e-04 -3.707687449719e-04 -6.017466156746e-04 -9.474065009358e-02 
3.917924760214e-01 -1.299844008310e-01 1.579980866207e-01 -2.827902468319e-15 
1.152587596877e-15 -2.310895197449e-01 2.213502483059e-15 -1.048685827923e-15 
<MO Number> 
3 
</MO Number> 
-1.763944008217e-17 -3.254619757728e-17 -5.282150804455e-17 -3.109320915001e-16 
-9.633800372448e-16 -1.118676262789e-17 -1.336368133403e-15 -1.286598202313e+00 
-1.412088253954e+00 2.299271905206e-15 1.305465570574e+00 1.432795875849e+00 
3.494418486873e-16 -1.710573251253e-01 -1.877416268172e-01 -7.134748738863e-16 
</Molecular Orbital Primitive Coefficients> 

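The size of the arrays and the number of arrays vary between files of this kind (for example, some files may have 70 arrays, and therefore 70 MO numbers, while others have 10). I am trying to write a function that parses the data between the MO Number headers into arrays. This is what I have so far: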

import numpy as np

v = {}  # parsed arrays, keyed by name

def function3(start, end):
    """Read MO information."""
    config_found = False
    var = []
    for line in f:  # f is the file handle opened below
        if line.strip() == end:
            config_found = False
        elif config_found:
            i = line.rstrip()
            var.append(i)
        elif line.strip() == start:
            config_found = True
    var1 = [elem.strip() for elem in var]
    var2 = var1[1:-1]  # drop the </MO Number> and <MO Number> tag lines
    var3 = np.array([line.split() for line in var2])
    var3 = np.asarray([list(map(float, item)) for item in var3])
    return var3

# start/end markers have to be written out by hand for every MO block
m = {'start1':'1','end1':'2',
     'start2':'2','end2':''}
with open(filename, 'r') as f:
    v['monumber1']=function3(m['start1'],m['end1'])
    v['monumber2']=function3(m['start2'],m['end2'])

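The problem is that for some files I would have to set these variables 70 times! On top of that, the start and end variables for the final array do not work for every file. Is there a different way to approach this?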

Thanks!

+2

There are regular expressions and numpy... –

+1

Adding the 'regex' and 'numpy' tags could help! –

+1

Does your data source suggest a standard way of reading this? The use of <> suggests an XML model, but only loosely. – hpaulj
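To illustrate why the XML resemblance is only loose, here is a minimal sketch that checks whether one of the headers parses as XML. The helper name `is_well_formed_xml` is hypothetical; the check relies on the fact that XML tag names cannot contain spaces, so a standard parser should reject `<MO Number>` and a custom reader is needed.

```python
import xml.etree.ElementTree as ET

# One header copied from the files above
snippet = "<MO Number>\n1\n</MO Number>"

def is_well_formed_xml(text):
    """Return True if `text` parses as XML, False otherwise (hypothetical helper)."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

print(is_well_formed_xml(snippet))
```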

Answers

1

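Based on Vinicius's comment, I tried a regular expression; please see whether it helps. Using the read() method is generally not recommended, but since there is not much data in this example, I use it here.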

import re

with open(filename, 'r') as fh:
    # The optional leading sign keeps negative coefficients negative
    x = re.findall(r'[-+]?\d\.\d+e[-+]\d+', fh.read())

out = list(map(float, x))  # list() so the result can be printed in Python 3

Hope this helps. Going by your comments, the above worked for me. The output for File 2 is as follows:

[0.03299451113326, 0.06087754902119, 0.09880244651376, 0.1066781206974, 0.06773109582562, 0.01104778461514, -0.02156994392623, 3.071021124268e-17, 1.072251279194e-16, -0.01396334606969, -2.002731618626e-16, -9.993341885751e-17, -0.0002009498358678, -0.0003707687449719, -0.0006017466156746, -0.09474065009358, 0.3917924760214, -0.129984400831, 0.1579980866207, -2.827902468319e-15, 1.152587596877e-15, -0.2310895197449, 2.213502483059e-15, -1.048685827923e-15, -1.763944008217e-17, -3.254619757728e-17, -5.282150804455e-17, -3.109320915001e-16, -9.633800372448e-16, -1.118676262789e-17, -1.336368133403e-15, -1.286598202313, -1.412088253954, 2.299271905206e-15, 1.305465570574, 1.432795875849, 3.494418486873e-16, -0.1710573251253, -0.1877416268172, -7.134748738863e-16]
+0

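Not quite... there is other data in the file, so I need to use the headers to parse each array specifically into a separate variable. But thank you all the same! – pennypeat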
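One way to meet that requirement without writing start and end markers by hand is a small line-by-line reader that uses the tags themselves. The following is a minimal sketch, not a tested solution for every file: the function name `parse_mo_blocks` is hypothetical, the sample text is a shortened excerpt of File 2 with an extra unrelated line added, and each block is flattened to a 1-D array because the last row of a block can be shorter than the others.

```python
import numpy as np

def parse_mo_blocks(lines):
    """Return {mo_number: 1-D float array} for every <MO Number> block."""
    section = "Molecular Orbital Primitive Coefficients"
    blocks = {}
    in_section = False  # inside the coefficients section
    in_header = False   # between <MO Number> and </MO Number>
    current = None      # MO number whose values are being collected
    for raw in lines:
        line = raw.strip()
        if line == f"<{section}>":
            in_section = True
        elif line == f"</{section}>":
            in_section = False
            current = None
        elif not in_section or not line:
            continue  # skip other data in the file and blank lines
        elif line == "<MO Number>":
            in_header = True
        elif line == "</MO Number>":
            in_header = False
        elif in_header:
            current = int(line)
            blocks[current] = []
        elif current is not None:
            blocks[current].extend(float(tok) for tok in line.split())
    return {num: np.array(vals) for num, vals in blocks.items()}

# Shortened excerpt of File 2, for illustration only
sample = """\
other data that should be ignored
<Molecular Orbital Primitive Coefficients>
<MO Number>
1
</MO Number>
3.299451113326e-02 6.087754902119e-02 9.880244651376e-02 1.066781206974e-01
6.773109582562e-02
<MO Number>
2
</MO Number>
-2.009498358678e-04 -3.707687449719e-04
</Molecular Orbital Primitive Coefficients>
"""

mo = parse_mo_blocks(sample.splitlines())
print(sorted(mo))   # MO numbers found
print(mo[1].shape)  # number of coefficients in MO 1
```

For a real file, the open file handle can be passed directly, for example `with open(filename) as f: mo = parse_mo_blocks(f)`. The dictionary grows with however many MO numbers the file contains, so nothing has to be declared 70 times.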

+0

A regex can be used for that as well; I will try to come up with the right one. Is there a better header than 1, 2..70 that I can use? You're welcome –

+0

Yes, you can! – pennypeat
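A regex keyed on the headers could look like the sketch below. It assumes every block is introduced by `<MO Number>`, `</MO Number>` and ends either at the next `<MO Number>` or at the closing `</Molecular Orbital Primitive Coefficients>` tag, as in the two sample files; the names `MO_BLOCK` and `parse_mo_blocks_regex` are hypothetical, and the sample text is a shortened excerpt of File 2.

```python
import re
import numpy as np

MO_BLOCK = re.compile(
    r"<MO Number>\s*(\d+)\s*</MO Number>"  # header, capturing the MO number
    r"(.*?)"                               # the values that follow (lazy match)
    r"(?=<MO Number>|</Molecular Orbital Primitive Coefficients>)",
    re.DOTALL,
)

def parse_mo_blocks_regex(text):
    """Return {mo_number: 1-D float array} using the <MO Number> headers."""
    return {
        int(num): np.array(body.split(), dtype=float)
        for num, body in MO_BLOCK.findall(text)
    }

# Shortened excerpt of File 2, for illustration only
text = """\
<Molecular Orbital Primitive Coefficients>
<MO Number>
1
</MO Number>
3.299451113326e-02 6.087754902119e-02
<MO Number>
2
</MO Number>
-2.009498358678e-04 -3.707687449719e-04 -6.017466156746e-04
</Molecular Orbital Primitive Coefficients>
"""

mo = parse_mo_blocks_regex(text)
print(sorted(mo))  # MO numbers found
print(mo[2])       # coefficients of MO 2, signs preserved
```

Because the values are split on whitespace and converted by NumPy, signs and exponents are handled by the float conversion rather than by the pattern. This version reads the whole file into memory, which should be fine for files of the size shown here.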