2015-02-11 63 views
2

Splitting data in a text file using Python and pandas

I have the following data from a CFD simulation:

Average value for X = 0.5080000265E-0003 to 0.2489200234E-0001   
    Z = -.3141592741E+0001  
    Time = 0.7000032425E+0001  
     Y    P_g  
    0.1511904760E-0002 0.2565604063E+0006 
    0.4535714164E-0002 0.2565349844E+0006 
    0.7559523918E-0002 0.2565098906E+0006 
    0.1058333274E-0001 0.2564848125E+0006 
    0.1360714249E-0001 0.2564597656E+0006 
    0.1663095318E-0001 0.2564346563E+0006 
    0.1965476200E-0001 0.2564095625E+0006 
     ...     ... 
     ...     ... 
    0.1259419441E+0001 0.2549983125E+0006 
    0.1262443304E+0001 0.2549983125E+0006 
    0.1265467167E+0001 0.2549983125E+0006 
    0.1268491030E+0001 0.2549982656E+0006 
    Time = 0.7010014057E+0001  
     Y    P_g  
    0.1511904760E-0002 0.2565604063E+0006 
    0.4535714164E-0002 0.2565349844E+0006 
    0.7559523918E-0002 0.2565098906E+0006 
    0.1058333274E-0001 0.2564848125E+0006 
     ...     ... 
     ...     ... 
    0.1259419441E+0001 0.2549983125E+0006 
    0.1262443304E+0001 0.2549983125E+0006 
    0.1265467167E+0001 0.2549983125E+0006 
    0.1268491030E+0001 0.2549982656E+0006 
    Time = 0.7020006657E+0001  
     Y    P_g  
    0.1511904760E-0002 0.2565604063E+0006 
    0.1058333274E-0001 0.2564848125E+0006 
     ...     ... 

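As you can see from the example above, the data is split into several vertical sections, each marked by a time-step header labeled Time. Within each section Y does not change, but P_g does. To plot the data, I need the P_g values from each section placed side by side in successive columns. For example, this is how I need to restructure the data: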

 Y    0.7000032425E+1  0.7020006657E+1  ... 
    0.1511904760E-0002 0.2565604063E+0006 0.2549982656E+0006 ... 
    0.4535714164E-0002 0.2565349844E+0006 0.2549982656E+0006 ... 
    0.7559523918E-0002 0.2565098906E+0006 0.2549982656E+0006 ... 
    0.1058333274E-0001 0.2564848125E+0006 0.2549982656E+0006 ... 
    0.1360714249E-0001 0.2564597656E+0006 0.2549982656E+0006 ... 

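Using pandas, I can read the data from the text file and create a new data frame with the Y values as the index (rows) and the Time values as the columns: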

import pandas as pd 

# Read in data from text file 
# ------------------------------------------------------------------------- 

# data frame from text file contents, skip first 4 rows, separate by variable 
# white space, no header 
df = pd.read_table('ROP_s_SD.dat', skiprows=4, sep=r'\s+', header=None) 

# Time data 
# ------------------------------------------------------------------------- 

# data frame of the rows that contain the Time string 
dftime = df.loc[df.iloc[:, 0].str.contains('Time')] 

t = dftime[2].tolist() # time list 
idx = dftime.index  # index of rows containing Time string 

# Y data 
# ------------------------------------------------------------------------- 

# grab values for y to create index for new data frame 
ido = idx[0]+2  # index of first y value 
idf = idx[1]  # index of last y value 
y = []    # empty list to store y values 

for i in range(ido, idf): # iterate through first section of y values 
    v = df.iloc[i, 0]   # get y value from data frame 
    y.append(float(v))  # add y value to y list 

# New data frame 
# ------------------------------------------------------------------------ 

# empty data frame with y as index and t as columns 
dfnew = pd.DataFrame(None, index=y, columns=t) 
print('dfnew is \n', dfnew.head()) 

The head of the empty data frame, dfnew.head(), looks like this:

  7.000032 7.010014 7.020007 7.030043 7.040020 7.050035 7.060043 
0.001512  NaN  NaN  NaN  NaN  NaN  NaN  NaN 
0.004536  NaN  NaN  NaN  NaN  NaN  NaN  NaN 
0.007560  NaN  NaN  NaN  NaN  NaN  NaN  NaN 
0.010583  NaN  NaN  NaN  NaN  NaN  NaN  NaN 
0.013607  NaN  NaN  NaN  NaN  NaN  NaN  NaN 

     7.070004 7.080036 7.090022 ... 7.650011 7.660032 7.670026 
0.001512  NaN  NaN  NaN ...   NaN  NaN  NaN 
0.004536  NaN  NaN  NaN ...   NaN  NaN  NaN 
0.007560  NaN  NaN  NaN ...   NaN  NaN  NaN 
0.010583  NaN  NaN  NaN ...   NaN  NaN  NaN 
0.013607  NaN  NaN  NaN ...   NaN  NaN  NaN 

     7.680044 7.690029 7.700008 7.710012 7.720014 7.730019 7.740026 
0.001512  NaN  NaN  NaN  NaN  NaN  NaN  NaN 
0.004536  NaN  NaN  NaN  NaN  NaN  NaN  NaN 
0.007560  NaN  NaN  NaN  NaN  NaN  NaN  NaN 
0.010583  NaN  NaN  NaN  NaN  NaN  NaN  NaN 
0.013607  NaN  NaN  NaN  NaN  NaN  NaN  NaN 

[5 rows x 75 columns] 

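The NaN entries in each column should contain the P_g values from that particular Time section. How can I add each section's P_g values to their respective columns?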

The text file I am reading can be downloaded here.

Answers

1

It looks like you have already done most of the hard work... The following lines finish unpacking your data frame:

# Add one more element to idx for correct indexing on the last column 
idx = list(idx) 
idx.append(len(df)) 

# Loop over the idx locations to fill the columns 
for i in range(len(dfnew.columns)): 
    dfnew.iloc[:, i] = df.iloc[idx[i]+2:idx[i+1], 1].values 

The head of dfnew now looks something like this for the first 3 columns:

    7.000032   7.010014   7.020007 
0.001512 0.2565604063E+0006 0.2565604063E+0006 0.2565604063E+0006 
0.004536 0.2565349844E+0006 0.2565349844E+0006 0.2565349844E+0006 
0.007560 0.2565098906E+0006 0.2565098906E+0006 0.2565098906E+0006 
0.010583 0.2564848125E+0006 0.2564848125E+0006 0.2564848125E+0006 
0.013607 0.2564597656E+0006 0.2564597656E+0006 0.2564597656E+0006 

You have a lot of elements, so the best way to view the data is probably in 2D:

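import matplotlib.pyplot as plt 

# 2D array of P_g values: rows follow Y, columns follow Time 
data = dfnew.astype(float).values 
# axis limits for the image: [t_min, t_max, y_min, y_max] 
extent = [float(dfnew.columns[0]), 
          float(dfnew.columns[-1]), 
          float(dfnew.index[0]), 
          float(dfnew.index[-1])] 
plt.imshow(data, extent=extent, origin='lower') 
plt.xlabel('Time') 
plt.ylabel('Y') 
plt.show() 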
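
If individual lines are still wanted, one option is to plot only a small subset of the Y rows rather than all 420. The following is a minimal sketch under stated assumptions: the data frame here is a synthetic stand-in for dfnew (the values are invented for illustration), the step of 100 rows is arbitrary, and the output file name `pg_vs_time.png` is hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for dfnew: Y values as index, times as columns.
# These numbers are made up; the real values come from the data file.
y = np.linspace(0.0015, 1.2685, 420)
t = [7.00, 7.01, 7.02, 7.03]
values = 2.55e5 + 1.5e3 * (1 - y[:, None]) + 10.0 * np.arange(len(t))
dfnew = pd.DataFrame(values, index=y, columns=t)

# Keep every 100th Y row so the figure stays readable.
subset = dfnew.astype(float).iloc[::100]

fig, ax = plt.subplots()
for y_val, row in subset.iterrows():
    # x axis: time, y axis: P_g, one line per selected Y value
    ax.plot(subset.columns.astype(float), row.values, label="Y = %.4f" % y_val)
ax.set_xlabel("Time")
ax.set_ylabel("P_g")
ax.legend()
fig.savefig("pg_vs_time.png")
```

For interactive use, the `matplotlib.use("Agg")` line can be dropped and `fig.savefig(...)` replaced with `plt.show()`.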

By the way, it looks like the P_g values in your sample file are the same at every time step...
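
One way to confirm that observation is to measure how much each row varies across the time columns. This is only a sketch: the two-row frame below is a hypothetical miniature of dfnew built from values shown in the output above, and the names `spread` and `constant_in_time` are introduced here for illustration.

```python
import pandas as pd

# Hypothetical miniature of dfnew, holding strings as parsed from the file
dfnew = pd.DataFrame(
    {7.000032: ["0.2565604063E+0006", "0.2565349844E+0006"],
     7.010014: ["0.2565604063E+0006", "0.2565349844E+0006"]},
    index=[0.001512, 0.004536],
)

values = dfnew.astype(float)

# Range of each row across all time columns; zero means no change in time
spread = values.max(axis=1) - values.min(axis=1)
constant_in_time = bool((spread == 0).all())
print(constant_in_time)
```

Rows with a nonzero `spread` are the Y locations where the pressure actually changes between time steps.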

+0

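This works great! Thanks. If you have time, an example that plots each row as a line would be helpful. The x-axis should be the time t and the y-axis should be the pressure P_g. – wigging 2015-02-12 17:48:39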

+0

Do you really want 420 separate lines? That might not be the best way to look at it... – Ajean 2015-02-12 19:29:16

+0

@Gavin I added some plotting code. 420 individual lines would get nasty, so I did it in 2D. – Ajean 2015-02-12 19:57:50

0

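Two things. First, maybe you could think about how to reduce this to a 2D spreadsheet. What columns should each row contain? I suggest that each row should contain Time, Y, and P_g. That may suggest a strategy for handling your funky input format.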

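Second, why are there Y values if you are trying to plot P_g vs. Time? Your data appears to have 3 variables, so you need to reduce it to 2 dimensions in order to create a 2D plot. Do you want to plot the mean P_g for each particular Time value? Or do you want a 3D plot in which you plot Y vs. P_g for each value of Time? Assuming you adopt the row/column structure suggested above, either can be done easily with pandas. Check out the pandas groupby functionality. Here's more detail on that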
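
As a rough illustration of the groupby idea, the sketch below assumes the data has already been arranged with one row per (Time, Y) pair. The numbers are invented for the example, and `mean_by_time` and `series_by_y` are names introduced here.

```python
import pandas as pd

# Hypothetical long-format data: one row per (Time, Y) pair
df = pd.DataFrame({
    "Time": [7.00, 7.00, 7.01, 7.01],
    "Y":    [0.0015, 0.0045, 0.0015, 0.0045],
    "P_g":  [256560.0, 256535.0, 256562.0, 256537.0],
})

# Mean P_g at each time step, averaging over all Y values
mean_by_time = df.groupby("Time")["P_g"].mean()

# One P_g-versus-Time series for each distinct Y value
series_by_y = {y: g.set_index("Time")["P_g"] for y, g in df.groupby("Y")}
```

Each entry of `series_by_y` can then be plotted on its own, with Time on the x-axis and P_g on the y-axis.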

Edit: you have clarified both of my questions. Try this:

import pandas
from io import StringIO

text = open('ROP_s_SD.dat', 'r').read()
chunks = text.split("Time = ")
# ignore first chunk (the file header before the first Time section)
chunks = chunks[1:]

frames = []
for chunk in chunks:
    # first line of the chunk is the time value, the rest is the Y/P_g table
    time_str, rest_str = chunk.split('\n', 1)
    time = float(time_str)
    # the "Y P_g" line at the top of rest_str becomes the column header
    chunk_df = pandas.read_csv(StringIO(rest_str), sep=r'\s+')
    chunk_df['Time'] = time
    frames.append(chunk_df)

# main dataframe: all sections stacked into columns 'Time', 'Y', 'P_g'
df = pandas.concat(frames, ignore_index=True)
assert sorted(df.columns) == ['P_g', 'Time', 'Y']

# iterate over unique values of time
times = sorted(set(df['Time']))
assert len(times) == len(chunks)
for i, time in enumerate(times):
    chunk_data = df[df['Time'] == time]
    # plot or do whatever you'd like with each segment
    means = chunk_data.mean()
    stds = chunk_data.std()
    print('Data for time %d (%0.4f): ' % (i, time))
    print(means, stds)
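
Once the data is in the Time, Y, P_g row layout, the wide table asked for in the question (Y as rows, one column per time step) can be recovered by pivoting. This is a sketch on made-up values, not the contents of the real file, and the name `wide` is introduced here.

```python
import pandas as pd

# Hypothetical long-format frame with the same three columns as df above
df = pd.DataFrame({
    "Time": [7.00, 7.00, 7.01, 7.01],
    "Y":    [0.0015, 0.0045, 0.0015, 0.0045],
    "P_g":  [256560.0, 256535.0, 256562.0, 256537.0],
})

# Rows indexed by Y, one column per Time, cells filled with P_g
wide = df.pivot(index="Y", columns="Time", values="P_g")
print(wide)
```

Each row of `wide` is then the P_g history at one Y location, which is the shape needed for plotting P_g against Time.
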
+0

The x-axis is 'Time' and the y-axis is 'P_g'. Each plot is for a particular 'Y' value. – wigging 2015-02-11 18:33:05

+0

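In that case, I think my suggestion will work. Find a way to arrange the data so that each row has 'Time', 'Y', and 'P_g'. Then you can: 1. get the unique values of the 'Y' column, and 2. for each unique 'Y' value, select the matching subset of the data and plot 'Time' vs. 'P_g' – sharshofski 2015-02-11 18:44:32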

+0

That is what I want to do, and it is why I asked the question. I just don't know how to do it in Python. – wigging 2015-02-11 18:54:41