Slow parsing of a fixed-width file with alternating lines into a pandas DataFrame

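I wrote a function to parse this wind file (wind.txt ~1MB) into a pandas DataFrame, but it is slow because of the nastiness of the file format (according to my colleague). The linked file is only a subset of a larger file with hourly wind data from 1900 to 2016. Here is a snippet of the file: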

2000 1 1 CCB Wdir 5 11 15 14 14 14 14 16 15 15 15 15 13 12 16 16 15 15 15 15 15 14 14 14 
2000 1 1 CCB Wspd 10 8 6 8 7 7 8 8 6 8 9 7 16 16 7 10 12 14 15 17 18 22 22 20 
2000 1 2 CCB Wdir 14 14 14 14 14 16 16 16 16 15 15 16 17 17 16 17 16 16 16 15 15 15 15 16 
2000 1 2 CCB Wspd 17 16 15 17 15 15 16 14 14 15 17 16 15 13 14 15 15 21 20 20 18 25 23 21 
2000 1 3 CCB Wdir 15 15 15 16 15 16 16 16 16 16 16 20 18 22 28 27 26 31 32 32 33 33 35 33 
2000 1 3 CCB Wspd 20 22 22 18 20 21 21 22 18 16 14 13 15 6 3 7 8 8 13 13 15 10 6 7

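The columns are year, month, day, site name, variable name, hour 00, hour 01, hour 02, ..., hour 23. Wind direction and wind speed appear on alternating lines for each day, and the 24 hourly measurements for a single day are all on the same line.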
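To make the layout concrete, here is a minimal sketch that turns one pair of lines from the snippet into 24 hourly records. The helper name `split_line` and the variable names are made up for illustration; they are not part of the parser discussed in this thread.

```python
from datetime import datetime, timedelta

# Two lines copied from the snippet above: one day of direction and speed.
dir_line = "2000 1 1 CCB Wdir 5 11 15 14 14 14 14 16 15 15 15 15 13 12 16 16 15 15 15 15 15 14 14 14"
spd_line = "2000 1 1 CCB Wspd 10 8 6 8 7 7 8 8 6 8 9 7 16 16 7 10 12 14 15 17 18 22 22 20"

def split_line(line):
    """Split one line into (date, site, variable name, 24 hourly integers)."""
    parts = line.split()
    year, month, day = (int(p) for p in parts[:3])
    site, var = parts[3], parts[4]
    values = [int(p) for p in parts[5:]]
    return datetime(year, month, day), site, var, values

day, site, var, wdir = split_line(dir_line)
_, _, _, wspd = split_line(spd_line)

# One (timestamp, direction, speed) record per hour of the day.
records = [(day + timedelta(hours=h), wdir[h], wspd[h]) for h in range(24)]
print(records[0])
print(len(records))
```

Each day therefore contributes 24 rows to the target frame, with the hour implied by the column position rather than stored in the file.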

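What I am trying to do is read the contents of this file into a single pandas DataFrame with a datetime index (hourly frequency) and two columns (wdir and wspd). My parser is as follows: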

import pandas as pd 
from datetime import timedelta 

fil = 'D:\\wind.txt' 
lines = open(fil, 'r').readlines() 
nl = len(lines) 

wdir = lines[0:nl:2] 
wspd = lines[1:nl:2] 

first = wdir[0].split() 
start = pd.datetime(int(first[0]), int(first[1]), int(first[2]), 0) 
last = wdir[-1].split() 
end = pd.datetime(int(last[0]), int(last[1]), int(last[2]), 23) 
drange = pd.date_range(start, end, freq='H') 

wind = pd.DataFrame(pd.np.nan, index=drange, columns=['wdir','wspd']) 

idate = start 

for d in range(nl // 2):
    dirStr = wdir[d].split()
    spdStr = wspd[d].split()
    for h in range(24):
        # -9 marks a missing value; leave NaN in that case
        if dirStr[h+5] != '-9' and spdStr[h+5] != '-9':
            wind.wdir[idate] = int(dirStr[h+5]) * 10
            wind.wspd[idate] = int(spdStr[h+5])
        idate += timedelta(hours=1)
        # progress report once per year
        if idate.month == 1 and idate.day == 1 and idate.hour == 1:
            print(idate)

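Right now it takes about 2.5 seconds to parse a single year, which I think is pretty good, but my colleague thinks it should be possible to parse the entire data file in a few seconds. Is he right? Am I wasting precious time writing slow, clunky parsers?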

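I work on a huge legacy FORTRAN77 model, and I have a few dozen similar parsers in Python to parse/create/modify its various input/output files. If I can save time on each one, I would like to know how. Many thanks!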

Source

2017-05-03 Taylor

If your code already works, you might be better off posting this on CodeReview - SO is more for problems in code. – asongtoruin

I would use the pd.read_fwf(...) or pd.read_csv(..., delim_whitespace=True) method - they are designed for parsing files like this...

Demo:

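import pandas as pd

# Five identifier columns followed by one column per hour ('00'..'23').
cols = ['year', 'month', 'day', 'site', 'var'] + ['{:02d}'.format(i) for i in range(24)]

fn = r'C:\Temp\.data\43763897.txt'

# -9 marks a missing measurement in the file, so read it as NaN.
# (Recent pandas releases deprecate delim_whitespace=True in favour of sep=r'\s+'.)
df = pd.read_csv(fn, names=cols, delim_whitespace=True, na_values=['-9'])

# Wide to long: one row per (day, variable, hour).
x = pd.melt(df,
            id_vars=['year', 'month', 'day', 'site', 'var'],
            value_vars=df.columns[5:].tolist(),
            var_name='hour')
x['date'] = pd.to_datetime(x[['year', 'month', 'day', 'hour']], errors='coerce')

# Long to wide again: Wdir and Wspd become columns, keyed by timestamp and site.
x = (x.drop(columns=['year', 'month', 'day', 'hour'])
      .pivot_table(index=['date', 'site'], columns='var', values='value')
      .reset_index())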

Result:

In [12]: x
Out[12]:
var                   date site  Wdir  Wspd
0      2000-01-01 00:00:00  CCB   5.0  10.0
1      2000-01-01 01:00:00  CCB  11.0   8.0
2      2000-01-01 02:00:00  CCB  15.0   6.0
3      2000-01-01 03:00:00  CCB  14.0   8.0
4      2000-01-01 04:00:00  CCB  14.0   7.0
5      2000-01-01 05:00:00  CCB  14.0   7.0
6      2000-01-01 06:00:00  CCB  14.0   8.0
7      2000-01-01 07:00:00  CCB  16.0   8.0
8      2000-01-01 08:00:00  CCB  15.0   6.0
9      2000-01-01 09:00:00  CCB  15.0   8.0
...                    ...  ...   ...   ...
149030 2016-12-31 14:00:00  CCB   0.0   0.0
149031 2016-12-31 15:00:00  CCB   1.0   5.0
149032 2016-12-31 16:00:00  CCB  33.0   8.0
149033 2016-12-31 17:00:00  CCB  34.0   9.0
149034 2016-12-31 18:00:00  CCB  35.0   7.0
149035 2016-12-31 19:00:00  CCB   0.0   0.0
149036 2016-12-31 20:00:00  CCB  12.0   8.0
149037 2016-12-31 21:00:00  CCB  13.0   7.0
149038 2016-12-31 22:00:00  CCB  15.0   7.0
149039 2016-12-31 23:00:00  CCB  17.0   7.0

[149040 rows x 4 columns]

Timing with the wind.txt file:

In [10]: %%timeit 
    ...: cols = ['year', 'month', 'day', 'site', 'var'] + ['{:02d}'.format(i) for i in range(24)] 
    ...: fn = r'D:\download\wind.txt' 
    ...: df = pd.read_csv(fn, names=cols, delim_whitespace=True, na_values=['-9']) 
    ...: x = pd.melt(df, 
    ...:    id_vars=['year','month','day','site','var'], 
    ...:    value_vars=df.columns[5:].tolist(), 
    ...:    var_name='hour') 
    ...: x['date'] = pd.to_datetime(x[['year','month','day','hour']], errors='coerce') 
    ...: x = (x.drop(['year','month','day','hour'], 1) 
    ...:  .pivot_table(index=['date','site'], columns='var', values='value') 
    ...:  .reset_index()) 
    ...: 
1 loop, best of 3: 812 ms per loop
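A further variation, offered only as a sketch and not part of the timed code above: since every line holds exactly 24 hourly values, the two variables can be flattened with NumPy and the hourly index built by broadcasting, which skips the melt/pivot step. It assumes that every day has exactly one Wdir and one Wspd line and that both appear in the same day order. The names `sample`, `hours`, and `wind` are illustrative, the input is two days from the question's snippet, and the multiplication by 10 follows the question's own parser. No timing is claimed for it here.

```python
import io

import numpy as np
import pandas as pd

# Two days copied from the question's snippet (stand-in for the real file).
sample = """\
2000 1 1 CCB Wdir 5 11 15 14 14 14 14 16 15 15 15 15 13 12 16 16 15 15 15 15 15 14 14 14
2000 1 1 CCB Wspd 10 8 6 8 7 7 8 8 6 8 9 7 16 16 7 10 12 14 15 17 18 22 22 20
2000 1 2 CCB Wdir 14 14 14 14 14 16 16 16 16 15 15 16 17 17 16 17 16 16 16 15 15 15 15 16
2000 1 2 CCB Wspd 17 16 15 17 15 15 16 14 14 15 17 16 15 13 14 15 15 21 20 20 18 25 23 21
"""

hours = ['{:02d}'.format(i) for i in range(24)]
cols = ['year', 'month', 'day', 'site', 'var'] + hours
df = pd.read_csv(io.StringIO(sample), names=cols, sep=r'\s+', na_values=['-9'])

# Separate the alternating lines by variable name instead of by row parity.
wdir = df[df['var'] == 'Wdir']
wspd = df[df['var'] == 'Wspd']

# Broadcast each day (n x 1) against the 24 hour offsets to get n x 24 timestamps.
days = pd.to_datetime(wdir[['year', 'month', 'day']]).to_numpy()
offsets = pd.to_timedelta(np.arange(24), unit='h').to_numpy()
stamps = (days[:, None] + offsets).ravel()

wind = pd.DataFrame(
    {
        # Row-major ravel keeps hours 00..23 of a day together, matching `stamps`.
        'wdir': wdir[hours].to_numpy(dtype=float).ravel() * 10,
        'wspd': wspd[hours].to_numpy(dtype=float).ravel(),
    },
    index=pd.DatetimeIndex(stamps, name='date'),
)
print(wind.head())
```

The result has the shape the question asks for: an hourly datetime index with wdir and wspd columns, and NaN wherever the file contains -9.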

Source

2017-05-03 15:11:35 MaxU

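I have tried using pd.read_fwf() for this and other applications. The problems I ran into are that (1) the alternating direction/speed lines are really hard to deal with, (2) splitting each hourly measurement column into its own uniquely indexed row is very difficult, and (3) since there is no actual hour field, automatically parsing the dates is nearly impossible. While it may be possible to use pd.read_fwf() for this input file, it is much slower and far more complicated than the process described above... unless I'm doing something wrong, which is definitely a possibility. – Taylor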

@Taylor, do you need the "site name" column in the resulting DF? – MaxU

Using MaxU's code and your wind.txt, I get 298080 rows in 3.5s. Just add errors='coerce' to pd.to_datetime. –
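For readers unsure what errors='coerce' changes, here is a small sketch on made-up values (the frame `parts` is hypothetical and not taken from wind.txt): when pd.to_datetime assembles dates from year/month/day/hour columns, a row that does not form a valid date becomes NaT instead of raising an exception.

```python
import pandas as pd

# Hypothetical date parts; the second row (30 February) is not a real date.
parts = pd.DataFrame({
    'year': [2000, 2000],
    'month': [2, 2],
    'day': [29, 30],
    'hour': [0, 0],
})

# With errors='coerce', the invalid row is converted to NaT rather than raising.
dates = pd.to_datetime(parts, errors='coerce')
print(dates)
```

Rows that end up as NaT can then be dropped or inspected separately, for example with `dates.isna()`.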
