2016-07-25 151 views

Parsing data

I have data in a file a.dat that looks like this:

01/Jul/2016 00:05:09  8438.2 
01/Jul/2016 00:05:19  8422.4 g 

I want to parse it into three columns: a timestamp, a float, and a string (empty or g).

I have tried:

df=pd.read_csv('a.dat',sep='  | ',engine='python') 

which ends up with 4 columns: date, time, float and g. I also tried:

df=pd.read_csv('a.dat',sep='  | (g)',engine='python') 

which gives 5 columns, with columns 1 and 4 filled with NaN.

Is there a better way to create the DataFrame without any post-processing?
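The column counts from the two attempts can be reproduced with Python's re.split. This is only a sketch: it assumes the regex separator is applied to each stripped line roughly the way re.split would apply it, and that the sample lines are spaced exactly as shown. The helper name split_line is hypothetical.

```python
import re

lines = ['01/Jul/2016 00:05:09  8438.2 ',
         '01/Jul/2016 00:05:19  8422.4 g ']


def split_line(pattern, line):
    """Split one stripped line on a regex separator (hypothetical helper)."""
    return re.split(pattern, line.strip())


# First attempt: two spaces OR one space, so date and time are split apart.
first = [split_line('  | ', line) for line in lines]

# Second attempt: the capture group (g) is returned as an extra field,
# and it is None whenever the other alternative (two spaces) matched.
second = [split_line('  | (g)', line) for line in lines]

for row in first + second:
    print(len(row), row)
```

The second pattern never splits on the single space between date and time, and the capture group contributes a field of its own at every split, which is where the extra, mostly empty columns come from.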

Answer


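You can use read_csv with a whitespace separator, explicit column names, and parse_dates to merge the date and time columns: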

import pandas as pd
import io

temp = u'''01/Jul/2016 00:05:09  8438.2 
01/Jul/2016 00:05:19  8422.4 g'''
# after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp),
                 sep=r'\s+',  # raw string: one or more whitespace characters
                 names=['date','time','float','string'],
                 parse_dates=[['date','time']])  # merge date and time into one column
print(df)
            date_time   float string
0 2016-07-01 00:05:09  8438.2    NaN
1 2016-07-01 00:05:19  8422.4      g

Or, equivalently, with delim_whitespace:

import pandas as pd
import io

temp = u'''01/Jul/2016 00:05:09  8438.2 
01/Jul/2016 00:05:19  8422.4 g'''
# after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp),
                 delim_whitespace=True,  # same effect as sep=r'\s+'; deprecated since pandas 2.2
                 names=['date','time','float','string'],
                 parse_dates=[['date','time']])
print(df)
            date_time   float string
0 2016-07-01 00:05:09  8438.2    NaN
1 2016-07-01 00:05:19  8422.4      g
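On pandas 2.2 and later, both delim_whitespace and the nested-list form of parse_dates are deprecated, so the read_csv calls above emit warnings there. A minimal sketch for newer versions follows. It assumes the same sample data and column names, reads the date and time as plain strings, and combines them with pd.to_datetime using an explicit format; the variable name stamp is introduced here for illustration only.

```python
import io

import pandas as pd

temp = u'''01/Jul/2016 00:05:09  8438.2 
01/Jul/2016 00:05:19  8422.4 g'''

# Read four whitespace-separated columns; rows without a trailing
# flag get NaN in the 'string' column.
df = pd.read_csv(io.StringIO(temp),
                 sep=r'\s+',
                 header=None,
                 names=['date', 'time', 'float', 'string'])

# Combine date and time explicitly; %b is the abbreviated month name
# (assumes an English/C locale for "Jul").
stamp = pd.to_datetime(df['date'] + ' ' + df['time'],
                       format='%d/%b/%Y %H:%M:%S')
df = df.drop(columns=['date', 'time'])
df.insert(0, 'date_time', stamp)

print(df)
```

This does add one step after reading, but it keeps the datetime format explicit rather than inferred.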

A solution with read_fwf, which infers the fixed-width column boundaries from the data:

import pandas as pd
import io

temp = u'''01/Jul/2016 00:05:09  8438.2 
01/Jul/2016 00:05:19  8422.4 g'''
# after testing, replace io.StringIO(temp) with the filename
df = pd.read_fwf(io.StringIO(temp),
                 names=['date','time','float','string'],
                 parse_dates=[['date','time']])
print(df)
            date_time   float string
0 2016-07-01 00:05:09  8438.2    NaN
1 2016-07-01 00:05:19  8422.4      g

You can also specify the column widths explicitly with the widths parameter, one width per column:

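# the parameter is named widths (fwidths is not a read_fwf argument);
# these four widths assume the sample lines are spaced exactly as shown
df = pd.read_fwf(io.StringIO(temp),
                 widths=[11,9,8,2],
                 names=['date','time','float','string'],
                 parse_dates=[['date','time']])
print(df)
            date_time   float string
0 2016-07-01 00:05:09  8438.2    NaN
1 2016-07-01 00:05:19  8422.4      g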
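To check a list of widths before passing it to read_fwf, it can help to see which character range each width covers. The sketch below uses plain Python and the hypothetical widths [11, 9, 8, 2], chosen on the assumption that the sample lines are spaced exactly as shown here; real files may be aligned differently. The helper name split_fixed is also hypothetical.

```python
from itertools import accumulate

lines = ['01/Jul/2016 00:05:09  8438.2 ',
         '01/Jul/2016 00:05:19  8422.4 g']

widths = [11, 9, 8, 2]  # assumed widths: date, time, float, flag

# Turn widths into half-open (start, end) character ranges.
ends = list(accumulate(widths))
starts = [0] + ends[:-1]
colspecs = list(zip(starts, ends))


def split_fixed(line):
    """Slice one line at the fixed column boundaries and trim padding."""
    return [line[a:b].strip() for a, b in colspecs]


rows = [split_fixed(line) for line in lines]
print(colspecs)
for row in rows:
    print(row)
```

The same (start, end) pairs could be passed to read_fwf through its colspecs parameter instead of widths.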