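Pandas apply to DataFrame groups
If I groupby (the g object below) and then apply the function below to the first 1000 rows of df, it works. But if I apply it to the whole DF, I get this exception: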
def calc_load(x):
    x.sort('log_timestamp')
    x['time_stddev'] = x['time'].std()
    x['time_mean'] = x['time'].mean()
    return x

c=g.apply(calc_load)
---------------------------------------------------------------------------
........
ValueError Traceback (most recent call last)
<ipython-input-262-f2fe1f013907> in <module>()
----> 1 c=g.apply(calc_load)
2215 tuple(map(int, [tot_items] + list(block_shape))),
-> 2216 tuple(map(int, [len(ax) for ax in axes]))))
2217
2218
ValueError: Shape of passed values is (10, 3943482), indices imply (10, 410450)
What is the cause here, and how can I fix it?
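For reference, the same per-group statistics can be computed without `apply`. The following is only a sketch under assumptions: the data is a small made-up frame, the column names follow the question, and it uses the current pandas API (`sort_values`, `transform`) rather than the older `sort` call shown above. Whether it avoids the exception on the real table has not been tested.

```python
import pandas as pd

# Hypothetical stand-in data; only the column names come from the question.
df = pd.DataFrame({
    "host": ["h1", "h1", "h2", "h2"],
    "operation": ["get", "get", "put", "put"],
    "log_timestamp": [2, 1, 4, 3],
    "time": [1.0, 3.0, 2.0, 6.0],
})

# sort_values returns a new frame, so the result has to be kept.
df = df.sort_values("log_timestamp")

g = df.groupby(["host", "operation"])["time"]
# transform broadcasts each group's statistic back onto that group's rows.
df["time_stddev"] = g.transform("std")
df["time_mean"] = g.transform("mean")

print(df)
```

Because `transform` aligns its result with the original rows, no per-group frames have to be stitched back together.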
UPDATE:
I read this table from an HDF5 store like this:
prob2
Out[374]:
<class 'pandas.io.pytables.HDFStore'>
File path: /tmp/test2.h5
/mytable frame_table (typ->appendable,nrows->410450,ncols->8,indexers->[index])
a=prob2.mytable
a
Out[376]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 410450 entries, 0 to 9999
Data columns (total 8 columns):
args 410450 non-null values
host 410450 non-null values
kwargs 410450 non-null values
log_timestamp 410450 non-null values
operation 410450 non-null values
slot 410450 non-null values
status 410450 non-null values
time 410450 non-null values
dtypes: float64(1), int64(2), object(5)
If I do a round trip through CSV as below, the exception does not occur:
a.to_csv('/tmp/test2.csv')
b=pd.read_csv('/tmp/test2.csv')
b
Out[379]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 410450 entries, 0 to 410449
Data columns (total 9 columns):
Unnamed: 0 410450 non-null values
args 410450 non-null values
host 410450 non-null values
kwargs 410450 non-null values
log_timestamp 410450 non-null values
operation 410450 non-null values
slot 410450 non-null values
status 410450 non-null values
time 410450 non-null values
dtypes: float64(1), int64(3), object(5)
bg = b.groupby(['host','operation'])
bg.apply(calc_load)
Out[381]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 410450 entries, 0 to 410449
Data columns (total 11 columns):
Unnamed: 0 410450 non-null values
args 410450 non-null values
host 410450 non-null values
kwargs 410450 non-null values
log_timestamp 410450 non-null values
operation 410450 non-null values
slot 410450 non-null values
status 410450 non-null values
time 410450 non-null values
time_stddev 410371 non-null values
time_mean 410450 non-null values
dtypes: float64(3), int64(3), object(5)
The DataFrames before the round trip (a) and after the round trip (b) look similar, but they are not exactly the same!
a
Out[386]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 410450 entries, 0 to 9999
Data columns (total 8 columns):
args 410450 non-null values
host 410450 non-null values
kwargs 410450 non-null values
log_timestamp 410450 non-null values
operation 410450 non-null values
slot 410450 non-null values
status 410450 non-null values
time 410450 non-null values
dtypes: float64(1), int64(2), object(5)
b
Out[387]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 410450 entries, 0 to 410449
Data columns (total 9 columns):
Unnamed: 0 410450 non-null values
args 410450 non-null values
host 410450 non-null values
kwargs 410450 non-null values
log_timestamp 410450 non-null values
operation 410450 non-null values
slot 410450 non-null values
status 410450 non-null values
time 410450 non-null values
dtypes: float64(1), int64(3), object(5)
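One difference visible in the two summaries is the index: `a` reports 410450 entries running "0 to 9999", which suggests its labels may repeat, while `b` runs 0 to 410449. A minimal sketch of how one might check for repeated labels and replace them, assuming a small made-up frame rather than the real table:

```python
import pandas as pd

# Hypothetical frame whose labels repeat, as if chunks had been appended
# with each chunk numbered from 0.
a = pd.DataFrame({"time": [0.1, 0.2, 0.3, 0.4]}, index=[0, 1, 0, 1])

print(a.index.is_unique)            # False when any label repeats
print(a.index.duplicated().sum())   # how many labels are repeats

# Replace the labels with a fresh 0..n-1 range, discarding the old ones.
a_unique = a.reset_index(drop=True)
print(a_unique.index.is_unique)
```

If the check on the real frame shows repeats, resetting the index before the groupby would be one thing to try. This is a guess based on the output above, not a confirmed diagnosis.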
Uh, what is going on here?
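You need to provide a working example, and perhaps share your frame via Dropbox (or create an example that demonstrates the error) – Jeff
@Jeff, it's in the UPDATE. And thanks a million for the help! – LetMeSOThat4U
Can you do `df.head()` so the values can be seen? It looks like you have string-like columns marked as object dtype. Object dtypes should only be used for string-like data. You may need to do some conversion (even before putting it into HDF5). Where does the data come from? – Jeff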