2014-05-08 351 views

Speeding up pandas aggregation

I am trying to count the number of duplicate rows in a pandas DataFrame. I read in a CSV file that looks like this:

feature, IV, IT 
early/J_result/N, True, False 
early/J_result/N, True, False 
early/J_result/N, True, False 
excellent/J_result/N, True, True 
hillsdown/N, True, False 
hillsdown/N, True, False 

For the example input above, the desired output is:

feature, IV, IT, count 
early/J_result/N, True, False, 3 
excellent/J_result/N, True, True, 1 
hillsdown/N, True, False, 2 
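This collapse-duplicates-and-count transformation can be done in a single pass with `groupby(...).size()`; a minimal sketch on an inline copy of the sample data above (the `io.StringIO` wrapper just stands in for the real CSV file):

```python
import io
import pandas as pd

# Inline copy of the sample data from the question.
csv_text = """feature,IV,IT
early/J_result/N,True,False
early/J_result/N,True,False
early/J_result/N,True,False
excellent/J_result/N,True,True
hillsdown/N,True,False
hillsdown/N,True,False
"""

df = pd.read_csv(io.StringIO(csv_text))
# size() counts rows per group in one pass, so no separate
# drop_duplicates / set_index / assignment steps are needed.
result = df.groupby(['feature', 'IV', 'IT']).size().reset_index(name='count')
print(result)
```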

My current code is:

import pandas as pd 
def sum_up_token_counts(hdf_file): 
    df = pd.read_csv(hdf_file, sep=', ') 
    counts = df.groupby('feature').count().feature 
    assert counts.sum() == df.shape[0] # no missing rows 
    df = df.drop_duplicates() 
    df.set_index('feature', inplace=True) 
    df['count'] = counts 
    return df 

This works as expected, but it takes a long time. I profiled it, and it looks like nearly all the time is spent in the groupby and the count.

Total time: 4.43439 s 

Line #  Hits      Time    Per Hit  % Time  Line Contents 
============================================================== 
    28 
    29      1     57567    57567.0     1.3  df = pd.read_csv(hdf_file, sep=', ') 
    30      1   4368529  4368529.0    98.5  counts = df.groupby('feature').count().feature 
    31      1       174      174.0     0.0  assert counts.sum() == df.shape[0] # no missing rows 
    32      1      6234     6234.0     0.1  df = df.drop_duplicates() 
    33      1       501      501.0     0.0  df.set_index('feature', inplace=True) 
    34      1      1377     1377.0     0.0  df['count'] = counts 
    35      1         1        1.0     0.0  return df 
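The table above looks like line_profiler output; a stdlib-only alternative for locating a hot spot like this is cProfile. A sketch on synthetic data (the original CSV is not available, so the feature values here are made up):

```python
import cProfile
import io
import pstats
import numpy as np
import pandas as pd

# Synthetic stand-in for the original CSV: many repeated features.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'feature': rng.choice([f'f{i}' for i in range(1000)], size=100_000),
    'IV': True,
    'IT': False,
})

profiler = cProfile.Profile()
profiler.enable()
counts = df.groupby('feature').size()
profiler.disable()

# Print the five most expensive calls by cumulative time.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats('cumulative').print_stats(5)
print(stream.getvalue())
```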

Any ideas how to speed up this code?

Answers


Using master/0.14 (to be released shortly), count is greatly sped up; see here.

Here is a benchmark of 0.14/master vs 0.13.1:

Setup

In [1]: n = 10000 

In [2]: offsets = np.random.randint(n, size=n).astype('timedelta64[ns]') 

In [3]: dates = np.datetime64('now') + offsets 

In [4]: dates[np.random.rand(n) > 0.5] = np.datetime64('nat') 

In [5]: offsets[np.random.rand(n) > 0.5] = np.timedelta64('nat') 

In [6]: value2 = np.random.randn(n) 

In [7]: value2[np.random.rand(n) > 0.5] = np.nan 

In [8]: obj = pd.util.testing.choice(['a', 'b'], size=n).astype(object) 

In [9]: obj[np.random.randn(n) > 0.5] = np.nan 

In [10]: df = DataFrame({'key1': np.random.randint(0, 500, size=n), 
    ....:     'key2': np.random.randint(0, 100, size=n), 
    ....:     'dates': dates, 
    ....:     'value2' : value2, 
    ....:     'value3' : np.random.randn(n), 
    ....:     'obj': obj, 
    ....:     'offsets': offsets}) 

v0.13.1

In [11]: %timeit df.groupby(['key1', 'key2']).count() 
1 loops, best of 3: 5.41 s per loop 

v0.14.0

In [11]: %timeit df.groupby(['key1', 'key2']).count() 
100 loops, best of 3: 6.25 ms per loop 
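The same measurement can be taken outside IPython with the stdlib timeit module; a sketch on a simplified version of the setup above (absolute numbers will vary by machine and pandas version):

```python
import timeit
import numpy as np
import pandas as pd

n = 10000
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'key1': rng.integers(0, 500, size=n),
    'key2': rng.integers(0, 100, size=n),
    'value': rng.standard_normal(n),
})

# Average over 10 runs, analogous to %timeit's per-loop figure.
per_loop = timeit.timeit(lambda: df.groupby(['key1', 'key2']).count(),
                         number=10) / 10
print(f'{per_loop:.6f} s per loop')
```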

That definitely made a difference, thanks! – mbatchkarov


Yes, 'count' was happening in Python-land; I implemented this because I was counting 700k groups and it took several minutes to complete. Glad this helps you! –