Speed up pandas aggregation

I am trying to count the number of duplicate rows in a pandas DataFrame. I read a CSV file that looks like this:
feature, IV, IT
early/J_result/N, True, False
early/J_result/N, True, False
early/J_result/N, True, False
excellent/J_result/N, True, True
hillsdown/N, True, False
hillsdown/N, True, False
The desired output for the example input above is:
feature, IV, IT, count
early/J_result/N, True, False, 3
excellent/J_result/N, True, True, 1
hillsdown/N, True, False, 2
My current code is:
import pandas as pd
def sum_up_token_counts(hdf_file):
    df = pd.read_csv(hdf_file, sep=', ')
counts = df.groupby('feature').count().feature
assert counts.sum() == df.shape[0] # no missing rows
df = df.drop_duplicates()
df.set_index('feature', inplace=True)
df['count'] = counts
return df
This works as expected, but it takes a long time. I profiled it, and it looks like almost all the time is spent in the groupby and count:
Total time: 4.43439 s
Line # Hits Time Per Hit % Time Line Contents
==============================================================
28
29 1 57567 57567.0 1.3 df = pd.read_csv(hdf_file, sep=', ')
30 1 4368529 4368529.0 98.5 counts = df.groupby('feature').count().feature
31 1 174 174.0 0.0 assert counts.sum() == df.shape[0] # no missing rows
32 1 6234 6234.0 0.1 df = df.drop_duplicates()
33 1 501 501.0 0.0 df.set_index('feature', inplace=True)
34 1 1377 1377.0 0.0 df['count'] = counts
35 1 1 1.0 0.0 return df
Any ideas on how to speed up this code?
That definitely makes a difference, thanks! – mbatchkarov
Yes, `count` happens in Python-land. I figured this out because I was counting 700k groups and it took several minutes to finish. Glad this helps! –
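For reference, the slow per-group `.count()` call can usually be replaced with `groupby(...).size()`, which counts rows per group in vectorized code rather than in Python-land. A minimal sketch of that variant, using an in-memory `StringIO` buffer with the question's sample data in place of the real CSV file:

```python
import io
import pandas as pd

# Sample data matching the CSV layout from the question.
csv_text = """feature, IV, IT
early/J_result/N, True, False
early/J_result/N, True, False
early/J_result/N, True, False
excellent/J_result/N, True, True
hillsdown/N, True, False
hillsdown/N, True, False
"""

def sum_up_token_counts(buf):
    # engine='python' because the multi-character separator ', '
    # is not supported by the fast C parser.
    df = pd.read_csv(buf, sep=', ', engine='python')
    # size() returns the number of rows per group; unlike count(),
    # it does not scan every column in a Python-level loop.
    counts = df.groupby('feature', sort=False).size()
    assert counts.sum() == df.shape[0]  # no missing rows
    df = df.drop_duplicates().set_index('feature')
    df['count'] = counts  # aligns on the 'feature' index
    return df

result = sum_up_token_counts(io.StringIO(csv_text))
print(result)
```

This produces the same `count` column as the original code; only the aggregation path changes.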