Fast way to count occurrences of all values in a pandas DataFrame

Suppose I have the following data:
import pandas as pd
import numpy as np
import random
from string import ascii_uppercase
random.seed(100)
n = 1000000
# Create a bunch of factor data... throw some NaNs in there for good measure
data = {letter: [random.choice(list(ascii_uppercase) + [np.nan]) for _ in range(n)] for letter in ascii_uppercase}
df = pd.DataFrame(data)
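For inspection, the same construction works on a scaled-down frame (a sketch with a hypothetical n=10 instead of 1,000,000; everything else is as above):

```python
import random
from string import ascii_uppercase

import numpy as np
import pandas as pd

random.seed(100)
n = 10  # hypothetical small size, just to make the frame easy to look at
data = {letter: [random.choice(list(ascii_uppercase) + [np.nan]) for _ in range(n)]
        for letter in ascii_uppercase}
small = pd.DataFrame(data)

# 10 rows x 26 columns; every cell is a single uppercase letter or NaN
print(small.shape)
```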
I want to quickly compute the global counts of every value occurring anywhere in the DataFrame.
This works:
from collections import Counter
c = Counter([v for c in df for v in df[c].fillna(-999)])
But it's quite slow:
%timeit Counter([v for c in df for v in df[c].fillna(-999)])
1 loop, best of 3: 4.12 s per loop
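On a tiny hypothetical frame (the column names and values below are made up for illustration), the same Counter expression produces the expected global tallies, with NaNs mapped to -999 by the fillna call:

```python
from collections import Counter

import numpy as np
import pandas as pd

# Hypothetical 3x2 frame: 'A' appears three times, 'B' and 'C' once each,
# and the single NaN is replaced by the -999 sentinel before counting.
tiny = pd.DataFrame({'x': ['A', 'B', np.nan], 'y': ['A', 'C', 'A']})
c = Counter([v for col in tiny for v in tiny[col].fillna(-999)])
print(c)
```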
I figured this could be sped up with some pandas horsepower:
def quick_global_count(df, na_value=-999):
    df = df.fillna(na_value)
    # Get counts of each element for each column in the passed DataFrame
    group_bys = {c: df.groupby(c).size() for c in df}
    # Stack the Series objects in `group_bys`... this is faster than reducing
    # a bunch of dictionaries by key
    stacked = pd.concat([v for k, v in group_bys.items()])
    # Call `reset_index()` to access the index column, which holds the factor
    # level for each column, then groupby and sum on that index to get global counts
    global_counts = stacked.reset_index().groupby('index').sum()
    return global_counts
That's definitely faster (about 75% of the previous approach's runtime), but there must be something faster still...
%timeit quick_global_count(df)
10 loops, best of 3: 3.01 s per loop
The results of the two approaches are identical (after slightly massaging the result returned by quick_global_count):
dict(c) == quick_global_count(df).to_dict()[0]
True
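One direction that might be worth benchmarking (a sketch on a hypothetical toy frame, not timed here): flatten the underlying array once and do a single value_counts pass, instead of one groupby per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; in the real problem this would be the 1,000,000-row df.
toy = pd.DataFrame({'a': ['A', 'B', np.nan], 'b': ['A', 'A', 'C']})

# Ravel the 2-D values into one 1-D array, then count everything in a single
# vectorized pass; dropna=False keeps NaN as its own bucket.
counts = pd.Series(toy.values.ravel()).value_counts(dropna=False)
print(counts)
```

An np.unique(arr, return_counts=True) pass over the raveled values is a similar single-pass idea, though it does not handle NaN buckets as conveniently.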
Is there a faster way of counting global occurrences of values in a DataFrame?
So, the data is always single-character uppercase letters or NaN? – Divakar
Yes, let's assume that for this exercise. If your approach would differ significantly between the data above and whatever data you have in mind, it might be worth showing both examples. – blacksite