Computing the "concentration" of a pandas categorical

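I have a problem with an old function that computes the concentration of a pandas categorical column. Something seems to have changed, and it is no longer possible to subset the result of the .value_counts() method of a categorical Series.

Minimal non-working example: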

import pandas as pd 
import numpy as np 

df = pd.DataFrame({"A":["a","b","c","a"]}) 

def get_concentration(df,cat): 
    tmp = df[cat].astype("category") 
    counts = tmp.value_counts() 
    obs = len(tmp) 
    all_cons = [] 
    for key in counts.keys(): 
     single = np.square(np.divide(float(counts[key]),float(obs))) 
     all_cons.append(single) 
     return np.sum(all_cons) 

get_concentration(df, "A")

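This raises a KeyError on counts["a"]. I am pretty sure this worked in past versions of pandas, and the documentation does not seem to mention any change to the .value_counts() method.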

2016-02-02 Matthias

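I added a simplified, vectorized version that also does not need the 'categorical' dtype. – Stefan

Among other issues, the return statement should be outside the for loop. – Alexander

Since you are iterating in a loop (rather than working in a vectorized fashion), you might as well iterate over the key-value pairs explicitly. It simplifies the syntax, IMHO: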

import pandas as pd 
import numpy as np 

df = pd.DataFrame({"A":["a","b","c","a"]}) 

def get_concentration(df,cat): 
    tmp = df[cat].astype("category") 
    counts = tmp.value_counts() 
    obs = len(tmp) 
    all_cons = [] 
    # See change in following line - you're anyway iterating 
    # over key-value pairs; why not do so explicitly? 
    for k, v in counts.to_dict().items():
        single = np.square(np.divide(float(v), float(obs)))
        all_cons.append(single)
    # Return only after every category has been processed.
    return np.sum(all_cons)

>>> get_concentration(df, "A")
0.375
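
Because the loop only needs (category, count) pairs, the result can be cross-checked without any pandas indexing. The following is a minimal sketch that swaps in collections.Counter from the standard library; the name get_concentration_plain is hypothetical and is not part of the original answer.

```python
from collections import Counter

def get_concentration_plain(values):
    # Count how often each category occurs.
    counts = Counter(values)
    obs = len(values)
    # Sum of squared shares: each count divided by the number of observations.
    return sum((n / obs) ** 2 for n in counts.values())

result = get_concentration_plain(["a", "b", "c", "a"])
print(result)
```

For the four-row example this gives (2/4)^2 + (1/4)^2 + (1/4)^2, which is a convenient reference value when debugging the pandas version.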

2016-02-02 15:29:13

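This seems like the most natural solution! Thanks :) – Matthias

@Matthias You're very welcome. However, I would suggest treating loops in numerical code as a "red flag". –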

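To fix the current function, just use .loc (see below) to access the values by index label. You would probably be better off with a vectorized function, so I added one at the end.

df = pd.DataFrame({"A":["a","b","c","a"]})

def get_concentration(df, cat):
    tmp = df[cat].astype('category')
    counts = tmp.value_counts()
    obs = len(tmp)
    all_cons = []
    for key in counts.index:
        # Look up each count by its index label.
        single = np.square(np.divide(float(counts.loc[key]), float(obs)))
        all_cons.append(single)
    # Return after the loop so every category contributes.
    return np.sum(all_cons)

Output:

get_concentration(df, "A")

0.375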

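You might want to try the vectorized version, which also does not necessarily need the category dtype:

def get_concentration(df, cat):
    counts = df[cat].value_counts()
    # Divide by the number of observations, not the number of distinct categories.
    return counts.div(len(df[cat])).pow(2).sum()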
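
The share-then-square idea can also be sketched by letting pandas compute the shares directly with value_counts(normalize=True). The function name get_concentration_normalized is hypothetical. Note one assumption: with the default dropna=True, missing values are excluded both from the counts and from the denominator.

```python
import pandas as pd

def get_concentration_normalized(df, cat):
    # normalize=True returns each category's share of the non-null observations.
    shares = df[cat].value_counts(normalize=True)
    # Sum of squared shares.
    return float((shares ** 2).sum())

df = pd.DataFrame({"A": ["a", "b", "c", "a"]})
result = get_concentration_normalized(df, "A")
print(result)
```

If missing values should count toward the denominator, dividing the raw counts by len(df[cat]) is the safer choice.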

2016-02-02 15:26:54 Stefan

Let's agree on the methodology:

>>> df.A.value_counts() 
a 2 
b 1 
c 1 

>>> obs = len(df['A'].astype('category'))
>>> obs 
4

The concentration (per the Herfindahl Index) should then be:

>>> (2/4.) ** 2 + (1/4.) ** 2 + (1/4.) ** 2 
0.375

which is equivalent to (pandas 0.17+):

>>> ((df.A.value_counts()/df.A.count()) ** 2).sum() 
0.375
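
In symbols, the quantity computed above can be written as follows. The notation is an assumption introduced here for illustration: n_i is the count of category i, N the total number of observations, k the number of categories, and s_i the share of category i.

```latex
H = \sum_{i=1}^{k} s_i^{2}, \qquad s_i = \frac{n_i}{N}, \qquad N = \sum_{i=1}^{k} n_i

% Worked for the example (n_a = 2, n_b = 1, n_c = 1, N = 4):
H = \left(\tfrac{2}{4}\right)^{2} + \left(\tfrac{1}{4}\right)^{2} + \left(\tfrac{1}{4}\right)^{2}
  = 0.25 + 0.0625 + 0.0625 = 0.375
```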

If you really want a function:

def concentration(df, col): 
    return ((df[col].value_counts()/df[col].count()) ** 2).sum() 

>>> concentration(df, 'A') 
0.375
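
As a quick sanity check on the formula, the index should be 1 when a single category holds every observation and 1/k when k categories are equally sized. The sketch below repeats the concentration function so that it runs on its own; the three test frames are made up for illustration.

```python
import pandas as pd

def concentration(df, col):
    # Sum of squared category shares (Herfindahl index).
    return ((df[col].value_counts() / df[col].count()) ** 2).sum()

single = pd.DataFrame({"A": ["a", "a", "a", "a"]})   # one category only
uniform = pd.DataFrame({"A": ["a", "b", "c", "d"]})  # four equal categories
mixed = pd.DataFrame({"A": ["a", "b", "c", "a"]})    # the example from the question

h_single = concentration(single, "A")
h_uniform = concentration(uniform, "A")
h_mixed = concentration(mixed, "A")
print(h_single, h_uniform, h_mixed)
```

The three values should come out as 1.0, 0.25 and 0.375, with the question's example falling between the two extremes.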

2016-02-02 16:20:12 Alexander