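Create new columns based on distinct values and count them

Sorry if the title is not clear enough. Let me explain what I am trying to achieve.

I have this DataFrame; let's call it df.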

id | Area 
A one 
A two 
A one 
B one 
B one 
C one 
C two 
D one 
D one 
D two 
D three
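
For anyone who wants to reproduce this without test_data.csv, the table above can be rebuilt by hand. This is only a sketch: the values are copied from the table, and nothing else is assumed about the real file.

```python
import pandas as pd

# Rebuild the example frame from the table above (values copied by hand).
df = pd.DataFrame({
    "id":   ["A", "A", "A", "B", "B", "C", "C", "D", "D", "D", "D"],
    "Area": ["one", "two", "one", "one", "one", "one", "two",
             "one", "one", "two", "three"],
})

print(df)
```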

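I want to create a new DataFrame based on the values in the existing one. First, I want to count the total number of entries for each distinct id in df. For example, id A has 3 entries, B has 2 entries, and so on. Then I want to build a new DataFrame from those counts.

Let's call the new DataFrame df_new: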

id | count 
A 3 
B 2 
C 2 
D 4

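Next, I want to create new columns based on the values in df['Area']. In this example, df['Area'] contains 3 distinct values (one, two, three). I want to count how many times each id appears in each area. For example, id A has been in area one twice, in area two once, and in area three zero times. I would then put these values into new columns named one, two and three.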

df_new：

id | count | one | two | three 
A 3  2  1  0 
B 2  2  0  0 
C 2  1  1  0 
D 4  2  1  1

I have written my own code that produces df_new, but I believe pandas has better functions for this kind of data extraction. Here is my code.

import pandas as pd

# Read the data
df = pd.read_csv('test_data.csv', sep=',')
df.columns = ['id', 'Area']  # Rename
# Count the total number of Area entries per id
df_new = pd.DataFrame({'count': df.groupby("id")["Area"].count()})
# Reset index
df_new = df_new.reset_index()
# Loop over the rows of df, counting and creating a new column for each area in df['Area']
for i in range(len(df)):
    # Get the id
    idx = df['id'][i]
    # Get the area name
    area_name = str(df["Area"][i])
    # Retrieve the index of this id in df_new
    current_index = df_new.loc[df_new['id'] == idx].index[0]
    # If a column for this area already exists
    if area_name in df_new.columns:
        # Then +1 at the location of the id (index)
        df_new[area_name][current_index] += 1
    # If it does not exist in the columns yet
    else:
        # Create an empty one filled with zeros
        df_new[area_name] = 0
        # Then +1 at the location of the id (index)
        df_new[area_name][current_index] += 1

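The code is long and hard to read. It also triggers the warning "A value is trying to be set on a copy of a slice from a DataFrame". I would like to learn how to write this more efficiently.

Thanks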

Source

2017-08-22 Niche.P

You can use df.groupby(...).count() for the first part and pd.crosstab for the second. Then use pd.concat to join them:

In [1246]: pd.concat([df.groupby('id').count().rename(columns={'Area': 'count'}),
                      pd.crosstab(df.id, df.Area)], axis=1)
Out[1246]:
    count  one  three  two
id
A       3    2      0    1
B       2    2      0    0
C       2    1      0    1
D       4    2      1    1
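
Note that crosstab sorts the area columns alphabetically (one, three, two). If the column order and the flat id column from the question matter, something like the following should get there. This is a sketch only: the sample frame is rebuilt from the question's table, and area_order is a hypothetical list that spells out the wanted order.

```python
import pandas as pd

# Sample data copied from the question's table.
df = pd.DataFrame({
    "id":   ["A", "A", "A", "B", "B", "C", "C", "D", "D", "D", "D"],
    "Area": ["one", "two", "one", "one", "one", "one", "two",
             "one", "one", "two", "three"],
})

# Hypothetical: the column order the question asks for.
area_order = ["one", "two", "three"]

# Part 1: total entries per id.
counts = df.groupby("id").count().rename(columns={"Area": "count"})
# Part 2: entries per id and area, with columns put in the wanted order.
areas = pd.crosstab(df["id"], df["Area"]).reindex(columns=area_order, fill_value=0)

# Join on the shared id index, then turn id back into a regular column.
df_new = pd.concat([counts, areas], axis=1).reset_index()
print(df_new)
```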

Here is the first part, using df.groupby:

df.groupby('id').count().rename(columns={'Area': 'count'})

    count
id
A       3
B       2
C       2
D       4
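
A closely related option for this step is groupby(...).size(), sketched below on the sample data. The difference, as far as I know, is that count() skips missing values in Area while size() counts every row; here there are no missing values, so both agree.

```python
import pandas as pd

# Sample data copied from the question's table.
df = pd.DataFrame({
    "id":   ["A", "A", "A", "B", "B", "C", "C", "D", "D", "D", "D"],
    "Area": ["one", "two", "one", "one", "one", "one", "two",
             "one", "one", "two", "three"],
})

# size() counts rows per id and returns a Series; name it and make it a frame.
counts = df.groupby("id").size().rename("count").to_frame()
print(counts)
```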

And here is the second part, with pd.crosstab:

pd.crosstab(df.id, df.Area)

Area  one  three  two
id
A       2      0    1
B       2      0    0
C       1      0    1
D       2      1    1
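
If it helps to see what crosstab produces here, roughly the same table can be built by counting (id, Area) pairs and pivoting the areas into columns. This is a sketch on the sample data, not a claim about how crosstab is implemented internally.

```python
import pandas as pd

# Sample data copied from the question's table.
df = pd.DataFrame({
    "id":   ["A", "A", "A", "B", "B", "C", "C", "D", "D", "D", "D"],
    "Area": ["one", "two", "one", "one", "one", "one", "two",
             "one", "one", "two", "three"],
})

# Count rows for each (id, Area) pair, then move Area into the columns.
# fill_value=0 covers pairs that never occur, such as (B, two).
areas = df.groupby(["id", "Area"]).size().unstack(fill_value=0)
print(areas)
```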

For the second part, you can also use pd.get_dummies and take a dot product:

(pd.get_dummies(df.id).T).dot(pd.get_dummies(df.Area))

   one  three  two
A    2      0    1
B    2      0    0
C    1      0    1
D    2      1    1
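
One caveat with the dot-product version: in recent pandas releases (2.0 and later, as far as I know) get_dummies returns boolean columns by default, so the product may come out as True/False rather than counts. Passing dtype=int should keep it numeric. A sketch on the sample data:

```python
import pandas as pd

# Sample data copied from the question's table.
df = pd.DataFrame({
    "id":   ["A", "A", "A", "B", "B", "C", "C", "D", "D", "D", "D"],
    "Area": ["one", "two", "one", "one", "one", "one", "two",
             "one", "one", "two", "three"],
})

# One indicator column per id and per area, forced to integers.
id_dummies = pd.get_dummies(df["id"], dtype=int)
area_dummies = pd.get_dummies(df["Area"], dtype=int)

# (ids x rows) dot (rows x areas) sums the co-occurrences per id and area.
areas = id_dummies.T.dot(area_dummies)
print(areas)
```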

Source

2017-08-22 02:49:09

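Oh wow, that's awesome. Thanks, I will accept your answer in 7 minutes.

One more question: is it possible to use crosstab to produce binary values instead of counts? That is, just 1 if an id has ever been in that area, and 0 if it has never been there?

@Niche.P OK, I see. It would be: 'pd.crosstab(df.id, df.Area).astype(bool).astype(int)'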
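
Put together on the sample data, the binary version from the comment above looks like this. A sketch; the variable name visited is mine.

```python
import pandas as pd

# Sample data copied from the question's table.
df = pd.DataFrame({
    "id":   ["A", "A", "A", "B", "B", "C", "C", "D", "D", "D", "D"],
    "Area": ["one", "two", "one", "one", "one", "one", "two",
             "one", "one", "two", "three"],
})

# Any non-zero count becomes True, then True/False becomes 1/0.
visited = pd.crosstab(df["id"], df["Area"]).astype(bool).astype(int)
print(visited)
```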
