2017-02-02 68 views
2

Pandas: binary encoding of a set of values in a pandas column

I have the following DataFrame my_df:

Name  cards 
------------------ 
John  {A,B} 
Mary  {B,C,A} 
Dan  {D,A} 
Peter  {C,A} 
Ed  {A,C,D} 

I want to binary-encode the set values, i.e. the output I want looks like:

Name  Card_A Card_B Card_C Card_D 
-------------------------------------------- 
John  1   1   0  0 
Mary  1   1   1  0 
Dan  1   0   0  1 
Peter  1   0   1  0 
Ed  1   0   1  1 

Is there an existing Python package for this? Or what is the best way to achieve it? Thanks!

Answers

3

If the cards column contains sets

import pandas as pd

df = pd.DataFrame({'Name':['John','Mary','Dan','Peter','Ed'], 
        'cards':[set(['A','B']), set(['B','C','A']), 
          set(['D','A']), set(['C','A']), set(['A','C','D'])]}) 


df[['Name']].join(
    df.cards.apply(
     # Series.value_counts replaces the deprecated top-level pd.value_counts
     lambda x: pd.Series(list(x)).value_counts() 
    ).fillna(0).astype(int).sort_index(axis=1).add_prefix('Card_') 
) 



If cards is a str

Parse it with str.extractall, then apply value_counts (mainly to show off str.extractall):

# here df.cards holds strings such as '{A,B}', not sets
df[['Name']].join(
    df.cards.str.extractall(r'([^{}, ]+)')[0].groupby(level=0)
      .value_counts().unstack(fill_value=0).add_prefix('Card_') 
) 
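
To make the str.extractall step easier to follow, here is a minimal sketch that prints the intermediate result before counting. The two-row sample frame and the names matches and encoded are made up for illustration; only the regex and the method chain come from the answer above.

```python
import pandas as pd

# Hypothetical sample: the cards column holds plain strings such as "{A,B}".
df = pd.DataFrame({
    "Name": ["John", "Mary"],
    "cards": ["{A,B}", "{B,C,A}"],
})

# One row per regex match; the index is (original row label, match number).
matches = df["cards"].str.extractall(r"([^{}, ]+)")[0]
print(matches)

# Count the matches per original row, then pivot the card labels into columns.
encoded = matches.groupby(level=0).value_counts().unstack(fill_value=0)
print(encoded)
```

Because the first index level of matches is the original row label, the pivoted frame lines up with df and can be joined back to the Name column.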


3

First convert the sets to str and remove the {} with strip.

Then use str.get_dummies and strip the ' from the column names.

Finally, add_prefix.

df = pd.DataFrame({'Name':['John','Mary','Dan','Peter','Ed'], 
        'cards':[set(['A','B']), set(['B','C','A']), 
          set(['D','A']), set(['C','A']), set(['A','C','D'])]}) 

print (df) 
    Name  cards 
0 John  {A, B} 
1 Mary {A, C, B} 
2 Dan  {A, D} 
3 Peter  {A, C} 
4  Ed {A, D, C} 

df.cards = df.cards.astype(str).str.strip('{}') 
df = df.set_index('Name').cards.str.get_dummies(', ') 
df.columns = df.columns.str.strip("'") 
df = df.add_prefix('Card_').reset_index() 

print (df) 
    Name Card_A Card_B Card_C Card_D 
0 John  1  1  0  0 
1 Mary  1  1  1  0 
2 Dan  1  0  0  1 
3 Peter  1  0  1  0 
4  Ed  1  0  1  1 
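
The steps above depend on how a set is printed as a string. A possible variation, sketched here under the assumption that every card label is a string that does not contain the "|" character, is to join each set's elements directly and pass the result to str.get_dummies. The names joined, encoded and result are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["John", "Mary", "Dan", "Peter", "Ed"],
    "cards": [{"A", "B"}, {"B", "C", "A"}, {"D", "A"}, {"C", "A"}, {"A", "C", "D"}],
})

# Assumption: labels are strings without "|", so "|" is safe as a separator.
joined = df["cards"].apply(lambda s: "|".join(sorted(s)))

# str.get_dummies splits on the separator and builds one 0/1 column per label.
encoded = joined.str.get_dummies(sep="|").add_prefix("Card_")

result = pd.concat([df[["Name"]], encoded], axis=1)
print(result)
```

This avoids stripping braces and quotes, at the cost of needing a separator that never appears in a label.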

Another alternative solution:

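def f(category_list): 
    # map every category in the set to 1; categories a row lacks become NaN after apply
    n_categories = len(category_list) 
    return pd.Series(dict(zip(category_list, [1]*n_categories))) 

# df here is the original DataFrame, with cards still holding sets
df1 = (df.set_index('Name').cards 
     .apply(f) 
     .sort_index(axis=1) 
     .add_prefix('Card_') 
     .fillna(0) 
     .astype(int) 
     .reset_index()) 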

print (df1) 
    Name Card_A Card_B Card_C Card_D 
0 John  1  1  0  0 
1 Mary  1  1  1  0 
2 Dan  1  0  0  1 
3 Peter  1  0  1  0 
4  Ed  1  0  1  1
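

For comparison, a different technique that none of the answers above uses: reshape the sets into one row per (Name, card) pair with Series.explode, then build indicator columns with pd.get_dummies. This is only a sketch and assumes pandas 0.25 or later (for explode); the names long and result are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["John", "Mary", "Dan", "Peter", "Ed"],
    "cards": [{"A", "B"}, {"B", "C", "A"}, {"D", "A"}, {"C", "A"}, {"A", "C", "D"}],
})

# Turn each set into a sorted list, then emit one row per card, indexed by Name.
long = df.set_index("Name")["cards"].apply(sorted).explode()

result = (
    pd.get_dummies(long)              # one indicator column per card label
    .groupby(level=0, sort=False)     # collapse back to one row per Name, keeping order
    .max()
    .astype(int)
    .add_prefix("Card_")
    .reset_index()
)
print(result)
```

Note that grouping by Name merges rows that share a name, so this sketch assumes the names are unique.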