2017-08-28 59 views
2

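Python Pandas: How to convert a list of pair mappings to row-vector format?

I have a two-column DataFrame: column 1 holds the customer, and column 2 holds a city that this customer has visited. The DataFrame looks like this: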

print(df) 

    customer visited_city 
0 John  London 
1 Mary  Melbourne 
2 Steve  Paris 
3 John  New_York 
4 Peter  New_York 
5 Mary  London 
6 John  Melbourne 
7 John  New_York 

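I want to convert the DataFrame above into a row-vector format, so that each row represents a unique customer and is an indicator vector of the cities that customer has visited.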

print(wide_format_df) 

      London Melbourne New_York Paris 
John  1.0  1.0  1.0  0.0 
Mary  1.0  1.0  0.0  0.0 
Steve  0.0  0.0  0.0  1.0 
Peter  0.0  0.0  1.0  0.0 

Below is the code I used to generate the wide format. It iterates over the customers one by one. Is there a more efficient way to do this?

import pandas as pd
import numpy as np

# Sorted array of all distinct cities; these become the columns
unique_cities = np.sort(df['visited_city'].unique())
p = len(unique_cities)
unique_customers = df['customer'].unique().tolist()

X = []
for customer in unique_customers:
    x = np.zeros(p)  # indicator vector for this customer
    city_visited = np.sort(df[df['customer'] == customer]['visited_city'].unique())
    # Positions of the visited cities within the sorted city array
    visited_idx = np.searchsorted(unique_cities, city_visited)
    x[visited_idx] = 1
    X.append(x)
wide_format_df = pd.DataFrame(np.array(X), columns=unique_cities, index=unique_customers)
wide_format_df

Answers

3

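Note that your question has been edited, so the answers provided no longer answer your question. They have to be adjusted to return 1 for John and New_York even though he has been there twice.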

Option 1 pir1
I like this answer because I think it is elegant.

pd.get_dummies(df.customer).T.dot(pd.get_dummies(df.visited_city)).clip(0, 1) 

     London Melbourne New_York Paris 
John  1   1   1  0 
Mary  1   1   0  0 
Peter  0   0   1  0 
Steve  0   0   0  1 
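
To see why the `clip(0, 1)` is needed in Option 1, here is a minimal sketch that rebuilds the question's data and inspects the intermediate product. The names `customers`, `cities`, `counts` and `indicators` are mine, and passing `dtype=int` to `get_dummies` is an assumption I added so that the product stays numeric on pandas versions where the dummies default to booleans.

```python
import pandas as pd

# The sample data from the question
df = pd.DataFrame({
    "customer": ["John", "Mary", "Steve", "John", "Peter", "Mary", "John", "John"],
    "visited_city": ["London", "Melbourne", "Paris", "New_York",
                     "New_York", "London", "Melbourne", "New_York"],
})

# One-hot encode each column; both frames share the original row index
customers = pd.get_dummies(df["customer"], dtype=int)   # rows x customers
cities = pd.get_dummies(df["visited_city"], dtype=int)  # rows x cities

# (customers x rows) . (rows x cities) gives visit counts per (customer, city)
counts = customers.T.dot(cities)

# Cap every count at 1 so repeated visits become a plain indicator
indicators = counts.clip(0, 1)

print(counts)
print(indicators)
```

Without the clip, the John / New_York cell holds the number of visits rather than an indicator, which is exactly the case the edited question asks about.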

Option 2 pir2
This answer should be fast.

i, r = pd.factorize(df.customer.values) 
j, c = pd.factorize(df.visited_city.values) 
n, m = r.size, c.size 
b = np.zeros((n, m), dtype=int) 
b[i, j] = 1 

pd.DataFrame(b, r, c).sort_index().sort_index(axis=1)

     London Melbourne New_York Paris 
John  1   1   1  0 
Mary  1   1   0  0 
Peter  0   0   1  0 
Steve  0   0   0  1 
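
For readers unfamiliar with `pd.factorize`, the following sketch walks through what Option 2 computes on the question's data. The input arrays are retyped by hand here, and the variable name `wide` is mine; treat it as an illustration rather than a tuned implementation.

```python
import numpy as np
import pandas as pd

customers = np.array(["John", "Mary", "Steve", "John", "Peter", "Mary", "John", "John"],
                     dtype=object)
cities = np.array(["London", "Melbourne", "Paris", "New_York",
                   "New_York", "London", "Melbourne", "New_York"], dtype=object)

# factorize returns integer codes plus the unique values in order of appearance
i, r = pd.factorize(customers)  # i: row position of each record, r: unique customers
j, c = pd.factorize(cities)     # j: column position of each record, c: unique cities

# Start from all zeros and mark every (customer, city) pair that occurs
b = np.zeros((r.size, c.size), dtype=int)
b[i, j] = 1  # a repeated pair simply writes 1 again, so no clipping is needed

# Label the matrix and sort both axes alphabetically
wide = pd.DataFrame(b, index=r, columns=c).sort_index().sort_index(axis=1)
print(wide)
```

Because the assignment only ever writes the value 1, duplicates such as John's second New_York visit do not need any special handling.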

Option 3 pir3
Practical and pretty fast.

df.groupby(['customer', 'visited_city']).size().unstack(fill_value=0).clip(0, 1) 

visited_city London Melbourne New_York Paris 
customer           
John    1   1   1  0 
Mary    1   1   0  0 
Peter    0   0   1  0 
Steve    0   0   0  1 
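
Option 3 can be read in two steps: count each pair, then pivot the counts into columns. A small sketch on the question's data, with `pair_counts` and `wide` as names I made up for the intermediate and final results:

```python
import pandas as pd

df = pd.DataFrame({
    "customer": ["John", "Mary", "Steve", "John", "Peter", "Mary", "John", "John"],
    "visited_city": ["London", "Melbourne", "Paris", "New_York",
                     "New_York", "London", "Melbourne", "New_York"],
})

# Step 1: number of rows per (customer, city) pair, as a MultiIndex Series
pair_counts = df.groupby(["customer", "visited_city"]).size()

# Step 2: move the city level into columns, fill missing pairs with 0,
# then cap the counts at 1 to get indicators
wide = pair_counts.unstack(fill_value=0).clip(0, 1)

print(pair_counts)
print(wide)
```

The intermediate Series shows a count of 2 for John and New_York, which the final `clip` reduces to 1.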

Timing
Code below

# Multiples of Minimum time
#
           pir1  pir2      pir3       wen       vai
10     1.392237   1.0  1.521555  4.337469  5.569029
30     1.445762   1.0  1.821047  5.977978  7.204843
100    1.679956   1.0  1.901502  6.685429  7.296454
300    1.568407   1.0  1.825047  5.556880  7.210672
1000   1.622137   1.0  1.613983  5.815970  5.396008
3000   1.808637   1.0  1.852953  4.159305  4.224724
10000  1.654354   1.0  1.502092  3.145032  2.950560
30000  1.555574   1.0  1.413612  2.404061  2.299856


wen = lambda d: d.pivot_table(index='customer', columns='visited_city',aggfunc=len, fill_value=0) 
vai = lambda d: pd.crosstab(d.customer, d.visited_city) 
pir1 = lambda d: pd.get_dummies(d.customer).T.dot(pd.get_dummies(d.visited_city)).clip(0, 1) 
pir3 = lambda d: d.groupby(['customer', 'visited_city']).size().unstack(fill_value=0).clip(0, 1) 

def pir2(d): 
    i, r = pd.factorize(d.customer.values) 
    j, c = pd.factorize(d.visited_city.values) 
    n, m = r.size, c.size 
    b = np.zeros((n, m), dtype=int) 
    b[i, j] = 1 

    return pd.DataFrame(b, r, c).sort_index().sort_index(axis=1)

from timeit import timeit

results = pd.DataFrame(
    index=[10, 30, 100, 300, 1000, 3000, 10000, 30000],
    columns='pir1 pir2 pir3 wen vai'.split(),
    dtype=float
)

for i in results.index:
    # Repeat the sample frame i times to get a larger input
    d = pd.concat([df] * i, ignore_index=True)
    for j in results.columns:
        stmt = '{}(d)'.format(j)
        setp = 'from __main__ import d, {}'.format(j)
        results.at[i, j] = timeit(stmt, setp, number=10)

# Express each timing as a multiple of the fastest method for that size
print((lambda r: r.div(r.min(axis=1), axis=0))(results))

results.plot(loglog=True)

I missed this one. Interesting use of clip. +1 – Vaishali


Thanks @Vaishali – piRSquared


Fantastic answer, thanks @piRSquared!!! – cwl

3

You can use crosstab

pd.crosstab(df.customer, df.visited_city) 

You get

visited_city London Melbourne New_York Paris 
customer     
John   1  1   1   0 
Mary   1  1   0   0 
Peter   0  0   1   0 
Steve   0  0   0   1 

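That's a good idea, but the problem is that a given pair can occur multiple times in the original DataFrame, in which case `crosstab` produces counts rather than indicator vectors. – cwl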


Actually, I think `df.drop_duplicates()` can be used to remove the duplicate rows from the original DataFrame, so `crosstab` should be good enough. Thanks @Vaishali! – cwl


cwl, you may want to look at @piRSquared's answer. It handles the duplicates and is more efficient. – Vaishali
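
The `drop_duplicates()` idea from the comments can be sketched as follows, using the question's data. The names `counts`, `deduped` and `indicators` are only for illustration, and this has not been benchmarked against the other options.

```python
import pandas as pd

df = pd.DataFrame({
    "customer": ["John", "Mary", "Steve", "John", "Peter", "Mary", "John", "John"],
    "visited_city": ["London", "Melbourne", "Paris", "New_York",
                     "New_York", "London", "Melbourne", "New_York"],
})

# On the raw data, crosstab counts how often each pair occurs
counts = pd.crosstab(df["customer"], df["visited_city"])

# Dropping repeated (customer, city) rows first leaves at most one row per pair,
# so the resulting counts are 0/1 indicators
deduped = df.drop_duplicates()
indicators = pd.crosstab(deduped["customer"], deduped["visited_city"])

print(counts)
print(indicators)
```

On the raw data the John / New_York cell is 2; after removing duplicate rows it is 1.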

2

You can also use

df.pivot_table(index='customer', columns='visited_city',aggfunc=len, fill_value=0) 

visited_city London Melbourne New_York Paris 
customer           
John    1   1   1  0 
Mary    1   1   0  0 
Peter    0   0   1  0 
Steve    0   0   0  1 