2016-11-02 79 views
3

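Count the unique ID overlap between two strings

I have a dataset with two columns. The first column contains unique user IDs and the second column contains the attributes attached to those IDs.

For example: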

------------------------ 
User ID  Attribute 
------------------------ 
1234  blond 
1235  brunette 
1236  blond 
1234  tall 
1235  tall 
1236  short 
------------------------ 

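What I want to know is how the attributes relate to each other. In the example above, I want to know how many blond people are also tall. My desired output is: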

------------------------------ 
Attr 1  Attr 2  Overlap 
------------------------------ 
blond  tall   1 
blond  short  1 
brunette tall   1 
brunette short  0 
------------------------------ 

I tried pivoting the data with pandas and got an output, but because my dataset has hundreds of attributes, my current attempt is not feasible.

df = pandas.read_csv('myfile.csv')  

df.pivot_table(index='User ID', columns='Attribute', aggfunc=len, fill_value=0)

My current output:

-------------------------------- 
Blond Brunette Short Tall 
-------------------------------- 
    0  1   0  1 
    1  0   0  1 
    1  0   1  0 
-------------------------------- 

Is there a way to get my desired output? Thanks in advance.

+1

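I think your first step should be to put this into a better relational form. Nothing in the data logically separates the attributes into hair color and height attributes – brianpck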

+0

Indeed! I attempted an answer but could not make those distinctions –

Answers

1

You can use itertools product to find every possible pair of attributes, and then match the rows on each pair:

import pandas as pd 
from itertools import product 

# 1) creating pandas dataframe 
df = [ ["1234" , "blond"], 
     ["1235" , "brunette"], 
     ["1236" , "blond" ], 
     ["1234" , "tall"], 
     ["1235" , "tall"], 
     ["1236" , "short"]] 

df = pd.DataFrame(df) 
df.columns = ["id", "attribute"] 

# 2) creating all the possible attribute pairs
attributs = set(df.attribute)
for attribut1, attribut2 in product(attributs, attributs):
    if attribut1 != attribut2:
        # 3) selecting the ids of the rows for each attribute
        df1 = df[df.attribute == attribut1]["id"]
        df2 = df[df.attribute == attribut2]["id"]
        # 4) counting the ids that match both attributes
        intersection = len(set(df1).intersection(set(df2)))
        if intersection:
            # 5) displaying the number of matches
            print(attribut1, attribut2, intersection)

Giving:

tall brunette 1 
tall blond 1 
brunette tall 1 
blond tall 1 
blond short 1 
short blond 1 
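
The output above lists every pair twice, once in each order. A minimal sketch of one way to avoid that, assuming the same sample data: build one set of ids per attribute and iterate over unordered pairs with itertools.combinations. The names ids_by_attribute and overlaps are illustrative and not part of the original answer.

```python
from itertools import combinations

import pandas as pd

df = pd.DataFrame(
    [["1234", "blond"], ["1235", "brunette"], ["1236", "blond"],
     ["1234", "tall"], ["1235", "tall"], ["1236", "short"]],
    columns=["id", "attribute"],
)

# One set of ids per attribute, built once instead of filtering per pair.
# (ids_by_attribute is an illustrative name.)
ids_by_attribute = {
    attribute: set(ids) for attribute, ids in df.groupby("attribute")["id"]
}

# combinations() yields each unordered pair exactly once;
# sorting the attribute names makes the order deterministic.
overlaps = {
    (a, b): len(ids_by_attribute[a] & ids_by_attribute[b])
    for a, b in combinations(sorted(ids_by_attribute), 2)
}

for (a, b), count in overlaps.items():
    if count:
        print(a, b, count)
```

With hundreds of attributes this still visits every pair, but each pair costs only one set intersection rather than two dataframe filters.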

Edit

It is then easy to refine this to get your desired output:

import pandas as pd 
from itertools import product 

# 1) creating pandas dataframe 
df = [ ["1234" , "blond"], 
     ["1235" , "brunette"], 
     ["1236" , "blond" ], 
     ["1234" , "tall"], 
     ["1235" , "tall"], 
     ["1236" , "short"]] 

df = pd.DataFrame(df) 
df.columns = ["id", "attribute"] 

wanted_attribute_1 = ["blond", "brunette"] 

# 2) creating all the possible attribute pairs
attributs = set(df.attribute)
for attribut1, attribut2 in product(attributs, attributs):
    # keep only pairs whose first attribute is wanted and whose second is not
    if attribut1 in wanted_attribute_1 and attribut2 not in wanted_attribute_1:
        # 3) selecting the ids of the rows for each attribute
        df1 = df[df.attribute == attribut1]["id"]
        df2 = df[df.attribute == attribut2]["id"]
        # 4) counting the ids that match both attributes
        intersection = len(set(df1).intersection(set(df2)))
        # 5) displaying the number of matches, including zero
        print(attribut1, attribut2, intersection)

Giving:

brunette tall 1 
brunette short 0 
blond tall 1 
blond short 1 
+0

Thanks. This gives the output I was looking for. How would I export the results to a .csv file? – MARWEBIST

+0

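You should create a [result] dataframe that is empty at the start, and then append [attribut1, attribut2, intersection] to it inside the loop (for append, see: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.append.html). The pandas dataframe provides a [to_csv] method that lets you save it to a file. –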
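
The export described in this comment could look roughly like the sketch below. Note that DataFrame.append was removed in pandas 2.0, so this version collects the rows in a plain list and builds the dataframe once at the end. The file name overlap.csv and the column labels are assumptions chosen to mirror the desired output in the question.

```python
from itertools import product

import pandas as pd

df = pd.DataFrame(
    [["1234", "blond"], ["1235", "brunette"], ["1236", "blond"],
     ["1234", "tall"], ["1235", "tall"], ["1236", "short"]],
    columns=["id", "attribute"],
)

wanted_attribute_1 = ["blond", "brunette"]
attributes = sorted(set(df["attribute"]))  # sorted for a stable row order

rows = []
for attr1, attr2 in product(attributes, attributes):
    if attr1 in wanted_attribute_1 and attr2 not in wanted_attribute_1:
        ids1 = set(df.loc[df["attribute"] == attr1, "id"])
        ids2 = set(df.loc[df["attribute"] == attr2, "id"])
        # Collect one [attribute 1, attribute 2, overlap] row per pair.
        rows.append([attr1, attr2, len(ids1 & ids2)])

# Build the result once; column labels are an assumption.
result = pd.DataFrame(rows, columns=["Attr 1", "Attr 2", "Overlap"])

# "overlap.csv" is a hypothetical output path.
result.to_csv("overlap.csv", index=False)
```

Building the dataframe from a list in one step also avoids the repeated copying that appending row by row would cause.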

1

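Once you have pivoted the table, you can compute the cross product of its transpose with itself, and then convert the upper triangle of the result to long format: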

import pandas as pd 
import numpy as np 
mat = df.pivot_table(index='User ID', columns='Attribute', aggfunc=len, fill_value=0) 

tprod = mat.T.dot(mat)   # calculate the tcrossprod here 
result = tprod.where((np.triu(np.ones(tprod.shape, bool), 1)), np.nan).stack().rename('value') 
           # extract the upper triangular part 
result.index.names = ['Attr1', 'Attr2'] 
result.reset_index().sort_values('value', ascending = False) 
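
For readers who want to try the idea end to end, here is a self-contained sketch on the sample data from the question. It swaps in pd.crosstab to build the 0/1 user-by-attribute matrix and np.triu_indices to pick the upper triangle; both substitutions are assumptions made for this illustration, not the code of the answer above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [["1234", "blond"], ["1235", "brunette"], ["1236", "blond"],
     ["1234", "tall"], ["1235", "tall"], ["1236", "short"]],
    columns=["User ID", "Attribute"],
)

# Indicator matrix: one row per user, one column per attribute.
# clip(upper=1) guards against duplicated (user, attribute) rows.
mat = pd.crosstab(df["User ID"], df["Attribute"]).clip(upper=1)

# Entry (a, b) counts the users that have both attribute a and attribute b.
co = mat.T.dot(mat)

# Keep each unordered pair once: indices strictly above the diagonal.
rows, cols = np.triu_indices(len(co), k=1)
result = pd.DataFrame({
    "Attr1": co.index.to_numpy()[rows],
    "Attr2": co.columns.to_numpy()[cols],
    "Overlap": co.to_numpy()[rows, cols],
})

print(result.sort_values("Overlap", ascending=False))
```

Unlike the stacked version, this keeps pairs whose overlap is zero, so pairs such as brunette and short still appear in the result.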
