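Count unique ID overlap between two strings

I have a dataset with two columns. The first column contains unique user IDs and the second column contains attributes connected to those IDs.

For example: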

------------------------ 
User ID  Attribute 
------------------------ 
1234  blond 
1235  brunette 
1236  blond 
1234  tall 
1235  tall 
1236  short 
------------------------

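What I want to know is the correlation between attributes. In the example above, I want to know how many blond users are also tall. My desired output is: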

------------------------------ 
Attr 1  Attr 2  Overlap 
------------------------------ 
blond  tall   1 
blond  short  1 
brunette tall   1 
brunette short  0 
------------------------------

I tried pivoting the data with pandas and got some output, but since my dataset has hundreds of attributes, my current attempt is not feasible.

import pandas

df = pandas.read_csv('myfile.csv')

df.pivot_table(index='User ID', columns='Attribute', aggfunc=len, fill_value=0)

My current output:

-------------------------------- 
Blond Brunette Short Tall 
-------------------------------- 
    0  1   0  1 
    1  0   0  1 
    1  0   1  0 
--------------------------------

Is there a way to get the output I want? Thanks in advance.

Asked by MARWEBIST, 2016-11-02

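I think your first step should be to put this into a better relational form. Nothing here separates these attributes into hair color and height attributes – brianpck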

Indeed! I attempted an answer but couldn't make those distinctions –

You can do this by using itertools product to find every possible pair of attributes, and then matching the rows for each pair:

import pandas as pd
from itertools import product

# 1) create the pandas dataframe
df = [["1234", "blond"],
      ["1235", "brunette"],
      ["1236", "blond"],
      ["1234", "tall"],
      ["1235", "tall"],
      ["1236", "short"]]

df = pd.DataFrame(df)
df.columns = ["id", "attribute"]

# 2) build every possible pair of attributes
attributs = set(df.attribute)
for attribut1, attribut2 in product(attributs, attributs):
    if attribut1 != attribut2:
        # 3) select the ids of the rows for each attribute
        df1 = df[df.attribute == attribut1]["id"]
        df2 = df[df.attribute == attribut2]["id"]
        # 4) count the ids that match both attributes
        intersection = len(set(df1).intersection(set(df2)))
        if intersection:
            # 5) display the number of matches
            print(attribut1, attribut2, intersection)

Giving:

tall brunette 1 
tall blond 1 
brunette tall 1 
blond tall 1 
blond short 1 
short blond 1
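
The output above lists every pair twice, once in each order. One possible variation, sketched here under the assumption that the order within a pair does not matter, uses itertools.combinations so that each unordered pair is visited once. The names ids_by_attribute and pair_overlap are illustrative and not part of the original answer.

```python
from itertools import combinations

import pandas as pd

# Same sample data as in the answer above.
df = pd.DataFrame(
    [["1234", "blond"], ["1235", "brunette"], ["1236", "blond"],
     ["1234", "tall"], ["1235", "tall"], ["1236", "short"]],
    columns=["id", "attribute"],
)

# Build the set of ids for each attribute once, instead of filtering
# the dataframe again for every pair.
ids_by_attribute = {
    attribute: set(ids) for attribute, ids in df.groupby("attribute")["id"]
}

# combinations() yields each unordered pair of attributes exactly once.
pair_overlap = {}
for attr1, attr2 in combinations(sorted(ids_by_attribute), 2):
    overlap = len(ids_by_attribute[attr1] & ids_by_attribute[attr2])
    if overlap:
        pair_overlap[(attr1, attr2)] = overlap
        print(attr1, attr2, overlap)
```

Precomputing the id sets also avoids rescanning the dataframe for every pair, which may matter with hundreds of attributes.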

EDIT

It is then easy to refine this to get your desired output:

import pandas as pd
from itertools import product

# 1) create the pandas dataframe
df = [["1234", "blond"],
      ["1235", "brunette"],
      ["1236", "blond"],
      ["1234", "tall"],
      ["1235", "tall"],
      ["1236", "short"]]

df = pd.DataFrame(df)
df.columns = ["id", "attribute"]

wanted_attribute_1 = ["blond", "brunette"]

# 2) build every possible pair of attributes
attributs = set(df.attribute)
for attribut1, attribut2 in product(attributs, attributs):
    # keep only pairs whose first attribute is wanted and whose second is not
    if attribut1 in wanted_attribute_1 and attribut2 not in wanted_attribute_1:
        if attribut1 != attribut2:
            # 3) select the ids of the rows for each attribute
            df1 = df[df.attribute == attribut1]["id"]
            df2 = df[df.attribute == attribut2]["id"]
            # 4) count the ids that match both attributes
            intersection = len(set(df1).intersection(set(df2)))
            # 5) display the number of matches
            print(attribut1, attribut2, intersection)

Giving:

brunette tall 1 
brunette short 0 
blond tall 1 
blond short 1

Answered 2016-11-02 14:41:23

Thanks. This gives me the output I was looking for. How would I export the results to a .csv file? – MARWEBIST

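You should create a result dataframe that is empty at the start, then append [attribut1, attribut2, intersection] to it inside the loop (for append, see: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.append.html). A pandas dataframe provides a to_csv method that lets you save it to a file. –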
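
The export described in the comment above could look roughly like the following. This is a minimal sketch, not the commenter's code: it collects the rows in a plain list and builds the dataframe once at the end, rather than calling DataFrame.append inside the loop, since that method was removed in pandas 2.0. The file name overlap.csv and the column names are assumptions chosen for illustration.

```python
from itertools import product

import pandas as pd

# Same sample data as in the answer above.
df = pd.DataFrame(
    [["1234", "blond"], ["1235", "brunette"], ["1236", "blond"],
     ["1234", "tall"], ["1235", "tall"], ["1236", "short"]],
    columns=["id", "attribute"],
)
wanted_attribute_1 = ["blond", "brunette"]

# Collect one [attribute 1, attribute 2, overlap] row per pair.
rows = []
attributes = sorted(set(df.attribute))  # sorted for a stable row order
for attr1, attr2 in product(attributes, attributes):
    if attr1 in wanted_attribute_1 and attr2 not in wanted_attribute_1:
        ids1 = set(df[df.attribute == attr1]["id"])
        ids2 = set(df[df.attribute == attr2]["id"])
        rows.append([attr1, attr2, len(ids1 & ids2)])

# Build the result dataframe in one step and write it out.
result = pd.DataFrame(rows, columns=["Attr 1", "Attr 2", "Overlap"])
result.to_csv("overlap.csv", index=False)  # hypothetical output file name
```

Passing index=False keeps the row numbers out of the file, so the CSV contains only the three columns.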

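After you pivot the table, you can calculate the cross product of the transposed table with itself, then convert the upper triangular part of the result to long format: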

import pandas as pd 
import numpy as np 
mat = df.pivot_table(index='User ID', columns='Attribute', aggfunc=len, fill_value=0) 

# calculate the tcrossprod here
tprod = mat.T.dot(mat)
# extract the upper triangular part
result = tprod.where(np.triu(np.ones(tprod.shape, bool), 1), np.nan).stack().rename('value')
result.index.names = ['Attr1', 'Attr2']
result.reset_index().sort_values('value', ascending=False)

Answered by Psidom, 2016-11-02 14:42:44
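
For readers who want to try the matrix approach on the sample data, here is a self-contained sketch of the same idea. It makes a few substitutions that should be read as assumptions rather than as the answer's exact method: pd.crosstab builds the user-by-attribute table, clip(upper=1) guards against a user listing the same attribute twice, and np.triu_indices picks out the upper triangle instead of where/stack.

```python
import numpy as np
import pandas as pd

# Sample data from the question.
df = pd.DataFrame(
    [[1234, "blond"], [1235, "brunette"], [1236, "blond"],
     [1234, "tall"], [1235, "tall"], [1236, "short"]],
    columns=["User ID", "Attribute"],
)

# One row per user, one 0/1 column per attribute.
mat = pd.crosstab(df["User ID"], df["Attribute"]).clip(upper=1)

# Entry (a, b) counts the users that have both attribute a and attribute b.
tprod = mat.T.dot(mat)

# Take the strict upper triangle so each unordered pair appears once.
row_pos, col_pos = np.triu_indices(len(tprod), k=1)
result = pd.DataFrame({
    "Attr1": tprod.index.to_numpy()[row_pos],
    "Attr2": tprod.columns.to_numpy()[col_pos],
    "value": tprod.to_numpy()[row_pos, col_pos],
})
result = result.sort_values("value", ascending=False).reset_index(drop=True)
print(result)
```

Unlike the loop-based answer, this version reports every attribute pair, including pairs within the same category such as blond and brunette, so some filtering may still be needed to reach the exact table in the question.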
