2013-12-13 98 views
26

我知道如何在R中做到這一點。但是,熊貓中是否有任何功能將數據幀轉換爲包含兩個共存點的計數的nxn共生矩陣。在python熊貓中構建一個共現矩陣

例如一個矩陣DF:

import pandas as pd 

df = pd.DataFrame({'TFD' : ['AA', 'SL', 'BB', 'D0', 'Dk', 'FF'], 
        'Snack' : ['1', '0', '1', '1', '0', '0'], 
        'Trans' : ['1', '1', '1', '0', '0', '1'], 
        'Dop' : ['1', '0', '1', '0', '1', '1']}).set_index('TFD') 

print df 

>>> 
    Dop Snack Trans 
TFD     
AA 1  1  1 
SL 0  0  1 
BB 1  1  1 
D0 0  1  0 
Dk 1  0  0 
FF 1  0  1 

[6 rows x 3 columns] 

將產生:

Dop Snack Trans 

Dop 0  2  3 
Snack 2  0  2 
Trans 3  2  0 

由於矩陣對角線上的我想會有優化代碼的方式進行鏡像。

回答

29

這是一個簡單的線性代數,你乘矩陣的轉置(你的例子包含字符串,不要忘了將它們轉換爲整數):

>>> df_asint = df.astype(int) 
>>> coocc = df_asint.T.dot(df_asint) 
>>> coocc 
     Dop Snack Trans 
Dop  4  2  3 
Snack 2  3  2 
Trans 3  2  4 

如果用作R的答案,你想復位對角線,你可以使用numpy的的fill_diagonal

>>> import numpy as np 
>>> np.fill_diagonal(coocc.values, 0) 
>>> coocc 
     Dop Snack Trans 
Dop  0  2  3 
Snack 2  0  2 
Trans 3  2  0 
+1

我也許應該看看numpy的更多共同occurence矩陣。你只是把它的矩陣的點積和它的轉置。我想我可以在熊貓一步做到這一點 'df.T.dot(df)' 但我得到一個數據類型錯誤 – user3084006

+1

你有你的框架中的字符串,需要轉換像@alko建議或df.convert_objects (convert_numeric = True) – Jeff

+0

@Jeff yep我得到了代碼並在同一時間響應 – user3084006

5

演示在與NumPy:

import numpy as np 
np.random.seed(3) # for reproducibility 

# Generate data: 5 labels, 10 examples, binary. 
label_headers = 'Alice Bob Carol Dave Eve'.split(' ') 
label_data = np.random.randint(0,2,(10,5)) # binary here but could be any integer. 
print('labels:\n{0}'.format(label_data)) 

# Compute cooccurrence matrix 
cooccurrence_matrix = np.dot(label_data.transpose(),label_data) 
print('\ncooccurrence_matrix:\n{0}'.format(cooccurrence_matrix)) 

# Compute cooccurrence matrix in percentage 
# FYI: http://stackoverflow.com/questions/19602187/numpy-divide-each-row-by-a-vector-element 
#  http://stackoverflow.com/questions/26248654/numpy-return-0-with-divide-by-zero/32106804#32106804 
cooccurrence_matrix_diagonal = np.diagonal(cooccurrence_matrix) 
with np.errstate(divide='ignore', invalid='ignore'): 
    cooccurrence_matrix_percentage = np.nan_to_num(np.true_divide(cooccurrence_matrix, cooccurrence_matrix_diagonal[:, None])) 
print('\ncooccurrence_matrix_percentage:\n{0}'.format(cooccurrence_matrix_percentage)) 

輸出:

labels: 
[[0 0 1 1 0] 
[0 0 1 1 1] 
[0 1 1 1 0] 
[1 1 0 0 0] 
[0 1 1 0 0] 
[0 1 0 0 0] 
[0 1 0 1 1] 
[0 1 0 0 1] 
[1 0 0 1 0] 
[1 0 1 1 1]] 

cooccurrence_matrix: 
[[3 1 1 2 1] 
[1 6 2 2 2] 
[1 2 5 4 2] 
[2 2 4 6 3] 
[1 2 2 3 4]] 

cooccurrence_matrix_percentage: 
[[ 1.   0.33333333 0.33333333 0.66666667 0.33333333] 
[ 0.16666667 1.   0.33333333 0.33333333 0.33333333] 
[ 0.2   0.4   1.   0.8   0.4  ] 
[ 0.33333333 0.33333333 0.66666667 1.   0.5  ] 
[ 0.25  0.5   0.5   0.75  1.  ]] 

隨着熱圖使用matplotlib:

import numpy as np 
np.random.seed(3) # for reproducibility 

import matplotlib.pyplot as plt 


def show_values(pc, fmt="%.2f", **kw): 
    ''' 
    Heatmap with text in each cell with matplotlib's pyplot 
    Source: http://stackoverflow.com/a/25074150/395857 
    By HYRY 
    ''' 
    from itertools import izip 
    pc.update_scalarmappable() 
    ax = pc.get_axes() 
    for p, color, value in izip(pc.get_paths(), pc.get_facecolors(), pc.get_array()): 
     x, y = p.vertices[:-2, :].mean(0) 
     if np.all(color[:3] > 0.5): 
      color = (0.0, 0.0, 0.0) 
     else: 
      color = (1.0, 1.0, 1.0) 
     ax.text(x, y, fmt % value, ha="center", va="center", color=color, **kw) 

def cm2inch(*tupl): 
    ''' 
    Specify figure size in centimeter in matplotlib 
    Source: http://stackoverflow.com/a/22787457/395857 
    By gns-ank 
    ''' 
    inch = 2.54 
    if type(tupl[0]) == tuple: 
     return tuple(i/inch for i in tupl[0]) 
    else: 
     return tuple(i/inch for i in tupl) 

def heatmap(AUC, title, xlabel, ylabel, xticklabels, yticklabels): 
    ''' 
    Inspired by: 
    - http://stackoverflow.com/a/16124677/395857 
    - http://stackoverflow.com/a/25074150/395857 
    ''' 

    # Plot it out 
    fig, ax = plt.subplots()  
    c = ax.pcolor(AUC, edgecolors='k', linestyle= 'dashed', linewidths=0.2, cmap='RdBu', vmin=0.0, vmax=1.0) 

    # put the major ticks at the middle of each cell 
    ax.set_yticks(np.arange(AUC.shape[0]) + 0.5, minor=False) 
    ax.set_xticks(np.arange(AUC.shape[1]) + 0.5, minor=False) 

    # set tick labels 
    #ax.set_xticklabels(np.arange(1,AUC.shape[1]+1), minor=False) 
    ax.set_xticklabels(xticklabels, minor=False) 
    ax.set_yticklabels(yticklabels, minor=False) 

    # set title and x/y labels 
    plt.title(title) 
    plt.xlabel(xlabel) 
    plt.ylabel(ylabel)  

    # Remove last blank column 
    plt.xlim((0, AUC.shape[1])) 

    # Turn off all the ticks 
    ax = plt.gca()  
    for t in ax.xaxis.get_major_ticks(): 
     t.tick1On = False 
     t.tick2On = False 
    for t in ax.yaxis.get_major_ticks(): 
     t.tick1On = False 
     t.tick2On = False 

    # Add color bar 
    plt.colorbar(c) 

    # Add text in each cell 
    show_values(c) 

    # Proper orientation (origin at the top left instead of bottom left) 
    ax.invert_yaxis() 
    ax.xaxis.tick_top() 

    # resize 
    fig = plt.gcf() 
    fig.set_size_inches(cm2inch(40, 20)) 



def main(): 

    # Generate data: 5 labels, 10 examples, binary. 
    label_headers = 'Alice Bob Carol Dave Eve'.split(' ') 
    label_data = np.random.randint(0,2,(10,5)) # binary here but could be any integer. 
    print('labels:\n{0}'.format(label_data)) 

    # Compute cooccurrence matrix 
    cooccurrence_matrix = np.dot(label_data.transpose(),label_data) 
    print('\ncooccurrence_matrix:\n{0}'.format(cooccurrence_matrix)) 

    # Compute cooccurrence matrix in percentage 
    # FYI: http://stackoverflow.com/questions/19602187/numpy-divide-each-row-by-a-vector-element 
    #  http://stackoverflow.com/questions/26248654/numpy-return-0-with-divide-by-zero/32106804#32106804 
    cooccurrence_matrix_diagonal = np.diagonal(cooccurrence_matrix) 
    with np.errstate(divide='ignore', invalid='ignore'): 
     cooccurrence_matrix_percentage = np.nan_to_num(np.true_divide(cooccurrence_matrix, cooccurrence_matrix_diagonal[:, None])) 
    print('\ncooccurrence_matrix_percentage:\n{0}'.format(cooccurrence_matrix_percentage)) 

    # Add count in labels 
    label_header_with_count = [ '{0} ({1})'.format(label_header, cooccurrence_matrix_diagonal[label_number]) for label_number, label_header in enumerate(label_headers)] 
    print('\nlabel_header_with_count: {0}'.format(label_header_with_count)) 

    # Plotting 
    x_axis_size = cooccurrence_matrix_percentage.shape[0] 
    y_axis_size = cooccurrence_matrix_percentage.shape[1] 
    title = "Co-occurrence matrix\n" 
    xlabel= ''#"Labels" 
    ylabel= ''#"Labels" 
    xticklabels = label_header_with_count 
    yticklabels = label_header_with_count 
    heatmap(cooccurrence_matrix_percentage, title, xlabel, ylabel, xticklabels, yticklabels) 
    plt.savefig('image_output.png', dpi=300, format='png', bbox_inches='tight') # use format='svg' or 'pdf' for vectorial pictures 
    #plt.show() 


if __name__ == "__main__": 
    main() 
    #cProfile.run('main()') # if you want to do some profiling 

enter image description here

(PS:一個neat visualization of a co-occurrence matrix in D3.js。)

1

如果你有較大的語料庫和術語頻矩陣,使用稀疏矩陣乘法可能更有效。我使用矩陣乘法的相同技巧參考此頁上的algo答案。

import scipy.sparse as sp 
X = sp.csr_matrix(df.astype(int).values) # convert dataframe to sparse matrix 
Xc = X.T * X # multiply sparse matrix # 
Xc.setdiag(0) # reset diagonal 
print(Xc.todense()) # to print co-occurence matrix in dense format 

Xc這裏將是稀疏的CSR格式