How do I create a scipy sparse matrix from a pandas dataframe?

I am looking for a better way to create a scipy sparse matrix from a pandas dataframe.

Here is what I currently have, in pseudocode:

row = []; column = []; values = []
for each row of the dataframe
    for each column of the row
        add the row_id to row
        add the column_id to column
        add the value to values
sparse_matrix = sparse.coo_matrix((values, (row, column)), shape=(max(row)+1, max(column)+1))

But I believe there should be a better way to do this. What almost works is the following:

dataframe.unstack().to_sparse().to_coo()

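However, this returns a triple of (sparse matrix, column ids, row ids). The problem is that I need the row ids to actually be part of the sparse matrix.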

Here is a full example. I have a dataframe that looks like the following:

      instructor_id  primary_department_id
id
4109           2093                    129
6633           2093                    129
6634           2094                    129
6635           2095                    129

If I do the operation I mentioned above, I get:

ipdb> data = dataframe.unstack().to_sparse().to_coo()[0] 
ipdb> data 
<2x4 sparse matrix of type '<type 'numpy.int64'>' 
    with 8 stored elements in COOrdinate format> 
ipdb> print data 
    (0, 0) 2093 
    (0, 1) 2093 
    (0, 2) 2094 
    (0, 3) 2095 
    (1, 0) 129 
    (1, 1) 129 
    (1, 2) 129 
    (1, 3) 129

But I need:

ipdb> print data 
    (4109, 0) 2093 
    (6633, 0) 2093 
    (6634, 0) 2094 
    etc.

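I am willing to use any additional libraries or dependencies.

There seems to be a question that asks for the reverse operation, but I have not been able to find a solution for this one.

Source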

2016-05-07 kshikama

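"The problem is that I need the row ids to actually be part of the sparse matrix" - can you clarify what you mean by that? –

A complete working example program with hard-coded input data would help. I'm not sure why you want to turn a full, dense DataFrame into a sparse matrix - are you sure you want to do that? Why? –

Have you looked at the sparse version of pandas? There have been several recent questions about going back and forth between the pandas sparse structures and scipy sparse. http://pandas.pydata.org/pandas-docs/stable/sparse.html – hpaulj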
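For readers on a current pandas: `to_sparse()` was removed in pandas 1.0, so the one-liner from the question no longer runs there. A minimal sketch of the same idea with the `.sparse` accessor follows, assuming a pandas version that provides `DataFrame.sparse.to_coo()`. The names `positional` and `by_id` are made up for illustration. The accessor only knows row positions, so the ids are mapped back in by hand.

```python
import numpy as np
import pandas as pd
from scipy import sparse

df = pd.DataFrame(
    {"instructor_id": [2093, 2093, 2094, 2095],
     "primary_department_id": [129, 129, 129, 129]},
    index=pd.Index([4109, 6633, 6634, 6635], name="id"),
)

# Convert every column to a sparse dtype, then ask pandas for a COO matrix.
# Its rows are positions 0..3, not the ids stored in the index.
positional = df.astype(pd.SparseDtype("int64", 0)).sparse.to_coo()

# Map the positional row numbers back to the index labels so that the
# ids become the row coordinates of the matrix.
ids = df.index.to_numpy()
by_id = sparse.coo_matrix(
    (positional.data, (ids[positional.row], positional.col)),
    shape=(ids.max() + 1, df.shape[1]),
)
print(by_id)
```

This only works as written for non-negative integer ids, because they are used directly as row coordinates.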

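I don't have pandas installed, so I can't start with a dataframe. But let's assume I have extracted a numpy array from the dataframe (isn't there a method or attribute like values that does this?):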

In [40]: D 
Out[40]: 
array([[4109, 2093], # could be other columns 
     [6633, 2093], 
     [6634, 2094], 
     [6635, 2095]])

Making a sparse matrix from that is straightforward - I just need to extract or construct the 3 arrays:

In [41]: M=sparse.coo_matrix((D[:,1], (D[:,0], np.zeros(D.shape[0]))), 
    shape=(7000,1)) 

In [42]: M 
Out[42]: 
<7000x1 sparse matrix of type '<class 'numpy.int32'>' 
    with 4 stored elements in COOrdinate format> 

In [43]: print(M) 
    (4109, 0) 2093 
    (6633, 0) 2093 
    (6634, 0) 2094 
    (6635, 0) 2095
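On the question of a `values`-like attribute: pandas has `DataFrame.to_numpy()` (and the older `.values`). A small sketch of the step from a dataframe to the array `D`, assuming the ids live in the index as in the question. The `rows.max() + 1` shape is my substitution for the hard-coded 7000.

```python
import numpy as np
import pandas as pd
from scipy import sparse

df = pd.DataFrame(
    {"instructor_id": [2093, 2093, 2094, 2095]},
    index=pd.Index([4109, 6633, 6634, 6635], name="id"),
)

# reset_index() moves the ids into column 0, matching the layout of D above.
D = df.reset_index().to_numpy()

rows = D[:, 0]
cols = np.zeros(len(D), dtype=int)
# Size the matrix from the largest id instead of hard-coding the row count.
M = sparse.coo_matrix((D[:, 1], (rows, cols)), shape=(rows.max() + 1, 1))
print(M)
```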

=======================

Generalizing to two 'data' columns:

In [70]: D 
Out[70]: 
array([[4109, 2093, 128], 
     [6633, 2093, 129], 
     [6634, 2094, 127], 
     [6635, 2095, 126]]) 

In [76]: i,j,data=[],[],[] 

In [77]: for col in range(1,D.shape[1]): 
    i.extend(D[:,0]) 
    j.extend(np.zeros(D.shape[0],int)+(col-1)) 
    data.extend(D[:,col]) 
    ....:  

In [78]: i 
Out[78]: [4109, 6633, 6634, 6635, 4109, 6633, 6634, 6635] 

In [79]: j 
Out[79]: [0, 0, 0, 0, 1, 1, 1, 1] 

In [80]: data 
Out[80]: [2093, 2093, 2094, 2095, 128, 129, 127, 126] 

In [83]: M=sparse.coo_matrix((data,(i,j)),shape=(7000,D.shape[1]-1)) 

In [84]: M 
Out[84]: 
<7000x2 sparse matrix of type '<class 'numpy.int32'>' 
    with 8 stored elements in COOrdinate format> 

In [85]: print(M) 
    (4109, 0) 2093 
    (6633, 0) 2093 
    (6634, 0) 2094 
    (6635, 0) 2095 
    (4109, 1) 128 
    (6633, 1) 129 
    (6634, 1) 127 
    (6635, 1) 126

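I suspect you could also make a separate matrix for each column and combine them with the sparse.bmat (block) mechanism, but I'm most familiar with the coo format.

For another example of building a large sparse matrix from submatrices (there they overlap), see Compiling n submatrices into an NxN matrix in numpy. There I found a way to join the blocks with a faster array operation. It may be possible to do the same here. But I suspect that iterating over a few columns (with extend over many rows) is fine speed-wise.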
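For what it's worth, here is one way the loop could be replaced with array operations. This is a sketch, not the method from the linked answer: it assumes the same layout of `D` (ids in column 0, data in the remaining columns) and builds the coordinate arrays with `np.repeat` and `np.tile`.

```python
import numpy as np
from scipy import sparse

D = np.array([[4109, 2093, 128],
              [6633, 2093, 129],
              [6634, 2094, 127],
              [6635, 2095, 126]])

ids, values = D[:, 0], D[:, 1:]
n_rows, n_cols = values.shape

# values.ravel() walks the data row by row, so each id is repeated n_cols
# times and the column numbers 0..n_cols-1 are tiled once per row.
i = np.repeat(ids, n_cols)
j = np.tile(np.arange(n_cols), n_rows)
M = sparse.coo_matrix((values.ravel(), (i, j)), shape=(ids.max() + 1, n_cols))
print(M)
```

The entries are stored in row-major order rather than column by column, but the resulting matrix is the same.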

With bmat I can construct the same thing:

In [98]: I, J = D[:,0], np.zeros(D.shape[0],int) 

In [99]: M1=sparse.coo_matrix((D[:,1],(I, J)), shape=(7000,1)) 
In [100]: M2=sparse.coo_matrix((D[:,2],(I, J)), shape=(7000,1)) 

In [101]: print(sparse.bmat([[M1,M2]])) 
    (4109, 0) 2093 
    (6633, 0) 2093 
    (6634, 0) 2094 
    (6635, 0) 2095 
    (4109, 1) 128 
    (6633, 1) 129 
    (6634, 1) 127 
    (6635, 1) 126
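The bmat version extends to any number of data columns if the blocks are built in a list. A sketch under the same assumptions about `D`. The names `blocks` and `n_rows` are mine.

```python
import numpy as np
from scipy import sparse

D = np.array([[4109, 2093, 128],
              [6633, 2093, 129],
              [6634, 2094, 127],
              [6635, 2095, 126]])

ids = D[:, 0]
zeros = np.zeros(len(D), dtype=int)
n_rows = ids.max() + 1

# One (n_rows x 1) block per data column, laid out side by side
# as a single block row.
blocks = [sparse.coo_matrix((D[:, c], (ids, zeros)), shape=(n_rows, 1))
          for c in range(1, D.shape[1])]
M = sparse.bmat([blocks])
print(M)
```

`sparse.hstack(blocks)` should give the same matrix for this side-by-side case.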

Source

2016-05-07 04:21:07 hpaulj

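So for more columns, would you suggest combining sparse matrices? Or should I try something like 'M = sparse.coo_matrix((append(D[:,1], D[:,2]), (append(D[:,0], D[:,0]), append(np.zeros(D.shape[0]), np.ones(D.shape[0])))), shape=(7000,2))'? – kshikama

I mean that I need it for an arbitrary number of columns, so won't I end up needing a for loop anyway, like the one I proposed in my original post? – kshikama

Does my generalization to 2 data columns help? – hpaulj

A simple solution is: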

import numpy as np 
import pandas as pd 
df = pd.DataFrame(data = [[2093, 129], [2093, 129], [2094, 129], [2095, 129]], index = [4109, 6633, 6634, 6635], columns = ['instructor_id', 'primary_department_id']) 

from scipy.sparse import lil_matrix 
sparse_matrix = lil_matrix((df.index.max()+1, len(df.columns))) 
for k, column_name in enumerate(df.columns): 
    sparse_matrix[df.index.values, np.full(len(df), k)] = df[column_name].values

If you want to use a compressed format, you can convert it:

sparse_matrix = sparse_matrix.tocsc()
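One thing to watch: `lil_matrix` defaults to float64 when only a shape is given, so the ids above end up stored as floats. If integers matter, a variant that passes `dtype` explicitly might look like this (same data as above; `int_matrix` and `csr` are illustrative names).

```python
import numpy as np
import pandas as pd
from scipy.sparse import lil_matrix

df = pd.DataFrame(
    [[2093, 129], [2093, 129], [2094, 129], [2095, 129]],
    index=[4109, 6633, 6634, 6635],
    columns=["instructor_id", "primary_department_id"],
)

# Pass the integer dtype explicitly; otherwise the matrix holds float64.
int_matrix = lil_matrix((df.index.max() + 1, len(df.columns)), dtype=np.int64)
for k, column_name in enumerate(df.columns):
    # Row coordinates are the index labels, the column coordinate is k.
    int_matrix[df.index.to_numpy(), np.full(len(df), k)] = df[column_name].to_numpy()

# Convert to a compressed format once the matrix has been filled.
csr = int_matrix.tocsr()
print(csr.dtype, csr.nnz)
```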

Source

2017-08-25 01:37:57 user8514366
