2017-08-15 33 views
2

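How do I convert my index vectors into sparse feature vectors that can be used with sklearn?

I am building a news recommendation system, and I need to build a table of users and the news they have read. My raw data looks like this: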

001436800277225 [12,456,157] 
009092130698762 [248] 
010003000431538 [361,521,83] 
010156461231357 [173,67,244] 
010216216021063 [203,97] 
010720006581483 [86] 
011199797794333 [142,12,86,411,201] 
011337201765123 [123,41] 
011414545455156 [62,45,621,435] 
011425002581540 [341,214,286] 

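The first column is the userID and the second column is the newsID. The newsID is an index column: for example, [12,456,157] in the first row means that this user has read news items 12, 456 and 157 (after conversion, columns 12, 456 and 157 of the sparse vector are 1 and all other columns are 0). I want to convert this data into a sparse vector format that can be used as input to sklearn's KMeans or DBSCAN algorithms. How can I do that?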

Answers

1

One option is to construct the sparse matrix explicitly. I often find it easier to build the matrix in COO matrix format and then convert it to CSR format.

from scipy.sparse import coo_matrix 

input_data = [ 
    ("001436800277225", [12,456,157]), 
    ("009092130698762", [248]), 
    ("010003000431538", [361,521,83]), 
    ("010156461231357", [173,67,244])  
] 

NUMBER_MOVIES = 1000 # number of columns; must be greater than the largest item index in the data 
NUMBER_USERS = len(input_data) # number of users in the model 

# you'll probably want to have a way to lookup the index for a given user id. 
user_row_map = {} 
user_row_index = 0 

# structures for coo format 
I,J,data = [],[],[] 
for user, movies in input_data: 

    if user not in user_row_map: 
        user_row_map[user] = user_row_index 
        user_row_index += 1 

    for movie in movies: 
        I.append(user_row_map[user]) 
        J.append(movie) 
        data.append(1) # number of times users watched the movie 

# create the matrix in COO format; then cast it to CSR which is much easier to use 
feature_matrix = coo_matrix((data, (I,J)), shape=(NUMBER_USERS, NUMBER_MOVIES)).tocsr() 
+0

`csr_matrix` accepts `coo`-style input. In practice, though, it does what you are doing: it builds a `coo` matrix and then converts it. – hpaulj
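
To illustrate the comment above, here is a minimal sketch that passes the same `(data, (row, col))` triplets straight to `csr_matrix` and skips the explicit `coo_matrix` step. The names `rows`, `cols`, `vals`, `n_cols` and `direct` are made up for this example, and it assumes each userID appears only once, so the row number is simply the position in `input_data`.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Same sample rows as in the answer: (userID, list of news indices).
input_data = [
    ("001436800277225", [12, 456, 157]),
    ("009092130698762", [248]),
    ("010003000431538", [361, 521, 83]),
    ("010156461231357", [173, 67, 244]),
]

# Assumption: one entry per user, so row i belongs to input_data[i].
rows = np.repeat(np.arange(len(input_data)),
                 [len(items) for _, items in input_data])
cols = np.concatenate([items for _, items in input_data])
vals = np.ones(len(cols), dtype=np.int8)

# Column indices are 0-based, so the matrix width is the largest index + 1.
n_cols = cols.max() + 1
direct = csr_matrix((vals, (rows, cols)), shape=(len(input_data), n_cols))
```

If the same (row, column) pair occurs more than once, the values are summed during construction, so repeated reads would show up as counts rather than 1.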

1

Use MultiLabelBinarizer from sklearn.preprocessing

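import pandas as pd 
from sklearn.preprocessing import MultiLabelBinarizer 

# df is assumed to be a DataFrame whose newsID column holds the lists of news indices 
mlb = MultiLabelBinarizer() 

pd.DataFrame(mlb.fit_transform(df.newsID), columns=mlb.classes_) 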

    12 41 45 62 67 83 86 97 123 142 ... 244 248 286 341 361 411 435 456 521 621 
0 1 0 0 0 0 0 0 0 0 0 ...  0 0 0 0 0 0 0 1 0 0 
1 0 0 0 0 0 0 0 0 0 0 ...  0 1 0 0 0 0 0 0 0 0 
2 0 0 0 0 0 1 0 0 0 0 ...  0 0 0 0 1 0 0 0 1 0 
3 0 0 0 0 1 0 0 0 0 0 ...  1 0 0 0 0 0 0 0 0 0 
4 0 0 0 0 0 0 0 1 0 0 ...  0 0 0 0 0 0 0 0 0 0 
5 0 0 0 0 0 0 1 0 0 0 ...  0 0 0 0 0 0 0 0 0 0 
6 1 0 0 0 0 0 1 0 0 1 ...  0 0 0 0 0 1 0 0 0 0 
7 0 1 0 0 0 0 0 0 1 0 ...  0 0 0 0 0 0 0 0 0 0 
8 0 0 1 1 0 0 0 0 0 0 ...  0 0 0 0 0 0 1 0 0 1 
9 0 0 0 0 0 0 0 0 0 0 ...  0 0 1 1 0 0 0 0 0 0 
+0

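Thank you very much. This is a nice way to do it. But my data is high-dimensional, roughly 800000 * 92000, where each row has fewer than 10 columns set to 1 and the other 90000+ columns are 0. I suspect this solution would waste a lot of resources, wouldn't it? –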

+0

`sklearn` may have a processor that creates sparse matrices, as described in https://stackoverflow.com/questions/45678491/python-data-structure-of-csr-matrix. The pandas sparse format is different from the `scipy` one. – hpaulj
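
One way to address the memory concern raised in the comments is the `sparse_output=True` option of `MultiLabelBinarizer`, which returns a SciPy sparse matrix instead of a dense array. The sketch below uses only a few rows from the question. `news_lists`, `col_of` and the cluster count of 2 are illustrative choices, not something the thread specifies.

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import MultiLabelBinarizer

# A few rows from the question; only the newsID lists are needed here.
news_lists = [[12, 456, 157], [248], [361, 521, 83], [173, 67, 244], [203, 97]]

mlb = MultiLabelBinarizer(sparse_output=True)
X = mlb.fit_transform(news_lists)  # SciPy sparse matrix, never densified

# Columns follow mlb.classes_ (the sorted distinct news IDs),
# not the raw index values, so keep a lookup from ID to column.
col_of = {news_id: j for j, news_id in enumerate(mlb.classes_)}

# KMeans accepts sparse input directly; n_clusters=2 is an arbitrary choice.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
```

Because only the non-zero entries are stored, memory use grows with the number of (user, news) pairs rather than with rows times columns. Skip the `pd.DataFrame(...)` wrapper on data of that size, since it would require a dense array again.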