Python: MemoryError when computing tf-idf cosine similarity between two columns in Pandas

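I'm trying to compute the cosine similarity of tf-idf vectors between two columns in a Pandas dataframe. One column contains a search query, the other a product title. The cosine similarity value is meant to be a "feature" for a search-engine/ranking machine-learning algorithm.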

I'm doing this in an IPython notebook and am unfortunately running into MemoryErrors. After a few hours of digging I still don't know why.

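My setup:

Lenovo E560 laptop
Core i7-6500U @ 2.50 GHz
16 GB RAM
Windows 10
Anaconda with a Python 3.5 kernel and all libraries freshly updated

I tested my code/goal on a small toy dataset, following a similar Stack Overflow question, like so: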

import pandas as pd 
from sklearn.feature_extraction.text import TfidfVectorizer 
from scipy import spatial 

clf = TfidfVectorizer() 

a = ['hello world', 'my name is', 'what is your name?', 'max cosine sim'] 
b = ['my name is', 'hello world', 'my name is what?', 'max cosine sim'] 

df = pd.DataFrame(data={'a':a, 'b':b}) 

clf.fit(df['a'] + " " + df['b']) 

tfidf_a = clf.transform(df['a']).todense() 
tfidf_b = clf.transform(df['b']).todense() 

row_similarities = [1 - spatial.distance.cosine(tfidf_a[x],tfidf_b[x]) for x in range(len(tfidf_a)) ] 

df['tfidf_cosine_similarity'] = row_similarities 

print(df)

This gives the following (good!) output:

                    a                 b  tfidf_cosine_similarity
0         hello world        my name is                 0.000000
1          my name is       hello world                 0.000000
2  what is your name?  my name is what?                 0.725628
3      max cosine sim    max cosine sim                 1.000000

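However, when I apply the same method to a dataframe (df_all_export) of dimensions 186,154 × 5, where two of the five columns are the query (search_term) and the document (product_title), like so: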

clf.fit(df_all_export['search_term'] + " " + df_all_export['product_title']) 

tfidf_a = clf.transform(df_all_export['search_term']).todense() 
tfidf_b = clf.transform(df_all_export['product_title']).todense() 

row_similarities = [1 - spatial.distance.cosine(tfidf_a[x],tfidf_b[x]) for x in range(len(tfidf_a)) ] 
df_all_export['tfidf_cosine_similarity'] = row_similarities 

df_all_export.head()

I get... (not the entire error here, but you get the idea):

MemoryError        Traceback (most recent call last) 
<ipython-input-27-8308fcfa8f9f> in <module>() 
    12 clf.fit(df_all_export['search_term'] + " " + df_all_export['product_title']) 
    13 
---> 14 tfidf_a = clf.transform(df_all_export['search_term']).todense() 
    15 tfidf_b = clf.transform(df_all_export['product_title']).todense() 
    16

I'm absolutely lost on this one, but I fear the solution will turn out to be very simple and elegant :)

Thank you in advance!

Source

2017-03-23 Bango

Please be sure to post the full stack trace so that we know where the error comes from. –

You can still use the sklearn.metrics.pairwise methods, which work with sparse matrices/arrays:

# I've executed your example up to (including): 
# ... 
clf.fit(df['a'] + " " + df['b']) 

A = clf.transform(df['a']) 

B = clf.transform(df['b']) 

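# import only the two functions used below instead of a wildcard import
from sklearn.metrics.pairwise import paired_cosine_distances, cosine_similarity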

paired_cosine_distances tells you how far apart, or how different, the strings are (it compares the values of the two columns row by row)

0 means a full match

In [136]: paired_cosine_distances(A, B) 
Out[136]: array([ 1.  , 1.  , 0.27437247, 0.  ])
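Since the question asks for similarities rather than distances, the values above can be flipped with 1 - distance. The following is a minimal sketch on the toy data from the question; it assumes nothing beyond the calls already used in this thread, and the column name is simply the one from the question.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import paired_cosine_distances

# Toy data from the question
df = pd.DataFrame({
    'a': ['hello world', 'my name is', 'what is your name?', 'max cosine sim'],
    'b': ['my name is', 'hello world', 'my name is what?', 'max cosine sim'],
})

clf = TfidfVectorizer()
clf.fit(df['a'] + " " + df['b'])

# transform() returns sparse CSR matrices; no .todense() call is needed
A = clf.transform(df['a'])
B = clf.transform(df['b'])

# Row-by-row cosine distance, converted to cosine similarity
similarity = 1 - paired_cosine_distances(A, B)
df['tfidf_cosine_similarity'] = similarity

print(df)
```

The third row should come out close to the 0.725628 shown in the question's output, since 1 - 0.27437247 is the same number.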

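cosine_similarity compares the first string in column a with all strings in column b (row 1), the second string in column a with all strings in column b (row 2), and so on...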

In [137]: cosine_similarity(A, B) 
Out[137]: 
array([[ 0.  , 1.  , 0.  , 0.  ], 
     [ 1.  , 0.  , 0.74162106, 0.  ], 
     [ 0.43929881, 0.  , 0.72562753, 0.  ], 
     [ 0.  , 0.  , 0.  , 1.  ]]) 

In [141]: A 
Out[141]: 
<4x10 sparse matrix of type '<class 'numpy.float64'>' 
     with 12 stored elements in Compressed Sparse Row format> 

In [142]: B 
Out[142]: 
<4x10 sparse matrix of type '<class 'numpy.float64'>' 
     with 12 stored elements in Compressed Sparse Row format>
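To connect the two functions: the diagonal of the cosine_similarity matrix should hold the same row-by-row values as 1 - paired_cosine_distances. The sketch below checks that on the toy data. The helper full_matrix_bytes is a hypothetical name introduced here only to show how quickly the full n × n result grows.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, paired_cosine_distances

a = ['hello world', 'my name is', 'what is your name?', 'max cosine sim']
b = ['my name is', 'hello world', 'my name is what?', 'max cosine sim']

clf = TfidfVectorizer().fit([x + " " + y for x, y in zip(a, b)])
A = clf.transform(a)
B = clf.transform(b)

full = cosine_similarity(A, B)               # shape (4, 4): every a-row vs every b-row
paired = 1 - paired_cosine_distances(A, B)   # shape (4,): a-row i vs b-row i only

# The diagonal of the full matrix is the row-by-row similarity
same = np.allclose(np.diag(full), paired)
print(same)


def full_matrix_bytes(n_rows, itemsize=8):
    """Bytes needed for a dense n_rows x n_rows float64 result (hypothetical helper)."""
    return n_rows * n_rows * itemsize


# For the 186,154 rows mentioned in the question
print(full_matrix_bytes(186154) / 1e9, "GB")
```

For a row-by-row feature, the paired function is the natural choice: at the row count from the question, the full matrix would need roughly 277 GB as dense float64, only to have everything but its diagonal thrown away.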

Note: all calculations were done using sparse matrices, so we never had to unpack them in memory!
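To see why avoiding .todense() matters, it may help to compare what a CSR matrix actually stores with what a dense array of the same shape would need. This is a rough sketch: the random matrix is only a stand-in for a tf-idf matrix, and the vocabulary size of 50,000 is an assumption, since the thread never states the real one. The helpers csr_bytes and dense_bytes are hypothetical names.

```python
import scipy.sparse as sp


def csr_bytes(m):
    """Bytes held by the three internal arrays of a CSR matrix."""
    return m.data.nbytes + m.indices.nbytes + m.indptr.nbytes


def dense_bytes(shape, itemsize=8):
    """Bytes a dense float64 array of the given shape would need."""
    rows, cols = shape
    return rows * cols * itemsize


# Stand-in for a tf-idf matrix: 1,000 rows, 5,000 terms, 0.1% non-zero entries
X = sp.random(1000, 5000, density=0.001, format='csr', random_state=0)

sparse_size = csr_bytes(X)
dense_size = dense_bytes(X.shape)
print(sparse_size, "bytes sparse vs", dense_size, "bytes dense")

# Same arithmetic for the question's row count and an ASSUMED vocabulary size
assumed_vocab = 50_000
print(dense_bytes((186154, assumed_vocab)) / 1e9, "GB if densified")
```

Under that assumed vocabulary size, a single densified matrix would already be several times larger than the 16 GB of RAM listed in the question, which is consistent with the MemoryError being raised on the .todense() line.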

Source

2017-03-23 10:19:28 MaxU

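Thank you very much! I implemented your solution and it works like a charm. While waiting for an answer I tried workarounds using lists and other approaches, without success. Your solution runs well and fast :) – Bango

@Bango, glad I could help :) – MaxU

Thanks to the kind help and the solution posted by MaxU above, here is the complete code for the task I was trying to accomplish. Besides the MemoryError, strange NaNs also appeared in the cosine-similarity calculations when I tried some "hacky" workarounds.

Note that the code below is a partial snippet, in the sense that the large dataframe df_all_export of dimensions 186,134 x 5 has already been built in the full code.

I hope this helps others who are trying to compute cosine similarity with tf-idf vectors between a search query and a matching document. For such a common "problem", I struggled to find a clear solution implemented with SKLearn and Pandas.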

import pandas as pd 
from sklearn.feature_extraction.text import TfidfVectorizer 
from sklearn.metrics.pairwise import paired_cosine_distances as pcd 

clf = TfidfVectorizer() 

clf.fit(df_all_export['search_term'] + " " + df_all_export['product_title']) 

A = clf.transform(df_all_export['search_term']) 
B = clf.transform(df_all_export['product_title']) 

cosine = 1 - pcd(A, B) 

df_all_export['tfidf_cosine'] = cosine
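As a possible alternative, not something proposed in the thread: TfidfVectorizer L2-normalizes each row by default (norm='l2'), so the row-by-row cosine similarity can also be read off as a sparse row-wise dot product. The sketch below uses the toy data rather than df_all_export and assumes the default norm setting is left unchanged.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import paired_cosine_distances

queries = ['hello world', 'my name is', 'what is your name?', 'max cosine sim']
titles = ['my name is', 'hello world', 'my name is what?', 'max cosine sim']

# Default norm='l2': every non-empty row of the output has unit length
clf = TfidfVectorizer()
clf.fit([q + " " + t for q, t in zip(queries, titles)])

A = clf.transform(queries)
B = clf.transform(titles)

# Element-wise product stays sparse; summing each row gives the dot product,
# which equals the cosine similarity for unit-length rows
rowwise = np.asarray(A.multiply(B).sum(axis=1)).ravel()

# Cross-check against the approach used in the answer above
reference = 1 - paired_cosine_distances(A, B)
print(np.allclose(rowwise, reference))
```

With this formulation a row that contains no recognized tokens is all zeros and simply yields a similarity of 0, with no division involved.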

Source

2017-03-23 12:47:34 Bango
