2012-07-16 71 views
12

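I know this is not a specific coding question, but this is the most suitable place to ask it, so please bear with me. The topic is user-based filtering for recommender systems.

Suppose I have a dictionary like the one below, listing 10 likes for each person: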

likes={
    "rajat":{"music","x-men","programming","hindi","english","himesh","lil wayne","rap","travelling","coding"},
    "steve":{"travelling","pop","hanging out","friends","facebook","tv","skating","religion","english","chocolate"},
    "toby":{"programming","pop","rap","gardens","flowers","birthday","tv","summer","youtube","eminem"},
    "ravi":{"skating","opera","sony","apple","iphone","music","winter","mango shake","heart","microsoft"},
    "katy":{"music","pics","guitar","glamour","paris","fun","lip sticks","cute guys","rap","winter"},
    "paul":{"office","women","dress","casuals","action movies","fun","public speaking","microsoft","developer"},
    "sheila":{"heart","beach","summer","laptops","youtube","movies","hindi","english","cute guys","love"},
    "saif":{"women","beach","laptops","movies","himesh","world","earth","rap","fun","eminem"},
    "mark":{"pilgrimage","programming","house","world","books","country music","bob","tom hanks","beauty","tigers"},
    "stuart":{"rap","smart girls","music","wrestling","brock lesnar","country music","public speaking","women","coding","iphone"},
    "grover":{"skating","mountaineering","racing","athletics","sports","adidas","nike","women","apple","pop"},
    "anita":{"heart","sunidhi","hindi","love","love songs","cooking","adidas","beach","travelling","flowers"},
    "kelly":{"travelling","comedy","tv","facebook","youtube","cooking","horror","movies","dublin","animals"},
    "dino":{"women","games","xbox","x-men","assassin's creed","pop","rap","opera","need for speed","jeans"},
    "priya":{"heart","mountaineering","sky diving","sony","apple","pop","perfumes","luxury","eminem","lil wayne"},
    "brenda":{"cute guys","xbox","shower","beach","summer","english","french","country music","office","birds"}
}

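How can I determine which people have similar likes, or perhaps which two people are most similar to each other? It would also be helpful if you could point me to examples or tutorials on user-based or item-based filtering.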

+1

[Chapter 2](http://books.google.co.uk/books?id=fEsZ3Ey-Hq4C&lpg=PP1&pg=PA7#v=onepage&q&f=false) of Programming Collective Intelligence covers this fairly comprehensively. The example code is in Python, which is another plus. – 2012-07-16 10:52:02

+0

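I know that book, but it is quite old (published in 2007) and the web has changed a lot since then. So I don't think most of the book's examples would still work today. – 2012-07-16 10:55:37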

+4

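The basic techniques still apply to the sample data you provided. If you are looking for something more sophisticated/scalable, you may want to mention that in your question. It may also be worth mentioning what you have already tried or considered. – 2012-07-16 10:59:54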

Answers

10

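(Disclaimer: I am not an expert in this area and have only a passing knowledge of collaborative filtering. What follows is simply a collection of resources I have found useful.)

The basics of this are covered fairly comprehensively in Chapter 2 of the "Programming Collective Intelligence" book. The example code is in Python, which is another plus.

You may also find this site useful: A Programmer's Guide to Data Mining, in particular Chapter 2 and Chapter 3, which discuss recommender systems and item-based filtering.

In short, you can use techniques such as the Pearson Correlation Coefficient, Cosine Similarity, k-nearest neighbours, etc. to determine the similarity between users based on the items they have liked/bought/voted on.

Note that there are various Python libraries written for this purpose, e.g. pysuggest, Crab, python-recsys and SciPy.stats.stats.pearsonr.

For large datasets where the number of users exceeds the number of items, you can scale the solution better by inverting the data and computing the correlation between items instead (i.e. item-based filtering), then using that to infer similar users. Of course, you would not do this in real time; you would schedule periodic recomputation as a back-end task. Some of the approaches can be parallelised/distributed to cut computation time substantially (assuming you have the resources to throw at it).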
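For like-sets such as those in the question, where each like is simply present or absent, cosine similarity reduces to the size of the intersection divided by the geometric mean of the two set sizes. A minimal sketch, assuming a three-user excerpt of the question's data; the function name `cosine_similarity_sets` is made up for illustration:

```python
from math import sqrt

# Three users copied from the question's dictionary
likes = {
    "rajat": {"music", "x-men", "programming", "hindi", "english",
              "himesh", "lil wayne", "rap", "travelling", "coding"},
    "katy": {"music", "pics", "guitar", "glamour", "paris",
             "fun", "lip sticks", "cute guys", "rap", "winter"},
    "grover": {"skating", "mountaineering", "racing", "athletics", "sports",
               "adidas", "nike", "women", "apple", "pop"},
}

def cosine_similarity_sets(a, b):
    """Cosine similarity of two sets treated as binary (0/1) vectors."""
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))

for other in ("katy", "grover"):
    print(other, cosine_similarity_sets(likes["rajat"], likes[other]))
```

rajat and katy share two of their ten likes each, so they score above zero, while rajat and grover share nothing and score 0.0.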
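Inverting the data can be sketched roughly as follows. This is only an illustration on a trimmed toy version of the question's data (not the full sets), and it uses Jaccard overlap as a stand-in for the correlation measure; the helper names `invert` and `jaccard` are hypothetical:

```python
from collections import defaultdict

# Trimmed toy data: user -> set of liked items
likes = {
    "rajat": {"music", "rap", "programming"},
    "katy": {"music", "rap", "winter"},
    "toby": {"programming", "rap", "pop"},
    "ravi": {"music", "winter"},
}

def invert(likes):
    """Turn a user -> items mapping into an item -> users mapping."""
    liked_by = defaultdict(set)
    for user, items in likes.items():
        for item in items:
            liked_by[item].add(user)
    return dict(liked_by)

def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

liked_by = invert(likes)
# Two items count as similar when largely the same users like them
print(sorted(liked_by["music"]))
print(jaccard(liked_by["music"], liked_by["winter"]))
```

The item-to-item scores computed this way could then be cached and refreshed by the periodic back-end task.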

1

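The most basic approach I can think of is to find the intersection between each pair of people's lists of likes; the two people who match most closely will have the largest intersection.

You can use something like list(set(list1).intersection(list2)). This returns a list containing the items in the intersection.

Bear in mind that this approach does not scale well to a large number of entries, because it requires each user's likes to be compared against every other user's; its complexity is roughly O(n^2), where n is the number of users.

In some of your comments you mention collaborative filtering, but that usually applies when the same items are rated by different users. You then look for similarities between the ratings, so that you can make inferences for a user who has rated some items the same way as others but has not rated other items (here you use the ratings given by users who rated the shared items similarly). I don't think this is the same problem.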
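As a rough sketch of the pairwise comparison, `itertools.combinations` enumerates every pair of users exactly once, which is where the n(n-1)/2 comparisons come from. The data is a trimmed toy version of the question's dictionary and `most_similar_pair` is a hypothetical helper name:

```python
from itertools import combinations

# Trimmed toy data: user -> set of liked items
likes = {
    "rajat": {"music", "rap", "programming", "coding"},
    "stuart": {"rap", "music", "coding", "iphone"},
    "katy": {"music", "rap", "winter"},
    "grover": {"skating", "pop"},
}

def most_similar_pair(likes):
    """Return (user_a, user_b, shared_items) for the pair with the largest overlap."""
    best = max(
        combinations(sorted(likes), 2),
        key=lambda pair: len(likes[pair[0]] & likes[pair[1]]),
    )
    return best[0], best[1], likes[best[0]] & likes[best[1]]

a, b, shared = most_similar_pair(likes)
print(a, b, sorted(shared))
```

Raw intersection size favours users with many likes; dividing by the size of the union (Jaccard) is one common way to normalise it.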
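To make the rating-based setting concrete, here is a small sketch of predicting a missing rating as a similarity-weighted average of other users' ratings. Everything in it is hypothetical: the users, the films, the ratings, and the choice of 1 / (1 + Euclidean distance) as the similarity measure.

```python
from math import sqrt

# Hypothetical ratings on a 1-5 scale; bob has not rated film_c
ratings = {
    "ann": {"film_a": 5, "film_b": 1, "film_c": 4},
    "bob": {"film_a": 5, "film_b": 1},
    "cat": {"film_a": 1, "film_b": 5, "film_c": 2},
}

def similarity(ratings, u, v):
    """Similarity in (0, 1] derived from the Euclidean distance over co-rated items."""
    shared = set(ratings[u]) & set(ratings[v])
    if not shared:
        return 0.0
    dist = sqrt(sum((ratings[u][i] - ratings[v][i]) ** 2 for i in shared))
    return 1.0 / (1.0 + dist)

def predict(ratings, user, item):
    """Similarity-weighted average of the other users' ratings for item."""
    num = den = 0.0
    for other in ratings:
        if other == user or item not in ratings[other]:
            continue
        s = similarity(ratings, user, other)
        num += s * ratings[other][item]
        den += s
    return num / den if den else None

prediction = predict(ratings, "bob", "film_c")
print(prediction)
```

Because bob's existing ratings match ann's exactly and differ sharply from cat's, the prediction lands much closer to ann's 4 than to cat's 2.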

3

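SequenceMatcher in difflib is useful for this kind of thing. Its ratio() method returns a value between 0 and 1 corresponding to the similarity between two sequences. From the documentation:

Return a measure of the sequences' similarity as a float in the range [0, 1]. Where T is the total number of elements in both sequences, and M is the number of matches, this is 2.0*M / T. Note that this is 1.0 if the sequences are identical, and 0.0 if they have nothing in common.

Using your example and comparing only 'rajat' against everyone else (with the dictionary corrected by switching the inner {} to []):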
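One thing to keep in mind is that SequenceMatcher compares ordered sequences, so the same likes listed in a different order score lower than an exact match. A small sketch with made-up two-item lists (the wrapper name `ratio` is just for illustration):

```python
import difflib

def ratio(a, b):
    """2.0 * M / T for two sequences, as computed by SequenceMatcher."""
    return difflib.SequenceMatcher(None, a, b).ratio()

same_order = ratio(["music", "rap"], ["music", "rap"])  # M = 2, T = 4
swapped = ratio(["music", "rap"], ["rap", "music"])     # only one element can match in order
disjoint = ratio(["music", "rap"], ["pop", "tv"])       # M = 0
print(same_order, swapped, disjoint)
```

If the order of likes carries no meaning, a set-based measure such as the intersection size avoids this effect.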

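import difflib

# Python 2 syntax (print statement).
# Assumes the inner {} sets in likes were rewritten as [] lists,
# because SequenceMatcher needs ordered sequences.
for key in likes:
    print 'rajat', key, difflib.SequenceMatcher(None, likes['rajat'], likes[key]).ratio()
# Output: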
rajat sheila 0.2 
rajat katy 0.2 
rajat brenda 0.1 
rajat saif 0.2 
rajat dino 0.2 
rajat toby 0.2 
rajat mark 0.1 
rajat steve 0.1 
rajat priya 0.1 
rajat grover 0.0 
rajat ravi 0.1 
rajat rajat 1.0 
rajat stuart 0.2 
rajat kelly 0.1 
rajat paul 0.0 
rajat anita 0.2 
+0

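Thanks, but what I am looking for is something like "collaborative filtering". Any help with collaborative filtering would be much appreciated. – 2012-07-16 10:57:33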

0
# Python 2 syntax; likes holds sets, so & is set intersection
for k in likes:
    common = likes["rajat"] & likes[k]
    if common:
        print k, common
    else:
        print k, "No Likes with rajat"

Output 

sheila set(['hindi', 'english']) 
katy set(['music', 'rap']) 
brenda set(['english']) 
saif set(['himesh', 'rap']) 
dino set(['x-men', 'rap']) 
toby set(['programming', 'rap']) 
mark set(['programming']) 
steve set(['travelling', 'english']) 
priya set(['lil wayne']) 
grover No Likes with rajat 
ravi set(['music']) 
rajat set(['lil wayne', 'x-men', 'himesh', 'coding', 'programming', 'music', 'hindi', 'rap', 'english', 'travelling']) 
stuart set(['music', 'coding', 'rap']) 
kelly set(['travelling']) 
paul No Likes with rajat 
anita set(['travelling', 'hindi']) 

This compares the likes that 'rajat' has in common with the other members of the dictionary. There must be a better way to do this.

7

Use the python-recsys library [http://ocelma.net/software/python-recsys/build/html/quickstart.html]:

from recsys.algorithm.factorize import SVD 
from recsys.datamodel.data import Data 

likes={ 
    "rajat":{"music","x-men","programming","hindi","english","himesh","lil wayne","rap","travelling","coding"}, 
    "steve":{"travelling","pop","hanging out","friends","facebook","tv","skating","religion","english","chocolate"}, 
    "toby":{"programming","pop","rap","gardens","flowers","birthday","tv","summer","youtube","eminem"}, 
    "ravi":{"skating","opera","sony","apple","iphone","music","winter","mango shake","heart","microsoft"}, 
    "katy":{"music","pics","guitar","glamour","paris","fun","lip sticks","cute guys","rap","winter"}, 
    "paul":{"office","women","dress","casuals","action movies","fun","public speaking","microsoft","developer"}, 
    "sheila":{"heart","beach","summer","laptops","youtube","movies","hindi","english","cute guys","love"}, 
    "saif":{"women","beach","laptops","movies","himesh","world","earth","rap","fun","eminem"}, 
    "mark":{"pilgrimage","programming","house","world","books","country music","bob","tom hanks","beauty","tigers"}, 
    "stuart":{"rap","smart girls","music","wrestling","brock lesnar","country music","public speaking","women","coding","iphone"}, 
    "grover":{"skating","mountaineering","racing","athletics","sports","adidas","nike","women","apple","pop"}, 
    "anita":{"heart","sunidhi","hindi","love","love songs","cooking","adidas","beach","travelling","flowers"}, 
    "kelly":{"travelling","comedy","tv","facebook","youtube","cooking","horror","movies","dublin","animals"}, 
    "dino":{"women","games","xbox","x-men","assassin's creed","pop","rap","opera","need for speed","jeans"}, 
    "priya":{"heart","mountaineering","sky diving","sony","apple","pop","perfumes","luxury","eminem","lil wayne"}, 
    "brenda":{"cute guys","xbox","shower","beach","summer","english","french","country music","office","birds"} 
} 

data = Data() 
VALUE = 1.0 
for username in likes:
    for user_likes in likes[username]:
        data.add_tuple((VALUE, username, user_likes)) # Tuple format is: <value, row, column>

svd = SVD() 
svd.set_data(data) 
k = 5 # Usually, in a real dataset, you should set a higher number, e.g. 100 
svd.compute(k=k, min_values=3, pre_normalize=None, mean_center=False, post_normalize=True) 

svd.similar('sheila') 
svd.similar('rajat') 

Results:

In [11]: svd.similar('sheila') 
Out[11]: 
[('sheila', 0.99999999999999978), 
('brenda', 0.94929845546505753), 
('anita', 0.85943494201162518), 
('kelly', 0.53385495931440263), 
('saif', 0.39985366653259058), 
('rajat', 0.30757664244952165), 
('toby', 0.28541364367155014), 
('priya', 0.26184289111194581), 
('steve', 0.25043700194182622), 
('katy', 0.21812807229358305)] 

In [12]: svd.similar('rajat') 
Out[12]: 
[('rajat', 1.0000000000000004), 
('mark', 0.89164019482177692), 
('katy', 0.65207273451425907), 
('stuart', 0.61675507205285718), 
('steve', 0.55730648750670264), 
('anita', 0.49836982296014803), 
('brenda', 0.42759524471725929), 
('kelly', 0.40436047539358799), 
('toby', 0.35972227835054826), 
('ravi', 0.31113813325818901)] 
+0

Thanks! I have been looking for something like this for a while. – nickromano 2013-09-18 00:49:24

+0

Great library! (I notice you are the author.) However, it is not compatible with Python 3. – Siddhartha 2017-12-10 18:42:48

0

You can also use scikit-learn to do user-based filtering.

To take a simpler example, if you have:

"stuart":{"rap","rock"}

and you want to see how similar his taste in music is to:

"toby":{"hip-hop","pop","rap"}

you can use sklearn's pairwise cosine similarity function:

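from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# The two users' likes as lists of strings
stuart_list = ["rap", "rock"]
toby_list = ["hip-hop", "pop", "rap"]

# Character-level counts, with the vocabulary learned from stuart's items
vec = CountVectorizer(analyzer='char')
vec.fit(stuart_list)

# Rows are stuart's items, columns are toby's items
x = cosine_similarity(vec.transform(stuart_list),
                      vec.transform(toby_list))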

which will give you a cosine matrix such as:

[[ 0.166 0.327 1] 
[ 0.123 0.267 0.230]] 

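where the first row holds the cosine similarity of rap with each of toby's 3 choices. Note that a 1 means the two choices are exactly alike: in trigonometric terms, the angle between them is 0° (i.e. they are identical), so the cosine is 1.

The second row similarly holds the cosine similarity of rock with each of toby's choices.

I could not find a way in sklearn to get an overall similarity between two lists. However, given the cosine matrix, you could count the number of 1s in it and use that as the similarity figure. Or you could count the values of 0.9 and above, to allow for nearly identical words such as "hip-hop" and "hiphop".

(Sklearn also offers Euclidean distance, euclidean_distances, which can be used as an alternative to cosine similarity.)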
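The counting idea can be sketched in plain Python as follows. The matrix values are copied from the example matrix shown earlier in this answer; the function name `count_matches` and the 0.9 default are illustrative assumptions, not part of sklearn:

```python
# Cosine matrix copied from the example earlier in this answer
matrix = [
    [0.166, 0.327, 1.0],
    [0.123, 0.267, 0.230],
]

def count_matches(matrix, threshold=0.9):
    """Count the entries whose similarity is at or above threshold."""
    return sum(1 for row in matrix for value in row if value >= threshold)

print(count_matches(matrix))                  # near-identical pairs only
print(count_matches(matrix, threshold=0.25))  # a looser notion of a match
```

When the matrix comes from floating-point computation, a threshold slightly below 1 is safer than testing for exact equality with 1.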