2017-02-07

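I am trying to programmatically remove one record from each pair of near-duplicates in my dataset. The dataset is logically similar to the table below. As you can see, it contains two rows that a human can easily recognize as related and probably added by the same person.
How to find the similarity between two rows of data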

[Image: sample dataset table]

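My approach was to use the Levenshtein distance to compare the individual fields (name, address, phone number) and get a similarity ratio for each. I then averaged the ratios, which gave 0.77873. That similarity score seems low. My Python code looks like this: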

from Levenshtein import ratio

# Similarity ratio (0 to 1) for each field of the two rows
name = ratio("Game of ThOnes Books for selling", "Selling Game of Thrones books")
address = ratio("George Washington street", "George Washington st.")
phone = ratio("555-55-55", "0(555)-55-55")

total_ratio = name + address + phone
print(total_ratio / 3)  # average ratio

My question is: what is the best way to compare two rows of data? Which algorithms or methods are needed for this?

Do you want a solution in R or Python?

@RonakShah It doesn't really matter. I just want a reasonable solution.

Answer

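We can compute a distance matrix between the rows, form clusters, and select the members of each cluster as candidates for similar rows.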
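For readers who prefer Python, the first step (a pairwise distance matrix for one column) might look roughly like the sketch below. It is not part of the original answer: it uses the standard library's difflib.SequenceMatcher as a stand-in similarity measure, and the helper names distance and distance_matrix are made up for illustration.

```python
from difflib import SequenceMatcher


def distance(a, b):
    """Dissimilarity in [0, 1]: 0 for identical strings, higher for less overlap."""
    # Assumption: case differences are not meaningful, so compare lower-cased text.
    return 1.0 - SequenceMatcher(None, a.lower(), b.lower()).ratio()


def distance_matrix(values):
    """Square matrix of pairwise distances, filled symmetrically."""
    n = len(values)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = distance(values[i], values[j])
    return matrix


addresses = ["George Washington street", "George Washington st.", "Central Avenue"]
address_dist = distance_matrix(addresses)
for row in address_dist:
    print(["%.2f" % d for d in row])
```

The two "George Washington" entries end up much closer to each other than either is to "Central Avenue", which is the signal the clustering step relies on.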

In R, the stringdistmatrix function from the stringdist package computes distances between string inputs.

The distance methods supported by stringdist are listed below. See the package manual for more details.

#Method name ; Description
#osa         ; Optimal string alignment (restricted Damerau-Levenshtein distance).
#lv          ; Levenshtein distance (as in R's native adist).
#dl          ; Full Damerau-Levenshtein distance.
#hamming     ; Hamming distance (a and b must have same nr of characters).
#lcs         ; Longest common substring distance.
#qgram       ; q-gram distance.
#cosine      ; Cosine distance between q-gram profiles.
#jaccard     ; Jaccard distance between q-gram profiles.
#jw          ; Jaro, or Jaro-Winkler distance.
#soundex     ; Distance based on soundex encoding (see below).
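To give a feel for what two of these methods measure, here is a small pure-Python sketch of a Levenshtein distance and a Jaccard distance over q-gram sets. These are my own illustrative implementations, written from the textbook definitions; they are not the stringdist code, and the q=2 default is an arbitrary choice for the example.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def qgrams(text, q=2):
    """Set of all substrings of length q."""
    return {text[i:i + q] for i in range(len(text) - q + 1)}


def jaccard_distance(a, b, q=2):
    """1 - |A & B| / |A | B| over the q-gram sets of the two strings."""
    grams_a, grams_b = qgrams(a, q), qgrams(b, q)
    union = grams_a | grams_b
    if not union:  # both strings shorter than q
        return 0.0
    return 1.0 - len(grams_a & grams_b) / len(union)


print(levenshtein("George Washington street", "George Washington st."))
print(jaccard_distance("George Washington street", "George Washington st."))
```

Edit-based distances such as Levenshtein are sensitive to word order, while q-gram based ones mostly are not, which matters for fields like the name column where the words are shuffled.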

Data:

library("stringdist") 

#have modified the data slightly to include dissimilar datapoints 
Date = c("07-Jan-17","06-Feb-17","03-Mar-17") 
name  = c("Game of ThOnes Books for selling","Selling Game of Thrones books","Harry Potter BlueRay") 
address = c("George Washington street","George Washington st.","Central Avenue") 
phone = c("555-55-55","0(555)-55-55","111-222-333") 
DF = data.frame(Date,name,address,phone,stringsAsFactors=FALSE) 

DF 
#       Date                             name                  address        phone
#1 07-Jan-17 Game of ThOnes Books for selling George Washington street    555-55-55
#2 06-Feb-17    Selling Game of Thrones books    George Washington st. 0(555)-55-55
#3 03-Mar-17             Harry Potter BlueRay           Central Avenue  111-222-333

Hierarchical clustering:

rowLabels = sapply(DF[,"name"],function(x) paste0(head(unlist(strsplit(x," ")),2),collapse="_")) 

#create string distance matrix, hierarchical cluster object and corresponding plot
nameDist = stringdistmatrix(DF[,"name"]) 
nameHC = hclust(nameDist) 

plot(nameHC,labels = rowLabels ,main="HC plot : name") 

[Figure: dendrogram "HC plot : name"]

addressDist = stringdistmatrix(DF[,"address"]) 
addressDistHC = hclust(addressDist) 

plot(addressDistHC ,labels = rowLabels, main="HC plot : address") 

[Figure: dendrogram "HC plot : address"]

phoneDist = stringdistmatrix(DF[,"phone"]) 
phoneHC = hclust(phoneDist) 

plot(phoneHC ,labels = rowLabels, main="HC plot : phone") 

[Figure: dendrogram "HC plot : phone"]
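If you want to see what the clustering step does without R, the following is a rough Python sketch of agglomerative clustering with complete linkage (the default method of hclust), stopped once k clusters remain, which is roughly what cutree(..., k) returns. The function name complete_linkage is hypothetical, and the distance used in the demo is difflib's ratio rather than a stringdist method, so the numbers will differ from the R plots.

```python
from difflib import SequenceMatcher


def complete_linkage(dist, k):
    """Merge the two closest clusters until k remain; return one label per item.

    dist is a square matrix (list of lists). The distance between two clusters
    is the largest distance between any pair of their members.
    """
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = max(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge b into a
        del clusters[b]
    labels = [0] * len(dist)
    for label, members in enumerate(clusters, start=1):
        for i in members:
            labels[i] = label
    return labels


phones = ["555-55-55", "0(555)-55-55", "111-222-333"]
# Stand-in distance: 1 - difflib similarity ratio (an assumption for this demo).
dist = [[1.0 - SequenceMatcher(None, a, b).ratio() for b in phones] for a in phones]
print(complete_linkage(dist, 2))
```

This brute-force version recomputes cluster distances on every pass, so it is only meant for small inputs; for real data a library implementation would be the sensible choice.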

Similar rows:

For this dataset the rows consistently form two clusters. To identify the members of each cluster, we can do:

clusterDF = data.frame(sapply(DF[,-1],function(x) cutree(hclust(stringdistmatrix(x)),2))) 
clusterDF$rowSummary = rowSums(clusterDF) 

clusterDF 
#  name address phone rowSummary
#1    1       1     1          3
#2    1       1     1          3
#3    2       2     2          6


#row frequency 

rowFreq = table(clusterDF$rowSummary) 
#3 6 
#2 1 

#we filter rows with frequency > 1 
similarRowValues = as.numeric(names(which(rowFreq>1))) 


DF[clusterDF$rowSummary %in% similarRowValues,] 
#       Date                             name                  address        phone
#1 07-Jan-17 Game of ThOnes Books for selling George Washington street    555-55-55
#2 06-Feb-17    Selling Game of Thrones books    George Washington st. 0(555)-55-55
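One caveat with summing cluster labels: different label combinations can add up to the same value (for example 1+2+2 and 2+2+1), so on larger data the sum may group rows that do not actually share clusters. A possible variant, sketched here in Python with a hypothetical helper name, is to group rows by the full tuple of per-field labels instead.

```python
from collections import defaultdict


def rows_sharing_all_clusters(labels_by_field):
    """Return groups of row indices whose cluster label matches in every field."""
    fields = list(labels_by_field)
    n_rows = len(labels_by_field[fields[0]])
    groups = defaultdict(list)
    for row in range(n_rows):
        # The signature is the row's cluster label in each field, in field order.
        signature = tuple(labels_by_field[field][row] for field in fields)
        groups[signature].append(row)
    return [rows for rows in groups.values() if len(rows) > 1]


# Labels copied from the clusterDF output above (rows are 0-indexed here).
labels = {"name": [1, 1, 2], "address": [1, 1, 2], "phone": [1, 1, 2]}
print(rows_sharing_all_clusters(labels))  # [[0, 1]]
```

Requiring a match in every field is strict; whether a majority of fields should be enough is a judgment call that depends on the data.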

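This demo works well for a simple toy dataset. For a real dataset you will have to tinker with the string distance method, the number of clusters, and so on, but I hope this points you in the right direction.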
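As one example of that kind of tinkering (not something the answer above does), normalizing each field before measuring distance can help a lot with the sample rows, where the names differ mainly in word order and the phone numbers mainly in punctuation. The sketch below is in Python with the standard library only; the normalization rules, including dropping leading zeros from phone numbers, are assumptions that would need checking against real data.

```python
import re
from difflib import SequenceMatcher


def normalize_name(text):
    """Lower-case and sort the words so that word order does not matter."""
    return " ".join(sorted(text.lower().split()))


def normalize_phone(text):
    """Keep digits only and drop leading zeros (an assumed convention)."""
    return re.sub(r"\D", "", text).lstrip("0")


def similarity(a, b):
    """difflib similarity ratio in [0, 1]; 1.0 means identical strings."""
    return SequenceMatcher(None, a, b).ratio()


name_sim = similarity(normalize_name("Game of ThOnes Books for selling"),
                      normalize_name("Selling Game of Thrones books"))
phone_sim = similarity(normalize_phone("555-55-55"),
                       normalize_phone("0(555)-55-55"))
print(name_sim, phone_sim)
```

After normalization the two phone numbers become identical strings, so any distance method will treat them as a perfect match.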
