2017-09-27 21 views
1

我正在解決Instagram標籤問題。用戶通常會在發佈圖片時複製和粘貼標籤的「標籤」。針對不同主題的不同捆綁包。什麼是按大小和頻率對多個陣列的公共共享子陣列進行排序的有效方法?

因此,我可能會有我的「花園裏的東西」包,這將是[「花園」,「美麗的女人」,「treesoutside」,「greenlondon」]等等。它們通常長達二十至三十件。

有時他們可能會有幾個這樣的東西來保持變化。

我想要做的是通過查看他們發佈的過去的圖片,推薦使用一捆標籤。

這樣做,我將有標籤多個陣列,它們以前使用:

x = ["a", "b", "c", "d", "e"] 
y = ["a", "b", "d", "e", "f", "g"] 
z = ["a", "c", "d", "e", "f", "h"] 
... 

我想找到這些陣列條目的最大共同子集。

因此,在這種情況下,最大的子集在這三者中將是[「a」,「d」,「e」]。這很簡單,通過使用諸如x & y & z之類的東西來實現天真。

不過,我想創建一個基於內的所有審議中的數組的大小和頻率,這些子集的排名,這樣我可以顯示標籤的最常用的包:

[ 
    {bundle: ["a","d","e"], frequency: 3, size: 3}, 
    {bundle: ["e","f"], frequency: 2, size: 2}, 
    {bundle: ["a","b"], frequency: 2, size: 2}, 
    {bundle: ["b","d"], frequency: 2, size: 2}, 
    ... 
] 

假設這些捆綁包的最小尺寸有限制,請說兩個項目。

我使用Elasticsearch進行索引,但是我發現嘗試使用聚合來做到這一點很具挑戰性,所以我將圖像提取到Ruby中,然後在那裏創建列表。

作爲第一遍,我遍歷了所有這些數組,然後使用MD5散列鍵作爲唯一標識符查找其他數組的所有子集。但是這限制了結果。我懷疑,再添加一些通行證使得這種方法效率很低。

require 'digest' 

x = ["a", "b", "c", "d", "e"] 
y = ["a", "b", "d", "e", "f", "g"] 
z = ["a", "c", "d", "e", "f", "h"] 


def bundle_report arrays 
    arrays = arrays.collect(&:sort) 
    working = {} 
    arrays.each do |array| 
    arrays.each do |comparison| 
     next if array == comparison 
     subset = array & comparison 
     key = Digest::MD5.hexdigest(subset.join("")) 
     working[key] ||= {subset: subset, frequency: 0} 
     working[key][:frequency] += 1 
     working[key][:size] = subset.length 
    end 
    end 
    working 
end 

puts bundle_report([x, y, z]) 
=> {"bb4a3fb7097e63a27a649769248433f1"=>{:subset=>["a", "b", "d", "e"], :frequency=>2, :size=>4}, "b6fdd30ed956762a88ef4f7e8dcc1cae"=>{:subset=>["a", "c", "d", "e"], :frequency=>2, :size=>4}, "ddf4a04e121344a6e7ee2acf71145a99"=>{:subset=>["a", "d", "e", "f"], :frequency=>2, :size=>4}} 

添加第二遍這得到一個更好的結果:

def bundle_report arrays 
    arrays = arrays.collect(&:sort) 
    working = {} 
    arrays.each do |array| 
    arrays.each do |comparison| 
     next if array == comparison 
     subset = array & comparison 
     key = Digest::MD5.hexdigest(subset.join("")) 
     working[key] ||= {subset: subset, frequency: 0} 
     working[key][:frequency] += 1 
     working[key][:size] = subset.length 
    end 
    end 

    original_working = working.dup 

    original_working.each do |key, item| 
    original_working.each do |comparison_key, comparison| 
     next if item == comparison 
     subset = item[:subset] & comparison[:subset] 
     key = Digest::MD5.hexdigest(subset.join("")) 
     working[key] ||= {subset: subset, frequency: 0} 
     working[key][:frequency] += 1 
     working[key][:size] = subset.length 
    end 
    end 
    working 
end 

puts bundle_report([x, y, z]) 
=> {"bb4a3fb7097e63a27a649769248433f1"=>{:subset=>["a", "b", "d", "e"], :frequency=>2, :size=>4}, "b6fdd30ed956762a88ef4f7e8dcc1cae"=>{:subset=>["a", "c", "d", "e"], :frequency=>2, :size=>4}, "ddf4a04e121344a6e7ee2acf71145a99"=>{:subset=>["a", "d", "e", "f"], :frequency=>2, :size=>4}, "a562cfa07c2b1213b3a5c99b756fc206"=>{:subset=>["a", "d", "e"], :frequency=>6, :size=>3}} 

您能否提供一個有效的方式來建立大亞的這個排名?

回答

1

而不是做每個數組與每個其他數組的交集,這可能會很快失去控制,我會試圖保持迄今爲止所見的所有可能組合的持久索引(在Elasticsearch中)並計算它們的頻率。然後,對於每一組新的標籤,對於該標籤的所有子組合,將頻率計數增加1。

這裏有一個快速素描:

require 'digest' 

def bundle_report(arrays, min_size = 2, max_size = 10) 

    combination_index = {} 

    arrays.each do |array| 

    (min_size..[max_size,array.length].min).each do |length| 

     array.combination(length).each do |combination| 

     key = Digest::MD5.hexdigest(combination.join('')) 

     combination_index[key] ||= {bundle: combination, frequency: 0, size: length} 
     combination_index[key][:frequency] += 1 

     end 

    end 

    end 

    combination_index.to_a.sort_by {|x| [x[1][:frequency], x[1][:size]] }.reverse 

end 

input_arrays = [ 
    ["a", "b", "c", "d", "e"], 
    ["a", "b", "d", "e", "f", "g"], 
    ["a", "c", "d", "e", "f", "h"] 
] 

bundle_report(input_arrays)[0..5].each do |x| 
    puts x[1] 
end 

導致:

{:bundle=>["a", "d", "e"], :frequency=>3, :size=>3} 
{:bundle=>["d", "e"], :frequency=>3, :size=>2} 
{:bundle=>["a", "d"], :frequency=>3, :size=>2} 
{:bundle=>["a", "e"], :frequency=>3, :size=>2} 
{:bundle=>["a", "d", "e", "f"], :frequency=>2, :size=>4} 
{:bundle=>["a", "b", "d", "e"], :frequency=>2, :size=>4} 

這可能不夠靈活,要麼雖然。

+0

謝謝弗蘭基!我正在思考,我會回覆。真的很感謝你看看。 – stef