我正在解決Instagram標籤問題。用戶通常會在發佈圖片時複製和粘貼標籤的「標籤」。針對不同主題的不同捆綁包。什麼是按大小和頻率對多個陣列的公共共享子陣列進行排序的有效方法?
因此,我可能會有我的「花園裏的東西」包,這將是[「花園」,「美麗的女人」,「treesoutside」,「greenlondon」]等等。它們通常長達二十至三十件。
有時他們可能會有幾個這樣的東西來保持變化。
我想要做的是通過查看他們發佈的過去的圖片,推薦使用一捆標籤。
這樣做,我將有標籤多個陣列,它們以前使用:
x = ["a", "b", "c", "d", "e"]
y = ["a", "b", "d", "e", "f", "g"]
z = ["a", "c", "d", "e", "f", "h"]
...
我想找到這些陣列條目的最大共同子集。
因此,在這種情況下,最大的子集在這三者中將是[「a」,「d」,「e」]。這很簡單,通過使用諸如x & y & z
之類的東西來實現天真。
不過,我想創建一個基於內的所有審議中的數組的大小和頻率,這些子集的排名,這樣我可以顯示標籤的最常用的包:
[
{bundle: ["a","d","e"], frequency: 3, size: 3},
{bundle: ["e","f"], frequency: 2, size: 2},
{bundle: ["a","b"], frequency: 2, size: 2},
{bundle: ["b","d"], frequency: 2, size: 2},
...
]
假設這些捆綁包的最小尺寸有限制,請說兩個項目。
我使用Elasticsearch進行索引,但是我發現嘗試使用聚合來做到這一點很具挑戰性,所以我將圖像提取到Ruby中,然後在那裏創建列表。
作爲第一遍,我遍歷了所有這些數組,然後使用MD5散列鍵作爲唯一標識符查找其他數組的所有子集。但是這限制了結果。我懷疑,再添加一些通行證使得這種方法效率很低。
require 'digest'
x = ["a", "b", "c", "d", "e"]
y = ["a", "b", "d", "e", "f", "g"]
z = ["a", "c", "d", "e", "f", "h"]
def bundle_report arrays
arrays = arrays.collect(&:sort)
working = {}
arrays.each do |array|
arrays.each do |comparison|
next if array == comparison
subset = array & comparison
key = Digest::MD5.hexdigest(subset.join(""))
working[key] ||= {subset: subset, frequency: 0}
working[key][:frequency] += 1
working[key][:size] = subset.length
end
end
working
end
puts bundle_report([x, y, z])
=> {"bb4a3fb7097e63a27a649769248433f1"=>{:subset=>["a", "b", "d", "e"], :frequency=>2, :size=>4}, "b6fdd30ed956762a88ef4f7e8dcc1cae"=>{:subset=>["a", "c", "d", "e"], :frequency=>2, :size=>4}, "ddf4a04e121344a6e7ee2acf71145a99"=>{:subset=>["a", "d", "e", "f"], :frequency=>2, :size=>4}}
添加第二遍這得到一個更好的結果:
def bundle_report arrays
arrays = arrays.collect(&:sort)
working = {}
arrays.each do |array|
arrays.each do |comparison|
next if array == comparison
subset = array & comparison
key = Digest::MD5.hexdigest(subset.join(""))
working[key] ||= {subset: subset, frequency: 0}
working[key][:frequency] += 1
working[key][:size] = subset.length
end
end
original_working = working.dup
original_working.each do |key, item|
original_working.each do |comparison_key, comparison|
next if item == comparison
subset = item[:subset] & comparison[:subset]
key = Digest::MD5.hexdigest(subset.join(""))
working[key] ||= {subset: subset, frequency: 0}
working[key][:frequency] += 1
working[key][:size] = subset.length
end
end
working
end
puts bundle_report([x, y, z])
=> {"bb4a3fb7097e63a27a649769248433f1"=>{:subset=>["a", "b", "d", "e"], :frequency=>2, :size=>4}, "b6fdd30ed956762a88ef4f7e8dcc1cae"=>{:subset=>["a", "c", "d", "e"], :frequency=>2, :size=>4}, "ddf4a04e121344a6e7ee2acf71145a99"=>{:subset=>["a", "d", "e", "f"], :frequency=>2, :size=>4}, "a562cfa07c2b1213b3a5c99b756fc206"=>{:subset=>["a", "d", "e"], :frequency=>6, :size=>3}}
您能否提供一個有效的方式來建立大亞的這個排名?
謝謝弗蘭基!我正在思考,我會回覆。真的很感謝你看看。 – stef