2013-08-24 24 views
0

Remove duplicated text from multiple strings

a = "This is Product A with property B and propery C. Buy it now!" 
b = "This is Product B with property X and propery Y. Buy it now!" 
c = "This is Product C having no properties. Buy it now!" 

I'm looking for an algorithm that can do this:

> magic(a, b, c) 
=> ['A with property B and propery C', 
    'B with property X and propery Y', 
    'C having no properties'] 

I have to find the duplicated parts in 1000+ texts. Top performance is not a must, but it would be nice.

- Update

I'm looking for sequences of words. So if:

d = 'This is Product D with text engraving: "Buy". Buy it now!' 

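the first "Buy" should not count as a duplicate. I guess I have to require a run of at least n consecutive words before treating it as duplicated.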
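One way to read the update is to compare whole words rather than characters, and to strip a shared run only when it is at least n words long. The following is a minimal sketch under that reading, not a definitive implementation: `common_word_prefix`, `magic_words` and the `min_words` parameter are hypothetical names, and it only considers runs at the start and end of each text. Because it splits on whitespace, punctuation stays attached to its word, so the trailing period is kept.

```ruby
# Hypothetical helper: how many leading words every word list shares.
def common_word_prefix(word_lists)
  limit = word_lists.map(&:size).min || 0
  (0...limit).find { |i| word_lists.map { |w| w[i] }.uniq.size > 1 } || limit
end

# Sketch: drop a shared run of words from the start and the end of each text,
# but only if the run is at least min_words long (assumed stand-in for "n").
def magic_words(texts, min_words: 2)
  lists = texts.map(&:split)
  pre = common_word_prefix(lists)
  pre = 0 if pre < min_words
  lists = lists.map { |w| w.drop(pre) }
  suf = common_word_prefix(lists.map(&:reverse))
  suf = 0 if suf < min_words
  lists.map { |w| w.take(w.size - suf).join(" ") }
end

a = "This is Product A with property B and propery C. Buy it now!"
d = 'This is Product D with text engraving: "Buy". Buy it now!'
magic_words([a, d], min_words: 3)
# => ["A with property B and propery C.", "D with text engraving: \"Buy\"."]
```

The quoted "Buy" survives because it is a single word in the middle of the text, not part of the shared three-word run at the end.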

+2

The question is unclear. How do you define duplicated text? –

+1

Why isn't "with property" treated as a duplicate when it repeats? :D – fl00r

+1

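1) What if there is a fourth string, "Bumblebee zebra"? Would `magic(a, b, c, d)` be expected to return all four strings unmodified? 2) How is positional information expected to be used? For example, the `magic` example removes "Buy it now!" even though it sits at a different position in each string. Perhaps you are looking for a `diff` function? –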

Answers

3
def common_prefix_length(*args) 
    first = args.shift 
    (0..first.size).find_index { |i| args.any? { |a| a[i] != first[i] } } 
end 

def magic(*args) 
    i = common_prefix_length(*args) 
    args = args.map { |a| a[i..-1].reverse } 
    i = common_prefix_length(*args) 
    args.map { |a| a[i..-1].reverse } 
end 

a = "This is Product A with property B and propery C. Buy it now!" 
b = "This is Product B with property X and propery Y. Buy it now!" 
c = "This is Product C having no properties. Buy it now!" 

magic(a,b,c) 
# => ["A with property B and propery C", 
#  "B with property X and propery Y", 
#  "C having no properties"] 
+0

I like that your solution looks at sequences rather than individual words! – Willian
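One edge case in the answer above: when every input string is identical, the `find_index` block never finds a mismatch, so `common_prefix_length` returns `nil` instead of a length. A minimal sketch of a guarded variant follows; `common_prefix_length_or_size` and `magic_guarded` are hypothetical names, and returning empty strings for identical inputs is an assumption about the desired behaviour.

```ruby
# Hypothetical variant: falls back to the shortest length when no mismatch exists.
def common_prefix_length_or_size(*strings)
  first, *rest = strings
  return 0 if first.nil?
  limit = strings.map(&:size).min
  (0...limit).find { |i| rest.any? { |s| s[i] != first[i] } } || limit
end

# Same prefix-then-reversed-suffix idea as the answer, using the guarded helper.
def magic_guarded(*strings)
  i = common_prefix_length_or_size(*strings)
  tails = strings.map { |s| s[i..-1].reverse }
  j = common_prefix_length_or_size(*tails)
  tails.map { |t| t[j..-1].reverse }
end

magic_guarded("same text", "same text")
# => ["", ""]
```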

3

Your data

sentences = [ 
    "This is Product A with property B and propery C. Buy it now!", 
    "This is Product B with property X and propery Y. Buy it now!", 
    "This is Product C having no properties. Buy it now!" 
] 

Your magic

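def magic(data)
    prefix, postfix = 0, -1
    shortest = data.map(&:size).min || 0 # never index past the shortest string
    # advance prefix while every string has the same character at that index
    prefix < shortest && data.map{ |d| d[prefix] }.uniq.size == 1 && prefix += 1 or break while true
    # move postfix left while the last characters match and do not overlap the prefix
    prefix - postfix <= shortest && data.map{ |d| d[postfix] }.uniq.size == 1 && postfix -= 1 or break while true
    data.map{ |d| d[prefix..postfix] }
end

Your output

magic(sentences)
#=> [
#=> "A with property B and propery C",
#=> "B with property X and propery Y",
#=> "C having no properties"
#=> ]

Or you can use `loop` instead of `while true`:

def magic(data)
    prefix, postfix = 0, -1
    shortest = data.map(&:size).min || 0 # never index past the shortest string
    loop{ prefix < shortest && data.map{ |d| d[prefix] }.uniq.size == 1 && prefix += 1 or break }
    loop{ prefix - postfix <= shortest && data.map{ |d| d[postfix] }.uniq.size == 1 && postfix -= 1 or break }
    data.map{ |d| d[prefix..postfix] }
end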
+0

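When `data` happens to be an array of identical strings, your `magic` does not terminate. You have to check that a character exists in `d` at the `prefix` and `postfix` indices. – sawa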

+0

Good catch, @sawa! Fixed – fl00r

-1

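Edit: this code has bugs. I'm leaving my answer here for reference, because I don't like it when people delete their answers after being downvoted. Everyone makes mistakes :-)

I like @falsetru's approach but found the code unnecessarily complex. Here is my attempt: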

def common_prefix_length(strings) 
    i = 0 
    i += 1 while strings.map{|s| s[i] }.uniq.size == 1 
    i 
end 

def common_suffix_length(strings) 
    common_prefix_length(strings.map(&:reverse)) 
end 

def uncommon_infixes(strings) 
    pl = common_prefix_length(strings) 
    sl = common_suffix_length(strings) 
    strings.map{|s| s[pl...-sl] } 
end 
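Two problems can be seen in the code above: the `while` loop never stops when all strings are identical, because every index past the end yields `nil` for each of them, and `s[pl...-sl]` returns an empty string when there is no common suffix, since `-sl` is then `0`. The sketch below shows one possible guarded variant; `bounded_prefix_length` and `bounded_infixes` are hypothetical names, and bounding by the shortest string and slicing by length are choices made for this illustration.

```ruby
# Hypothetical variant: stop at the end of the shortest string.
def bounded_prefix_length(strings)
  limit = strings.map(&:size).min || 0
  i = 0
  i += 1 while i < limit && strings.map { |s| s[i] }.uniq.size == 1
  i
end

# Strip the common prefix first, then measure the suffix on what is left,
# and slice by length so that a suffix length of 0 keeps the whole remainder.
def bounded_infixes(strings)
  pl = bounded_prefix_length(strings)
  rest = strings.map { |s| s[pl..-1] }
  sl = bounded_prefix_length(rest.map(&:reverse))
  rest.map { |s| s[0, s.size - sl] }
end

bounded_infixes(["PREFIX x", "PREFIX y"])  # => ["x", "y"]
bounded_infixes(["abc", "abc"])            # => ["", ""]
```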

Since the OP may care about performance, I ran a quick benchmark:

require 'fruity' 
require 'securerandom' 

prefix = 'PREFIX ' 
suffix = ' SUFFIX' 
test_data = Array.new(1000) do 
    prefix + SecureRandom.hex + suffix 
end 

def fl00r_meth(data) 
    prefix, postfix = 0, -1 
    data.map{ |d| d[prefix] }.uniq.size == 1 && prefix += 1 or break while true 
    data.map{ |d| d[postfix] }.uniq.size == 1 && postfix -= 1 or break while true 
    data.map{ |d| d[prefix..postfix] } 
end 

def falsetru_common_prefix_length(*args) 
    first = args.shift 
    (0..first.size).find_index { |i| args.any? { |a| a[i] != first[i] } } 
end 

def falsetru_meth(*args) 
    i = falsetru_common_prefix_length(*args) 
    args = args.map { |a| a[i..-1].reverse } 
    i = falsetru_common_prefix_length(*args) 
    args.map { |a| a[i..-1].reverse } 
end 

def padde_common_prefix_length(strings) 
    i = 0 
    i += 1 while strings.map{|s| s[i] }.uniq.size == 1 
    i 
end 

def padde_common_suffix_length(strings) 
    padde_common_prefix_length(strings.map(&:reverse)) 
end 

def padde_meth(strings) 
    pl = padde_common_prefix_length(strings) 
    sl = padde_common_suffix_length(strings) 
    strings.map{|s| s[pl...-sl] } 
end 

compare do
    fl00r do
        fl00r_meth(test_data.dup)
    end

    falsetru do
        falsetru_meth(*test_data.dup)
    end

    padde do
        padde_meth(test_data.dup)
    end
end

This gives the following results:

Running each test once. Test will take about 1 second. 
fl00r is similar to padde 
padde is faster than falsetru by 30.000000000000004% ± 10.0% 
+1

Would the downvoter care to explain? –

+1

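When the data happens to be an array of identical strings, your code does not terminate. You have to check that a character exists in the strings at index `i`. – sawa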

+0

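Your code is similar to the first version of my answer. I changed it to the current version because I thought that creating and discarding intermediate arrays (`map {..}.uniq.size`) might hurt performance. According to your benchmark, I was wrong. ;) – falsetru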