2017-03-05 60 views
-3

我需要識別給定兩個字符串的所有子序列。最長的公共子序列將只識別最長的一個。但是在這裏我想要所有的子序列超過一個閾值。任何特定的算法或方法?識別兩個字符串中所有常見子序列的算法

像這樣

Julie loves me more than Linda loves me 
Jane likes me more than Julie loves me 

如果閾值是2,以下是這些2串的公共子序列

me more than 
loves me 

回答

1
Set<String> allCS;//create an empty set 
String[] subStrings = getSubSequences(string2); //find the subsequence of string2(smaller string) 
for (String str : subStrings) { 
    String lcs = LCS(string1, str); 
    if(lcs.length > THRESHOLD) { 
     allCS.put(lcs); 
    } 
} 

這裏,getSubSequences(String s)返回所有的子序列的給定字符串參數並且LCS(String s1, String s2)返回s1s2的LCS。

getSubSequences(String s)可以使用位掩碼方法或遞歸地實現。

LCS(String s1, String s2)可以使用O(n^2)動態規劃方法實現,稍後跟蹤DP表中的路徑以打印最長的子序列b字符串。

如果由於存在2^length(string) - 1子序列,所以如果較小的字符串很長,這將不會有效。

+0

@downvoter,把你的原因在這裏 –

0

由於這是一個算法問題,所以我認爲語言無關緊要。我的方法是生成這兩個字符串之間的所有子序列,並找到超過閾值的子序列。

Python代碼(Java的不應該是一個困難得多):

def common_subsequences(a, b, threshold): 
    # tokenize two string (keep order) 
    tokens_a = a.split() 
    tokens_b = b.split() 
    # store all common subsequences 
    common = set() 
    # with each token in a 
    for i, token in enumerate(tokens_a): 
     # if it also appears in b 
     # then this should be a starting point for a common subsequence 
     if token in tokens_b: 
      # get the first occurence of token in b 
      # and start from there 
      j = tokens_b.index(token) 
      k = i 
      temp = token 
      # since we need all subsequences, we get length-1 subsequences too 
      common.add(temp) 
      # while still have token in common 
      while j < len(tokens_b) and k < len(tokens_a): 
       if j + 1 < len(tokens_b) and k + 1 < len(tokens_a) and tokens_b[j+1] == tokens_a[k+1]: 
        temp += " " + tokens_b[j+1] 
        j += 1 
        k += 1 
        # adding (new) common subsequences 
        common.add(temp) 
       # or else we break 
       else: 
        break 
    # we only get the ones having length >= threshold 
    return [s for s in common if len(s.split()) >= threshold] 

a = "Julie loves me more than Linda loves me" 
b = "Jane likes me more than Julie loves me" 
print common_subsequences(a, b, 2) 

所有common_subsequences:

set(['me', 'more than', 'Julie', 'Julie loves', 'Julie loves me', 'me more', 'loves', 'more', 'than', 'me more than', 'loves me']) 

common_subsequences是> =門檻:

['more than', 'Julie loves', 'Julie loves me', 'me more', 'me more than', 'loves me'] 
相關問題