識別兩個字符串中所有常見子序列的算法

-3

我需要識別給定兩個字符串的所有子序列。最長的公共子序列將只識別最長的一個。但是在這裏我想要所有的子序列超過一個閾值。任何特定的算法或方法？識別兩個字符串中所有常見子序列的算法

像這樣

Julie loves me more than Linda loves me 
Jane likes me more than Julie loves me

如果閾值是2，以下是這些2串的公共子序列

me more than 
loves me

來源

2017-03-05 Sameera SM

Set<String> allCS;//create an empty set 
String[] subStrings = getSubSequences(string2); //find the subsequence of string2(smaller string) 
for (String str : subStrings) { 
    String lcs = LCS(string1, str); 
    if(lcs.length > THRESHOLD) { 
     allCS.put(lcs); 
    } 
}

這裏，getSubSequences(String s)返回所有的子序列的給定字符串參數並且LCS(String s1, String s2)返回s1和s2的LCS。

getSubSequences(String s)可以使用位掩碼方法或遞歸地實現。

LCS(String s1, String s2)可以使用O(n^2)動態規劃方法實現，稍後跟蹤DP表中的路徑以打印最長的子序列b字符串。

如果由於存在2^length(string) - 1子序列，所以如果較小的字符串很長，這將不會有效。

來源

2017-03-05 04:50:00

@downvoter，把你的原因在這裏 –

由於這是一個算法問題，所以我認爲語言無關緊要。我的方法是生成這兩個字符串之間的所有子序列，並找到超過閾值的子序列。

Python代碼（Java的不應該是一個困難得多）：

def common_subsequences(a, b, threshold): 
    # tokenize two string (keep order) 
    tokens_a = a.split() 
    tokens_b = b.split() 
    # store all common subsequences 
    common = set() 
    # with each token in a 
    for i, token in enumerate(tokens_a): 
     # if it also appears in b 
     # then this should be a starting point for a common subsequence 
     if token in tokens_b: 
      # get the first occurence of token in b 
      # and start from there 
      j = tokens_b.index(token) 
      k = i 
      temp = token 
      # since we need all subsequences, we get length-1 subsequences too 
      common.add(temp) 
      # while still have token in common 
      while j < len(tokens_b) and k < len(tokens_a): 
       if j + 1 < len(tokens_b) and k + 1 < len(tokens_a) and tokens_b[j+1] == tokens_a[k+1]: 
        temp += " " + tokens_b[j+1] 
        j += 1 
        k += 1 
        # adding (new) common subsequences 
        common.add(temp) 
       # or else we break 
       else: 
        break 
    # we only get the ones having length >= threshold 
    return [s for s in common if len(s.split()) >= threshold] 

a = "Julie loves me more than Linda loves me" 
b = "Jane likes me more than Julie loves me" 
print common_subsequences(a, b, 2)

所有common_subsequences：

set(['me', 'more than', 'Julie', 'Julie loves', 'Julie loves me', 'me more', 'loves', 'more', 'than', 'me more than', 'loves me'])

common_subsequences是> =門檻：

['more than', 'Julie loves', 'Julie loves me', 'me more', 'me more than', 'loves me']

來源

2017-03-05 04:58:34 mrn0c

識別兩個字符串中所有常見子序列的算法

回答

相關問題