我需要識別給定兩個字符串的所有子序列。最長的公共子序列將只識別最長的一個。但是在這裏我想要所有的子序列超過一個閾值。任何特定的算法或方法?識別兩個字符串中所有常見子序列的算法
像這樣
Julie loves me more than Linda loves me
Jane likes me more than Julie loves me
如果閾值是2,以下是這些2串的公共子序列
me more than
loves me
我需要識別給定兩個字符串的所有子序列。最長的公共子序列將只識別最長的一個。但是在這裏我想要所有的子序列超過一個閾值。任何特定的算法或方法?識別兩個字符串中所有常見子序列的算法
像這樣
Julie loves me more than Linda loves me
Jane likes me more than Julie loves me
如果閾值是2,以下是這些2串的公共子序列
me more than
loves me
Set<String> allCS;//create an empty set
String[] subStrings = getSubSequences(string2); //find the subsequence of string2(smaller string)
for (String str : subStrings) {
String lcs = LCS(string1, str);
if(lcs.length > THRESHOLD) {
allCS.put(lcs);
}
}
這裏,getSubSequences(String s)
返回所有的子序列的給定字符串參數並且LCS(String s1, String s2)
返回s1
和s2
的LCS。
getSubSequences(String s)
可以使用位掩碼方法或遞歸地實現。
LCS(String s1, String s2)
可以使用O(n^2)
動態規劃方法實現,稍後跟蹤DP表中的路徑以打印最長的子序列b字符串。
如果由於存在2^length(string) - 1
子序列,所以如果較小的字符串很長,這將不會有效。
由於這是一個算法問題,所以我認爲語言無關緊要。我的方法是生成這兩個字符串之間的所有子序列,並找到超過閾值的子序列。
Python代碼(Java的不應該是一個困難得多):
def common_subsequences(a, b, threshold):
# tokenize two string (keep order)
tokens_a = a.split()
tokens_b = b.split()
# store all common subsequences
common = set()
# with each token in a
for i, token in enumerate(tokens_a):
# if it also appears in b
# then this should be a starting point for a common subsequence
if token in tokens_b:
# get the first occurence of token in b
# and start from there
j = tokens_b.index(token)
k = i
temp = token
# since we need all subsequences, we get length-1 subsequences too
common.add(temp)
# while still have token in common
while j < len(tokens_b) and k < len(tokens_a):
if j + 1 < len(tokens_b) and k + 1 < len(tokens_a) and tokens_b[j+1] == tokens_a[k+1]:
temp += " " + tokens_b[j+1]
j += 1
k += 1
# adding (new) common subsequences
common.add(temp)
# or else we break
else:
break
# we only get the ones having length >= threshold
return [s for s in common if len(s.split()) >= threshold]
a = "Julie loves me more than Linda loves me"
b = "Jane likes me more than Julie loves me"
print common_subsequences(a, b, 2)
所有common_subsequences:
set(['me', 'more than', 'Julie', 'Julie loves', 'Julie loves me', 'me more', 'loves', 'more', 'than', 'me more than', 'loves me'])
common_subsequences是> =門檻:
['more than', 'Julie loves', 'Julie loves me', 'me more', 'me more than', 'loves me']
@downvoter,把你的原因在這裏 –