我認爲你應該使用Longest Common Subsequence代替Levenshtein距離的長度。這對你的情況來說似乎是一個更好的指標。本質上,它優先插入和刪除替換,如我在我的評論中所建議的。 (Inggers) - >「Ingersoll」和「Ing」 - >「Boylan」(分數爲3和1)處理沒有問題的空間(「New Ing」 - >「New Ingersoll」得分7其中「New Ing」 - >「Boylan」再次得分1),並且如果遇到像「Ingsl」這樣的縮寫,也會很好地工作。
算法很簡單。如果您的兩個字符串的長度爲m和n,請按字符比較字符串的連續前綴(以空白前綴開頭),將分數保存在大小爲m + 1,n + 1的矩陣中。如果某一對匹配,則在前兩個前綴的分數中添加一個(矩陣中剩下的一行上一列);否則保持這些前綴的兩個分數中的最高分(比較上面的條目和立即留下的條目並採取最好的條件)。當你經歷了兩個字符串時,得分矩陣中的最後一項是LCS的長度。
爲 「Ingsll」 和 「的Ingersoll」
例記分矩陣:
0 1 2 3 4 5 6
I n g s l l
---------------
0 | 0 0 0 0 0 0 0
1 I | 0 1 1 1 1 1 1
2 n | 0 1 2 2 2 2 2
3 g | 0 1 2 3 3 3 3
4 e | 0 1 2 3 3 3 3
5 r | 0 1 2 3 3 3 3
6 s | 0 1 2 3 4 4 4
7 o | 0 1 2 3 4 4 4
8 l | 0 1 2 3 4 5 5
9 l | 0 1 2 3 4 5 6
下面是一個ObjC執行長度。這裏的大部分複雜性只是由於想要處理組合字符序列 - 例如@"o̶"
- 正確。
#import <Foundation/Foundation.h>
@interface NSString (WSSComposedLength)
- (NSUInteger)WSSComposedLength;
@end
@implementation NSString (WSSComposedLength)
- (NSUInteger)WSSComposedLength
{
__block NSUInteger length = 0;
[self enumerateSubstringsInRange:(NSRange){0, [self length]}
options:NSStringEnumerationByComposedCharacterSequences | NSStringEnumerationSubstringNotRequired
usingBlock:^(NSString *substring, NSRange substringRange, NSRange enclosingRange, BOOL *stop) {
length++;
}];
return length;
}
@end
@interface NSString (WSSLongestCommonSubsequence)
- (NSUInteger)WSSLengthOfLongestCommonSubsequenceWithString:(NSString *)target;
- (NSString *)WSSLongestCommonSubsequenceWithString:(NSString *)target;
@end
@implementation NSString (WSSLongestCommonSubsequence)
- (NSUInteger)WSSLengthOfLongestCommonSubsequenceWithString:(NSString *)target
{
NSUInteger * const * scores;
scores = [[self scoreMatrixForLongestCommonSubsequenceWithString:target] bytes];
return scores[[target WSSComposedLength]][[self WSSComposedLength]];
}
- (NSString *)WSSLongestCommonSubsequenceWithString:(NSString *)target
{
NSUInteger * const * scores;
scores = [[self scoreMatrixForLongestCommonSubsequenceWithString:target] bytes];
//FIXME: Implement this.
return nil;
}
- (NSData *)scoreMatrixForLongestCommonSubsequenceWithString:(NSString *)target{
NSUInteger selfLength = [self WSSComposedLength];
NSUInteger targetLength = [target WSSComposedLength];
NSMutableData * scoresData = [NSMutableData dataWithLength:(targetLength + 1) * sizeof(NSUInteger *)];
NSUInteger ** scores = [scoresData mutableBytes];
for(NSUInteger i = 0; i <= targetLength; i++){
scores[i] = [[NSMutableData dataWithLength:(selfLength + 1) * sizeof(NSUInteger)] mutableBytes];
}
/* Ranges in the enumeration Block are the same measure as
* -[NSString length] -- i.e., 16-bit code units -- as opposed to
* _composed_ length, which counts code points. Thus:
*
* Enumeration will miss the last character if composed length is used
* as the range and there's a substring range with length greater than one.
*/
NSRange selfFullRange = (NSRange){0, [self length]};
NSRange targetFullRange = (NSRange){0, [target length]};
/* Have to keep track of these indexes by hand, rather than using the
* Block's substringRange.location because, e.g., @"o̶", will have
* range {x, 2}, and the next substring will be {x+2, l}.
*/
__block NSUInteger col = 0;
__block NSUInteger row = 0;
[target enumerateSubstringsInRange:targetFullRange
options:NSStringEnumerationByComposedCharacterSequences
usingBlock:^(NSString * targetSubstring,
NSRange targetSubstringRange,
NSRange _, BOOL * _0)
{
row++;
col = 0;
[self enumerateSubstringsInRange:selfFullRange
options:NSStringEnumerationByComposedCharacterSequences
usingBlock:^(NSString * selfSubstring,
NSRange selfSubstringRange,
NSRange _, BOOL * _0)
{
col++;
NSUInteger newScore;
if([selfSubstring isEqualToString:targetSubstring]){
newScore = 1 + scores[row - 1][col - 1];
}
else {
NSUInteger upperScore = scores[row - 1][col];
NSUInteger leftScore = scores[row][col - 1];
newScore = MAX(upperScore, leftScore);
}
scores[row][col] = newScore;
}];
}];
return scoresData;
}
@end
int main(int argc, const char * argv[])
{
@autoreleasepool {
NSArray * testItems = @[@{@"source" : @"Ingso̶ll",
@"targets": @[
@{@"string" : @"Ingersoll",
@"score" : @6,
@"sequence" : @"Ingsll"},
@{@"string" : @"Boylan",
@"score" : @1,
@"sequence" : @"n"},
@{@"string" : @"New Ingersoll",
@"score" : @6,
@"sequence" : @"Ingsll"}]},
@{@"source" : @"Ing",
@"targets": @[
@{@"string" : @"Ingersoll",
@"score" : @3,
@"sequence" : @"Ing"},
@{@"string" : @"Boylan",
@"score" : @1,
@"sequence" : @"n"},
@{@"string" : @"New Ingersoll",
@"score" : @3,
@"sequence" : @"Ing"}]},
@{@"source" : @"New Ing",
@"targets": @[
@{@"string" : @"Ingersoll",
@"score" : @3,
@"sequence" : @"Ing"},
@{@"string" : @"Boylan",
@"score" : @1,
@"sequence" : @"n"},
@{@"string" : @"New Ingersoll",
@"score" : @7,
@"sequence" : @"New Ing"}]}];
for(NSDictionary * item in testItems){
NSString * source = item[@"source"];
for(NSDictionary * target in item[@"targets"]){
NSString * targetString = target[@"string"];
NSCAssert([target[@"score"] integerValue] ==
[source WSSLengthOfLongestCommonSubsequenceWithString:targetString],
@"");
// NSCAssert([target[@"sequence"] isEqualToString:
// [source longestCommonSubsequenceWithString:targetString]],
// @"");
}
}
}
return 0;
}
你可以改變打分到折扣插入?對於這個縮寫用例,似乎你應該將替代計算爲比插入更大的距離。就我所知,你根本就沒有在尋找拼寫錯誤。 – 2014-11-06 20:46:56
對,我在計算大部分情況下用序列替換空格。這實際上是一個很好的觀察。但是,你如何計算這些數據呢? – Moshe 2014-11-06 20:50:56
嗯,這取決於你的得分的細節,但我說的是 - 爲了你的目的,但不是在經典的萊文斯坦距離 - 「我」 - >「B」應該比「我」 >「Ie」。 – 2014-11-06 20:56:07