在一個大的NSString中有效地找到許多關鍵字中的第一個

我需要在大的NSString（用於解析源代碼）中找到所有關鍵字，而且我當前的實現太慢，但我不確定如何改進它。在一個大的NSString中有效地找到許多關鍵字中的第一個

我使用NSRegularExpression，這是基於它比我能寫的任何東西都更優化的假設，但是性能比我預期的要慢。有誰知道更快的方式來實現這個？

目標字符串將包含utf-8字符，但關鍵字本身將始終爲純字母數字ascii。我想這可以用來優化一些東西？

@implementation MyClass 

// i'm storing the regular expression in a static variable, since it never changes and I need to re-use it often 
static NSRegularExpression *keywordsExpression; 

+ (void)initialize 
{ 
    [super initialize]; 

    NSArray *keywords = [NSArray arrayWithObjects:@"accumsan", @"adipiscing", @"aliquam", @"aliquet", @"amet", @"ante", @"arcu", @"at", @"commodo", @"congue", @"consectetur", @"consequat", @"convallis", @"cras", @"curabitur", @"cursus", @"dapibus", @"diam", @"dolor", @"dui", @"elit", @"enim", @"erat", @"eros", @"est", @"et", @"eu", @"felis", @"fermentum", @"gravida", @"iaculis", @"id", @"imperdiet", @"integer", @"ipsum", @"lacinia", @"lectus", @"leo", nil]; 

    NSString *pattern = [NSString stringWithFormat:@"\\b(%@)\\b", [keywords componentsJoinedByString:@"|"]; // \b(accumsan|adipiscing|aliquam|…)\b 
    keywordsExpression = [NSRegularExpression regularExpressionWithPattern:pattern] options:NSRegularExpressionCaseInsensitive error:NULL]; 
} 

// this method will be called in quick succession, I need it to be a able to run tens 
// of thousands of times per second. The target string is big (50KB or so), but the 
// search range is short, rarely more than 30 characters 
- (NSRange)findNextKeyword:(NSString *)string inRange:(NSRange)range 
{ 
    return [keywordsExpression rangeOfFirstMatchInString:string options:0 range:range]; 
} 

@end

編輯按@ CodeBrickie的回答，我已經更新了我的代碼對整個字符串進行一次正則表達式搜索，並保存火柴緩存NSIndexSet，則每次方法稱它搜索NSIndexSet的關鍵字範圍，而不是搜索字符串。結果大約快一個數量級：

@implementation MyClass 

static NSRegularExpression *keywordsExpression; 
static NSIndexSet *keywordIndexes = nil; 

+ (void)initialize 
{ 
    [super initialize]; 

    NSArray *keywords = [NSArray arrayWithObjects:@"accumsan", @"adipiscing", @"aliquam", @"aliquet", @"amet", @"ante", @"arcu", @"at", @"commodo", @"congue", @"consectetur", @"consequat", @"convallis", @"cras", @"curabitur", @"cursus", @"dapibus", @"diam", @"dolor", @"dui", @"elit", @"enim", @"erat", @"eros", @"est", @"et", @"eu", @"felis", @"fermentum", @"gravida", @"iaculis", @"id", @"imperdiet", @"integer", @"ipsum", @"lacinia", @"lectus", @"leo", nil]; 

    NSString *pattern = [NSString stringWithFormat:@"\\b(%@)\\b", [keywords componentsJoinedByString:@"|"]; // \b(accumsan|adipiscing|aliquam|…)\b 
    keywordsExpression = [NSRegularExpression regularExpressionWithPattern:pattern] options:NSRegularExpressionCaseInsensitive error:NULL]; 
} 

- (void)prepareToFindKeywordsInString:(NSString *)string 
{ 
    NSMutableIndexSet *keywordIndexesMutable = [[NSIndexSet indexSet] mutableCopy]; 
    [keywordsExpression enumerateMatchesInString:string options:0 range:NSMakeRange(0, string.length) usingBlock:^(NSTextCheckingResult *match, NSMatchingFlags flags, BOOL *stop){ 
    [keywordIndexesMutable addIndexesInRange:match.range]; 
    }]; 

    keywordIndexes = [keywordIndexesMutable copy]; 
} 

- (NSRange)findNextKeyword:(NSString *)string inRange:(NSRange)range 
{ 
    NSUInteger foundKeywordMax = (foundCharacterSetRange.location == NSNotFound) ? string.length : foundCharacterSetRange.location; 
    NSRange foundKeywordRange = NSMakeRange(NSNotFound, 0); 
    for (NSUInteger index = startingAt; index < foundKeywordMax; index++) { 
    if ([keywordIndexes containsIndex:index]) { 
     if (foundKeywordRange.location == NSNotFound) { 
     foundKeywordRange.location = index; 
     foundKeywordRange.length = 1; 
     } else { 
     foundKeywordRange.length++; 
     } 
    } else { 
     if (foundKeywordRange.location != NSNotFound) { 
     break; 
     } 
    } 
    } 

    return foundKeywordRange; 
} 

@end

這似乎工作得很好，並且性能已達到我想要的範圍。我想再等一會兒，看看在接受這個之前是否還有更多的建議。

來源

2011-11-18 Abhi Beckert

只是一個側面說明 - 您的'findNextKeyword：inRange：'方法不應該接受* NSRange指針*。 –

謝謝@JacobRelkin，我剛剛編輯過。這不是我的實際代碼，它只是儀器告訴我的部分太慢（我CPU時間的95％） –

由於您需要關鍵字以及它們的範圍，我會使用enumerateMatchesInString:options:range:usingBlock:並實現一個塊，將關鍵字作爲關鍵字並將範圍作爲值添加到NSMutableDictionary。

因此，在該調用之後，您只有一個調用整個字符串和所有關鍵字的範圍在字典中。

來源

2011-11-18 09:34:45 tobiasbayer

聽起來像是一個好主意，但我仍然需要一次訪問一個搜索結果我在這裏發佈的'-findNextKeyword：inRange：'方法。您能否想到一種將搜索結果按範圍進行索引的好方法，以便用這種方法查找它們？該方法正在被用於實時解析源代碼的文本編輯器（我可能會在未來幾周內開放源代碼）。 –

我將嘗試在與任何關鍵字匹配的字符串中創建每個字符的NSIndexSet。 –

我調整了我的答案。 – tobiasbayer

如果將搜索切換到某些NSLiteral搜索而不區分大小寫，會發生什麼情況？

當然，您需要不區分大小寫，但如果速度更快，它會給您提供建議。

最大速度將是更多的工作，並涉及較低的C字符串和strstr（），我敢打賭。

來源

2011-11-18 11:14:27

我需要它不區分大小寫。我意識到C代碼可能會更快，但我對C自己寫的代碼感到不舒服。 –

與其組裝正則表達式以匹配所有關鍵字，我建議您使用一個非常基本的正則表達式來匹配任何單詞，然後在包含關鍵字的字典中查找匹配的單詞;如果該單詞不在那裏，請忽略它繼續。

您可以根據您瞭解關鍵字的最大效率來定製正則表達式。例如，如果您知道您只查找三到十二個小寫ASCII字母的單詞，則可以使用@"\\b[a-z]{3,12}+\\b"。與你的怪物正則表達式相比，有幾十個選擇。

我已經在我自己的語法高亮項目中取得了巨大的成功。這是用Java編寫的，但快速查看NSRegularExpression文檔後，出現了非常相似的功能集。

來源

2011-11-18 11:15:36

我不認爲這將適合我的情況，因爲絕大多數單詞不會是關鍵字。但我會記住它。 –

這是否包含字符串文字，註釋等內部的單詞？這個正則表達式應該永遠不會看到這些。 –

我同意我的正則表達式可能慢一點，但下一步你建議，檢查文件中的每個單詞是否在'NSDictionary'中，肯定會很慢？我會嘗試一下。 –

在一個大的NSString中有效地找到許多關鍵字中的第一個

回答

相關問題