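Tokenize NSString in Objective-C in two passes

I don't have much experience with Objective-C, so sorry if this is really obvious.

What I need is to split an NSString into tokens. Tokens are delimited by whitespace or by another symbol (anything that is not a letter). The catch is that I want to keep the delimiters, unless they are whitespace.

Example phrase: "a b c,d's, e f." From this I want to get: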

"a" 
"b" 
"c" 
"," 
"d" 
"'" 
"s" 
"," 
"e" 
"f" 
"."

With this code:

NSMutableCharacterSet *separators = [NSMutableCharacterSet punctuationCharacterSet]; 
[separators formUnionWithCharacterSet:[NSCharacterSet whitespaceAndNewlineCharacterSet]]; 

NSArray *parse_array = [intext componentsSeparatedByCharactersInSet:separators];

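I only get the letters. If I filter on whitespace and newlines only, I get letters and symbols joined together. What I need is to run two passes in sequence (first whitespace and newlines, then punctuation), but I really don't know how to do that in Objective-C. Can anyone give me a hint?

Thanks!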
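The two-pass idea described in the question could be sketched roughly as follows. This is only an illustration, not code from the thread: the function name TwoPassTokens is hypothetical, and NSScanner is swapped in for the second pass as one possible way to keep each punctuation mark as its own token.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (the name is not from the original post).
static NSArray *TwoPassTokens(NSString *text) {
    NSCharacterSet *punctuation = [NSCharacterSet punctuationCharacterSet];
    NSMutableArray *tokens = [NSMutableArray array];

    // Pass 1: split on whitespace and newlines. Adjacent separators
    // produce empty strings, which are skipped below.
    NSArray *chunks = [text componentsSeparatedByCharactersInSet:
                       [NSCharacterSet whitespaceAndNewlineCharacterSet]];

    for (NSString *chunk in chunks) {
        if ([chunk length] == 0) continue;

        // Pass 2: walk the chunk, alternating between a run of
        // non-punctuation characters and a single punctuation character.
        NSScanner *scanner = [NSScanner scannerWithString:chunk];
        [scanner setCharactersToBeSkipped:nil];
        while (![scanner isAtEnd]) {
            NSString *word = nil;
            if ([scanner scanUpToCharactersFromSet:punctuation intoString:&word]) {
                [tokens addObject:word];
            }
            if (![scanner isAtEnd]) {
                // The scanner stopped on a punctuation character: keep it as a token.
                NSUInteger location = [scanner scanLocation];
                [tokens addObject:[chunk substringWithRange:NSMakeRange(location, 1)]];
                [scanner setScanLocation:location + 1];
            }
        }
    }
    return tokens;
}
```

Calling TwoPassTokens(@"a b c,d's, e f.") should yield the eleven tokens listed in the question, in the same order.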

2011-06-27 Miguel E

Well, you could do something like this to strip all the whitespace from a string:

NSArray * t = [string componentsSeparatedByCharactersInSet:[NSCharacterSet whitespaceCharacterSet]]; 
string = [t componentsJoinedByString:@""];

Then you can simply iterate over the characters and turn each one into an NSString:

NSMutableArray *tokens = [NSMutableArray array]; 
for (NSUInteger i = 0; i < [string length]; ++i) { 
    unichar character = [string characterAtIndex:i]; 
    NSString *token = [NSString stringWithFormat:@"%C", character]; 
    [tokens addObject:token]; 
} 
NSLog(@"%@", tokens);

Or, if you don't want to strip the whitespace beforehand, you can skip it inside the loop:

NSMutableArray *tokens = [NSMutableArray array]; 
for (NSUInteger i = 0; i < [string length]; ++i) { 
    unichar character = [string characterAtIndex:i]; 
    if ([[NSCharacterSet whitespaceCharacterSet] characterIsMember:character]) { 
        continue; 
    } 
    NSString *token = [NSString stringWithFormat:@"%C", character]; 
    [tokens addObject:token]; 
} 
NSLog(@"%@", tokens);

2011-06-27 18:32:15

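Sorry for misleading you: my example phrase only contains single letters, but the goal is to use this to parse words. I'll add some buffering and adapt the solution. Thanks!

I got it working with this code. It works for single letters as well as whole words: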

// Parse the phrase into tokens. Punctuation marks are tokenized too.
NSMutableArray *tokens = [NSMutableArray array];
NSInteger last_word_start = -1; // start index of the word in progress, or -1 if none

for (NSUInteger i = 0; i < [myPhrase length]; ++i)
{
    unichar character = [myPhrase characterAtIndex:i];
    if ([[NSCharacterSet whitespaceCharacterSet] characterIsMember:character])
    {
        // Whitespace ends the word in progress and is itself discarded.
        if (last_word_start >= 0)
            [tokens addObject:[myPhrase substringWithRange:NSMakeRange(last_word_start, i - last_word_start)]];
        last_word_start = -1;
    }
    else
    {
        if ([[NSCharacterSet punctuationCharacterSet] characterIsMember:character])
        {
            // Punctuation ends the word in progress and becomes its own token.
            if (last_word_start >= 0)
                [tokens addObject:[myPhrase substringWithRange:NSMakeRange(last_word_start, i - last_word_start)]];
            [tokens addObject:[NSString stringWithFormat:@"%C", character]];
            last_word_start = -1;
        }
        else
        {
            // Any other character starts a word or continues the current one.
            if (last_word_start == -1)
                last_word_start = i;
        }
    }
}
// Save the pending word, if any.
if (last_word_start >= 0)
    [tokens addObject:[myPhrase substringWithRange:NSMakeRange(last_word_start, [myPhrase length] - last_word_start)]];
NSLog(@"Tokens for phrase '%@':", myPhrase);
NSLog(@"%@", tokens);

Thanks!
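For reuse, the loop above could be packaged as a function along these lines. This is a sketch under a few assumptions rather than the poster's exact code: the name TokensFromPhrase is hypothetical, newlines are treated as separators as well (as in the question's character set), and the end of the string is handled like a trailing space so that the last word is flushed without a separate step after the loop.

```objc
#import <Foundation/Foundation.h>

// Hypothetical wrapper (the name is not from the original post).
static NSArray *TokensFromPhrase(NSString *phrase) {
    // Assumption: newlines separate words too, as in the question's character set.
    NSCharacterSet *whitespace = [NSCharacterSet whitespaceAndNewlineCharacterSet];
    NSCharacterSet *punctuation = [NSCharacterSet punctuationCharacterSet];
    NSMutableArray *tokens = [NSMutableArray array];
    NSUInteger length = [phrase length];
    NSUInteger wordStart = NSNotFound; // NSNotFound means no word is in progress

    for (NSUInteger i = 0; i <= length; ++i) {
        // Treat the end of the string like a trailing space so the last word is flushed.
        unichar c = (i < length) ? [phrase characterAtIndex:i] : ' ';
        BOOL isPunctuation = [punctuation characterIsMember:c];
        if (isPunctuation || [whitespace characterIsMember:c]) {
            // Any separator closes the word in progress.
            if (wordStart != NSNotFound) {
                [tokens addObject:[phrase substringWithRange:NSMakeRange(wordStart, i - wordStart)]];
                wordStart = NSNotFound;
            }
            // Punctuation is kept as a token; whitespace is dropped.
            if (isPunctuation) {
                [tokens addObject:[NSString stringWithCharacters:&c length:1]];
            }
        } else if (wordStart == NSNotFound) {
            wordStart = i; // first character of a new word
        }
    }
    return tokens;
}
```

With this sketch, TokensFromPhrase(@"Hello, world! It's fine.") should give "Hello", ",", "world", "!", "It", "'", "s", "fine", ".".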

2011-06-28 09:57:18

Take a look at my open-source Cocoa string tokenization/parsing toolkit, ParseKit:

http://parsekit.com

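ParseKit includes a very powerful and flexible tokenizer class: PKTokenizer. By default, PKTokenizer silently consumes whitespace tokens without reporting them. (That is what you want in this case, but the behavior is configurable if you don't.)

Here is how you could use PKTokenizer for this particular task: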

// create the tokenizer with your string 
NSString *inStr = @"a b c,d's, e f."; 
PKTokenizer *t = [PKTokenizer tokenizerWithString:inStr]; 

// configure the tokenizer to not allow apostrophes inside words (allowing them is the default) 
[t.wordState setWordChars:NO from:'\'' to:'\'']; 

// loop through the input and collect the string value of each non-whitespace token 
PKToken *eof = [PKToken EOFToken]; 
PKToken *tok = nil; 

NSMutableArray *outStrs = [NSMutableArray array]; 
while ((tok = [t nextToken]) != eof) { 
    [outStrs addObject:tok.stringValue]; 
}

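outStrs contains:

"a" "b" "c" "," "d" "'" "s" "," "e" "f" "."

For this particular task, ParseKit may be overkill. But if you have several similar tasks, it may be worth checking out, as it could save you time and pain.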

2011-10-13 16:36:56
