2014-08-28 64 views
1

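Converting a multi-byte Unicode byte array to an NSString using a partial buffer

In Objective-C, is there a way to convert a multi-byte Unicode byte array into an NSString even when the array is a partial buffer, that is, one that does not end on a complete character boundary?

This comes up when byte buffers arrive from a stream and you want to parse the string version of the current buffer, but more data is still to come and the buffer ends in the middle of a multi-byte Unicode character.

NSString's initWithData:encoding: method is not suitable for this purpose, as shown below...

Test code: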

- (void)test {
    // "foo×bar" in UTF-8; '×' (U+00D7) is the two-byte sequence 0xC3 0x97.
    char myArray[] = {'f', 'o', 'o', (char) 0xc3, (char) 0x97, 'b', 'a', 'r'};
    size_t sizeOfMyArray = sizeof(myArray);
    [self dump:myArray sizeOfMyArray:sizeOfMyArray];
    [self dump:myArray sizeOfMyArray:sizeOfMyArray - 1];
    [self dump:myArray sizeOfMyArray:sizeOfMyArray - 2];
    [self dump:myArray sizeOfMyArray:sizeOfMyArray - 3];
    [self dump:myArray sizeOfMyArray:sizeOfMyArray - 4];
    [self dump:myArray sizeOfMyArray:sizeOfMyArray - 5];
}

- (void)dump:(char[])myArray sizeOfMyArray:(size_t)sourceLength {
    NSString *string = [[NSString alloc] initWithData:[NSData dataWithBytes:myArray length:sourceLength] encoding:NSUTF8StringEncoding];
    // Cast both lengths so that %lu matches on 32-bit and 64-bit targets.
    // Note that string.length counts UTF-16 code units, not bytes.
    NSLog(@"sourceLength: %lu bytes, string.length: %lu bytes, string :'%@'", (unsigned long)sourceLength, (unsigned long)string.length, string);
}

Output:

sourceLength: 8 bytes, string.length: 7 bytes, string :'foo×bar' 
sourceLength: 7 bytes, string.length: 6 bytes, string :'foo×ba' 
sourceLength: 6 bytes, string.length: 5 bytes, string :'foo×b' 
sourceLength: 5 bytes, string.length: 4 bytes, string :'foo×' 
sourceLength: 4 bytes, string.length: 0 bytes, string :'(null)' 
sourceLength: 3 bytes, string.length: 3 bytes, string :'foo' 

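As you can see, converting the byte array with "sourceLength: 4 bytes" fails and returns (null). This is because the UTF-8 character '×' (0xc3 0x97) is only partially included.

Ideally there would be a function I could use that returns the correct NSString and tells me how many bytes are "left over".

Answers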

2

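You largely have your own answer. If the initWithData:encoding: method returns nil, then you know there is a partial (invalid) character at the end of the buffer.

Change dump to return an int. Then try to create the NSString in a loop. Each time you get nil, shorten the length and try again. Once you get a valid NSString, return the difference between the length you were passed and the length you used.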
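
Since a UTF-8 character is at most four bytes long, an incomplete trailing character accounts for at most three bytes, so the retry loop needs no more than four attempts when the rest of the buffer is valid. Here is a minimal sketch of that bounded loop in plain C (which also compiles as Objective-C). The names are hypothetical: `utf8_is_valid` stands in for the nil check on `initWithData:encoding:` and only checks lead/continuation structure, so unlike a real decoder it does not reject overlong forms or surrogates.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for "initWithData:encoding: returned non-nil".
   Checks only lead/continuation structure, not overlongs or surrogates. */
static bool utf8_is_valid(const unsigned char *buf, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char c = buf[i];
        size_t need;
        if (c <= 0x7f)                   need = 1;
        else if (c >= 0xc2 && c <= 0xdf) need = 2;
        else if (c >= 0xe0 && c <= 0xef) need = 3;
        else if (c >= 0xf0 && c <= 0xf4) need = 4;
        else return false;                /* stray continuation or invalid byte */
        if (need > len - i) return false; /* sequence cut off by end of buffer */
        for (size_t k = 1; k < need; k++) {
            if ((buf[i + k] & 0xc0) != 0x80) return false;
        }
        i += need;
    }
    return true;
}

/* Returns how many trailing bytes must be held back (0..3) so that the
   rest validates, or -1 if dropping up to 3 bytes does not help. */
static int utf8_leftover_by_retry(const unsigned char *buf, size_t len)
{
    for (size_t drop = 0; drop <= 3 && drop <= len; drop++) {
        if (utf8_is_valid(buf, len - drop)) return (int)drop;
    }
    return -1;
}
```

Each attempt still scans the whole buffer, so this caps the cost at four passes rather than removing it.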

+0

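Thanks... I've upvoted your idea. What you say is correct, but in some cases the performance is very poor. Performance is critical here, because I may be processing gigabytes of data. – TJez 2014-08-28 16:12:22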

0

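Here is my inefficient implementation; I don't consider it a proper answer. I'll leave it here in case someone else finds it useful (in the hope that someone will post a better answer than this!)

This is in a category on NSMutableData...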

/**
 * Removes the longest possible string from this NSMutableData, leaving any trailing bytes of a partial character behind.
 *
 * NOTE: This is a very inefficient implementation; it may parse the complete NSMutableData buffer multiple times.
 * It is especially inefficient when the data buffer does not contain a valid string encoding, as all lengths will be
 * attempted.
 */
- (NSString *)removeMaximumStringUsingEncoding:(NSStringEncoding)encoding {
    if (self.length > 0) {
        // Quick test for the case where the whole buffer can be used (the common case; needs no NSData manipulation).
        NSString *result = [[NSString alloc] initWithData:self encoding:encoding];
        if (result != nil) {
            self.length = 0; // Simple case, we used the whole buffer.
            return result;
        }

        // Try to find the largest subdata that is a valid string.
        for (NSUInteger subDataLength = self.length - 1; subDataLength > 0; subDataLength--) {
            NSRange subDataRange = NSMakeRange(0, subDataLength);
            result = [[NSString alloc] initWithData:[self subdataWithRange:subDataRange] encoding:encoding];
            if (result != nil) {
                // Delete the bytes we used from our buffer, leave the remainder.
                [self replaceBytesInRange:subDataRange withBytes:NULL length:0];
                return result;
            }
        }
    }
    // Nothing could be decoded; the buffer is left unchanged.
    return @"";
}
2

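I ran into this problem a while ago and then forgot about it, so this was a good opportunity. The code below was written using the information on the UTF-8 page on Wikipedia. It is a category on NSData.

It checks the data from the end, looking only at the last four bytes, because the OP said there could be gigabytes of data. Otherwise it would be simpler to walk through the bytes from the beginning as UTF-8.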
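
The same tail-only idea can be sketched over a raw byte buffer in plain C (which also compiles as Objective-C). This is an illustrative variant rather than the category method from this answer: the name `utf8_complete_prefix_length` is hypothetical, it uses bit masks instead of range checks, and when the tail is malformed (for example, no lead byte among the last four bytes) it returns the full length and leaves the rejection to the decoder.

```c
#include <stddef.h>

/* Length of the longest prefix of buf that does not end in the middle of a
   UTF-8 character. Only the last 1..4 bytes are inspected; everything before
   them is assumed (not checked) to be valid UTF-8. */
static size_t utf8_complete_prefix_length(const unsigned char *buf, size_t len)
{
    size_t seen = 0;                       /* bytes examined from the end */
    while (seen < len && seen < 4) {
        unsigned char c = buf[len - 1 - seen];
        seen++;
        if ((c & 0xc0) == 0x80) continue;  /* 10xxxxxx: continuation byte */

        size_t need;                       /* length announced by the lead byte */
        if      ((c & 0x80) == 0x00) need = 1;  /* 0xxxxxxx */
        else if ((c & 0xe0) == 0xc0) need = 2;  /* 110xxxxx */
        else if ((c & 0xf0) == 0xe0) need = 3;  /* 1110xxxx */
        else if ((c & 0xf8) == 0xf0) need = 4;  /* 11110xxx */
        else return len;                   /* not a lead byte: leave it to the decoder */

        /* Hold back the last character only if it is still incomplete. */
        return (seen < need) ? len - seen : len;
    }
    return len;  /* empty buffer, or no lead byte among the last four bytes */
}
```

The work is constant per buffer regardless of its size, which is the point of scanning from the end.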

/*
Returns the range of valid UTF-8 encoded text, excluding any
partial multi-byte character at the end.
It assumes that all the bytes are valid UTF-8 encoded characters,
e.g. it doesn't raise a flag if a continuation byte is preceded
by a single-byte character.
*/
- (NSRange)rangeOfUTF8WithoutPartialTrailingMultibytes
{
    NSRange validRange = {0, 0};

    // Only the last four bytes (at most) need to be inspected.
    NSUInteger trailLength = MIN([self length], 4U);
    unsigned char trail[4];
    [self getBytes:trail
             range:NSMakeRange([self length] - trailLength, trailLength)];

    unsigned multibyteCount = 0;

    // Walk backwards from the last byte until a single byte or a start byte is found.
    for (NSInteger i = (NSInteger)trailLength - 1; i >= 0; i--) {
        if (isUTF8SingleByte(trail[i])) {
            validRange = NSMakeRange(0, [self length] - trailLength + i + 1);
            break;
        }

        if (isUTF8ContinuationByte(trail[i])) {
            multibyteCount++;
            continue;
        }

        if (isUTF8StartByte(trail[i])) {
            multibyteCount++;
            if (multibyteCount == lengthForUTF8StartByte(trail[i])) {
                // The trailing multi-byte character is complete: keep it.
                validRange = NSMakeRange(0, [self length] - trailLength + i + multibyteCount);
            }
            else {
                // The trailing character is partial: stop just before its start byte.
                validRange = NSMakeRange(0, [self length] - trailLength + i);
            }
            break;
        }
    }
    return validRange;
}

Here are the static functions used by the method:

static BOOL isUTF8SingleByte(const unsigned char c) 
{ 
    return c <= 0x7f; 
} 

static BOOL isUTF8ContinuationByte(const unsigned char c) 
{ 
    return (c >= 0x80) && (c <= 0xbf); 
} 

static BOOL isUTF8StartByte(const unsigned char c) 
{ 
    return (c >= 0xc2) && (c <= 0xf4); 
} 

static BOOL isUTF8InvalidByte(const unsigned char c) 
{ 
    return (c == 0xc0) || (c == 0xc1) || (c > 0xf4); 
} 

static unsigned lengthForUTF8StartByte(const unsigned char c) 
{ 
    if ((c >= 0xc2) && (c <= 0xdf)) { 
     return 2; 
    } 
    else if ((c >= 0xe0) && (c <= 0xef)) { 
     return 3; 
    } 
    else if ((c >= 0xf0) && (c <= 0xf4)) { 
     return 4; 
    } 
    return 1; 
}
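
To tie this back to the streaming case in the question, here is a rough sketch in plain C (also valid in Objective-C) of carrying the held-back bytes over to the next chunk. `Utf8Carry`, `utf8_pending_tail` and `utf8_feed` are hypothetical names introduced for illustration. The sketch assumes the stream is otherwise valid UTF-8 and does not reject invalid lead bytes itself.

```c
#include <stddef.h>
#include <string.h>

/* Number of bytes at the end of buf (0..3) that begin a UTF-8 character
   which is not complete yet. Earlier bytes are assumed valid, and invalid
   lead bytes are not rejected here. */
static size_t utf8_pending_tail(const unsigned char *buf, size_t len)
{
    for (size_t seen = 1; seen <= len && seen <= 4; seen++) {
        unsigned char c = buf[len - seen];
        if ((c & 0xc0) == 0x80) continue;  /* continuation byte, keep looking */
        size_t need = (c >= 0xf0) ? 4 : (c >= 0xe0) ? 3 : (c >= 0xc0) ? 2 : 1;
        return (seen < need) ? seen : 0;
    }
    return 0;
}

typedef struct {
    unsigned char pending[3];  /* incomplete character carried over */
    size_t pending_len;
} Utf8Carry;

/* Joins the carried-over bytes with a new chunk, writes the bytes that end
   on a character boundary to out (NUL-terminated), and keeps the incomplete
   tail in carry. out must hold at least chunk_len + 4 bytes.
   Returns the number of bytes written, not counting the terminator. */
static size_t utf8_feed(Utf8Carry *carry, const unsigned char *chunk,
                        size_t chunk_len, char *out)
{
    size_t total = carry->pending_len + chunk_len;
    memcpy(out, carry->pending, carry->pending_len);
    memcpy(out + carry->pending_len, chunk, chunk_len);

    size_t tail = utf8_pending_tail((const unsigned char *)out, total);
    size_t complete = total - tail;
    memcpy(carry->pending, out + complete, tail);
    carry->pending_len = tail;
    out[complete] = '\0';
    return complete;
}
```

Feeding the chunks {'f', 'o', 'o', 0xc3} and {0x97, 'b', 'a', 'r'} in turn yields "foo" with one byte pending, then the five bytes of "×bar" with nothing pending. In Objective-C, the completed bytes could then be passed to initWithBytes:length:encoding: to build the NSString.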