2013-02-10

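How can I read UTF-8 from an NSInputStream?

I am trying to read a large file on iOS with NSInputStream and split it into lines at newline characters (I don't want to use componentsSeparatedByCharactersInSet, because it uses too much memory).

However, since not all lines appear to be UTF-8 encoded (they may be plain ASCII, which uses the same bytes), I often get the warning: Incorrect NSStringEncoding value 0x0000 detected. Assuming NSASCIIStringEncoding. Will stop this compatiblity mapping behavior in the near future.

My question: is there a way to suppress this warning, for example by setting a compiler flag?

Also: is it safe to concatenate the results of two buffer reads? That is, if I read from the byte stream, convert each buffer to a string, and then append the strings, can the resulting string be corrupted?

The example method below shows that the string conversion discards both the first half and the second half of a UTF-8 character, because each half is invalid on its own.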

- (void)NSInputStreamTest { 
    uint8_t testString[] = {0xd0, 0x91}; // @"Б" 

    // Test 1: Read max 1 byte at a time of UTF-8 string 
    uint8_t buf1[1], buf2[1]; 
    NSString *s1, *s2, *s3; 
    NSInteger c1, c2; 
    NSInputStream *inStream = [[NSInputStream alloc] initWithData:[[NSData alloc] initWithBytes:testString length:2]]; 

    [inStream open]; 
    c1 = [inStream read:buf1 maxLength:1]; 
    s1 = [[NSString alloc] initWithBytes:buf1 length:1 encoding:NSUTF8StringEncoding]; 
    NSLog(@"Test 1: Read %ld byte(s): %@", (long)c1, s1); // NSInteger needs %ld plus a cast
    c2 = [inStream read:buf2 maxLength:1]; 
    s2 = [[NSString alloc] initWithBytes:buf2 length:1 encoding:NSUTF8StringEncoding]; 
    NSLog(@"Test 1: Read %ld byte(s): %@", (long)c2, s2); // NSInteger needs %ld plus a cast
    s3 = [s1 stringByAppendingString:s2]; 
    NSLog(@"Test 1: Concatenated: %@", s3); 
    [inStream close]; 

    // Test 2: Read max 2 bytes at a time of UTF-8 string 
    uint8_t buf4[2]; 
    NSString *s4; 
    NSInteger c4; 
    NSInputStream *inStream2 = [[NSInputStream alloc] initWithData:[[NSData alloc] initWithBytes:testString length:2]]; 

    [inStream2 open]; 
    c4 = [inStream2 read:buf4 maxLength:2]; 
    s4 = [[NSString alloc] initWithBytes:buf4 length:2 encoding:NSUTF8StringEncoding]; 
    NSLog(@"Test 2: Read %ld byte(s): %@", (long)c4, s4); // NSInteger needs %ld plus a cast
    [inStream2 close]; 
} 

Output:

2013-02-10 21:16:23.412 Test[11144:c07] Test 1: Read 1 byte(s): (null) 
2013-02-10 21:16:23.413 Test[11144:c07] Test 1: Read 1 byte(s): (null) 
2013-02-10 21:16:23.413 Test[11144:c07] Test 1: Concatenated: (null) 
2013-02-10 21:16:23.413 Test[11144:c07] Test 2: Read 2 byte(s): Б 

So, what is your question? – 2013-02-10 14:27:02


@0x7fffffff I have added the question, sorry. It may also be a duplicate. – Kreisquadratur 2013-02-10 18:35:20

Answers

1

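First, in the line s3 = [s1 stringByAppendingString:s2]; you are trying to concatenate nil values, so the result is nil as well. You probably want to concatenate the bytes rather than the strings: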

uint8_t buf3[2]; 
buf3[0] = buf1[0]; 
buf3[1] = buf2[0]; 
s3 = [[NSString alloc] initWithBytes:buf3 length:2 encoding:NSUTF8StringEncoding]; 

Output:

2015-11-06 12:57:40.304 Test[10803:883182] Test 1: Read 1 byte(s): (null) 
2015-11-06 12:57:40.305 Test[10803:883182] Test 1: Read 1 byte(s): (null) 
2015-11-06 12:57:40.305 Test[10803:883182] Test 1: Concatenated: Б 

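Second, a UTF-8 character can be 1 to 6 bytes long in the original definition of the encoding (RFC 3629 has since restricted valid UTF-8 to at most 4 bytes):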

(1 byte) 0aaa aaaa   // if the code point lies in 0x00 .. 0x7F (ASCII) 
(2 bytes) 110x xxxx 10xx xxxx 
(3 bytes) 1110 xxxx 10xx xxxx 10xx xxxx 
(4 bytes) 1111 0xxx 10xx xxxx 10xx xxxx 10xx xxxx 
(5 bytes) 1111 10xx 10xx xxxx 10xx xxxx 10xx xxxx 10xx xxxx 
(6 bytes) 1111 110x 10xx xxxx 10xx xxxx 10xx xxxx 10xx xxxx 10xx xxxx 
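
The table above can be turned into a small classifier for a single byte. The following is a minimal sketch in plain C (which compiles unchanged inside an Objective-C file); the function names utf8_sequence_length and utf8_is_continuation are hypothetical, and the code follows the 6-byte table rather than the stricter 4-byte limit of RFC 3629.

```c
#include <stdint.h>

/* Expected total length of a UTF-8 sequence, judged from its first byte.
 * Returns 0 for a continuation byte (10xx xxxx) or an invalid lead byte.
 * Hypothetical helper; mirrors the 1..6 byte table, not RFC 3629. */
int utf8_sequence_length(uint8_t lead)
{
    if ((lead & 0x80) == 0x00) return 1; /* 0xxx xxxx */
    if ((lead & 0xE0) == 0xC0) return 2; /* 110x xxxx */
    if ((lead & 0xF0) == 0xE0) return 3; /* 1110 xxxx */
    if ((lead & 0xF8) == 0xF0) return 4; /* 1111 0xxx */
    if ((lead & 0xFC) == 0xF8) return 5; /* 1111 10xx */
    if ((lead & 0xFE) == 0xFC) return 6; /* 1111 110x */
    return 0;
}

/* Non-zero if b is a continuation byte (10xx xxxx). */
int utf8_is_continuation(uint8_t b)
{
    return (b & 0xC0) == 0x80;
}
```

For the test character from the question, 0xd0 is classified as the lead byte of a 2-byte sequence and 0x91 as a continuation byte, which is why neither byte decodes on its own.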

So, if you intend to read raw bytes from an NSInputStream and convert them into a UTF-8 NSString, you may want to read from the NSInputStream byte by byte until you get a valid string:

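#define MAX_UTF8_BYTES 6 
NSString *utf8String = nil; 
NSMutableData *_data = [[NSMutableData alloc] init]; // for easily appending bytes 

int bytes_read = 0; 
while (!utf8String) { 
    if (bytes_read >= MAX_UTF8_BYTES) { 
        NSLog(@"Can't decode input byte array into UTF8."); 
        return; 
    } 
    uint8_t byte[1]; 
    // read:maxLength: returns 0 at end of stream and a negative value on error 
    if ([_inputStream read:byte maxLength:1] != 1) { 
        NSLog(@"Stream ended or failed in the middle of a character."); 
        return; 
    } 
    [_data appendBytes:byte length:1]; 
    // [_data bytes] is not NUL-terminated, so decode with an explicit length; 
    // the result is nil while the byte sequence is still incomplete 
    utf8String = [[NSString alloc] initWithData:_data encoding:NSUTF8StringEncoding]; 
    bytes_read++; 
} 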
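
Reading one byte at a time works, but it can be slow for a large file. An alternative is to read larger chunks and decode only the part of the buffer that ends on a character boundary, carrying the remaining bytes over to the next read. The following is a minimal sketch in plain C under that assumption; utf8_complete_prefix is a hypothetical helper that checks the byte structure only, not full UTF-8 validity.

```c
#include <stddef.h>
#include <stdint.h>

/* Number of leading bytes of buf that end on a character boundary.
 * Any bytes after that point are the start of a sequence whose
 * remaining bytes have not arrived yet; keep them for the next read.
 * Structural check only: malformed input is passed through unchanged
 * so that the real decoder can reject it. */
size_t utf8_complete_prefix(const uint8_t *buf, size_t len)
{
    size_t i = len;
    size_t back = 0;

    /* Step back over at most 5 trailing continuation bytes (10xx xxxx). */
    while (i > 0 && back < 5 && (buf[i - 1] & 0xC0) == 0x80) {
        i--;
        back++;
    }
    if (i == 0)
        return len; /* empty, or no lead byte found */

    uint8_t lead = buf[i - 1];
    size_t need;
    if ((lead & 0x80) == 0x00)      need = 1;
    else if ((lead & 0xE0) == 0xC0) need = 2;
    else if ((lead & 0xF0) == 0xE0) need = 3;
    else if ((lead & 0xF8) == 0xF0) need = 4;
    else if ((lead & 0xFC) == 0xF8) need = 5;
    else if ((lead & 0xFE) == 0xFC) need = 6;
    else return len; /* not a lead byte: leave it to the decoder */

    size_t have = len - (i - 1); /* bytes present of the last sequence */
    return have < need ? i - 1 : len;
}
```

The returned count is what you would pass as the length to initWithBytes:length:encoding:; the bytes beyond it are moved to the front of the buffer before the next read appends to them.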
0

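ASCII (and therefore the newline character) is a subset of UTF-8, so there should not be any conflict.

It should be possible to split your stream at newline characters, just as you would with a plain ASCII stream. You can then convert each chunk ("line") to an NSString using UTF-8.

Are you sure the encoding errors are not real, i.e. that your stream does not actually contain bytes that are invalid in UTF-8?

Edit, following the comments:

This assumes that lines are short enough for a whole line to be held in memory before it is converted from UTF-8.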
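
Splitting on the raw bytes before decoding could look roughly like this. It is a minimal sketch in plain C with hypothetical helper names (first_line_length, drop_front); it assumes the caller appends every chunk read from the stream to one byte buffer and that lines end in '\n'.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Length of the first line in buf, including its '\n',
 * or 0 if buf does not hold a complete line yet. */
size_t first_line_length(const uint8_t *buf, size_t len)
{
    const uint8_t *nl = memchr(buf, '\n', len);
    return nl ? (size_t)(nl - buf) + 1 : 0;
}

/* Removes the first n bytes from buf (n <= len) and returns the new length. */
size_t drop_front(uint8_t *buf, size_t len, size_t n)
{
    memmove(buf, buf + n, len - n);
    return len - n;
}
```

After each read, the caller would loop while first_line_length returns a non-zero value, decode that many bytes as UTF-8, and then call drop_front. Because the byte 0x0A never occurs inside a multi-byte UTF-8 sequence, every line cut this way contains only whole characters, even when a read ends in the middle of one.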


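You are right about the second part: the string does in fact contain invalid characters/bytes. The first part is not trivial, however, because when converting to "NSString" you may split a UTF-8 character and cut it in half. – Kreisquadratur 2013-02-10 20:22:20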


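Actually, a newline character cannot be part of a multi-byte character: UTF-8 guarantees that ASCII bytes are never used inside a multi-byte sequence. A split can still happen, though, because the buffer may be smaller than the string being read. – Kreisquadratur 2013-02-10 22:18:58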


@Kreisquadratur Ah, I was assuming you could read a whole line before decoding. – Monolo 2013-02-11 02:22:14
