Decoding partial UTF-8 into an NSString

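When fetching a UTF-8-encoded file over the network with the NSURLConnection class, there is a good chance that the delegate's connection:didReceiveData: message will be sent with an NSData that cuts the UTF-8 file at an arbitrary point, because UTF-8 is a multi-byte encoding scheme and a single character can be split across two separate NSData objects.

In other words, if I join all the data I get from connection:didReceiveData:, I will have a valid UTF-8 file, but each individual chunk is not necessarily valid UTF-8.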

I do not want to hold the whole downloaded file in memory.

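What I want is this: given an NSData, decode as much of it as possible into an NSString. If the last few bytes of the NSData are an incomplete multi-byte sequence, tell me, so that I can save them for the next NSData.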

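One obvious solution is to try decoding repeatedly with initWithData:encoding:, dropping the last byte each time, until it succeeds. Unfortunately, this can be very wasteful.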


2012-06-06 user1071136

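If you want to make sure you do not stop in the middle of a UTF-8 multi-byte sequence, you need to look at the end of the byte array and check the top 2 bits of the last byte.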

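If the top bit is 0, the byte is one of the ASCII-style single-byte UTF-8 codes, and you are done.
If the top bit is 1 and the second-highest bit is 0, the byte is a continuation byte of a multi-byte sequence and may be the last byte of that sequence, so you need to set it aside and look at the byte before it.
If the top bit is 1 and the second-highest bit is also 1, the byte is the start of a multi-byte sequence, and you need to work out how many bytes the sequence has by counting the 1 bits before the first 0 bit.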
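
The three cases above can be sketched in plain C, which also compiles as Objective-C. This is only an illustration: the names utf8_byte_kind, utf8_classify and utf8_sequence_length are hypothetical and do not come from the answer, and the sketch assumes modern UTF-8, where a sequence is at most 4 bytes long.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names, for illustration only. */
typedef enum {
    UTF8_ASCII,         /* 0xxxxxxx: a complete single-byte character */
    UTF8_CONTINUATION,  /* 10xxxxxx: middle or end of a multi-byte sequence */
    UTF8_LEAD           /* 11xxxxxx: first byte of a multi-byte sequence */
} utf8_byte_kind;

/* Classify a byte by looking only at its top two bits. */
static utf8_byte_kind utf8_classify(unsigned char b)
{
    if ((b & 0x80) == 0x00) return UTF8_ASCII;
    if ((b & 0xC0) == 0x80) return UTF8_CONTINUATION;
    return UTF8_LEAD;
}

/* Total length of the sequence announced by a lead byte, found by counting
   the 1 bits before the first 0 bit. Returns 0 for lead-like bytes that are
   not valid in modern UTF-8 (0xF8 and above). */
static size_t utf8_sequence_length(unsigned char lead)
{
    assert(utf8_classify(lead) == UTF8_LEAD);  /* caller must pass a lead byte */
    if ((lead & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    return 0;
}
```

Note the parentheses around each mask: in C and Objective-C, == binds more tightly than &, so writing b & 0x80 == 0 would not test the top bit.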

Take a look at the multi-byte table in the Wikipedia entry: http://en.wikipedia.org/wiki/UTF-8

// assumes that receivedData contains both the leftovers and the new data 

const unsigned char *data = [receivedData bytes]; 
NSUInteger byteCount = [receivedData length]; 

if (byteCount<1) 
    return nil; // or @""; 

unsigned char lastByte = data[byteCount-1]; 
if ((lastByte & 0x80) == 0) { // parentheses needed: == binds tighter than & 
    NSString *newString = [[NSString alloc] initWithBytes: data length: byteCount 
            encoding: NSUTF8StringEncoding]; 
    // verify success 
    // remove bytes from mutable receivedData, or set overflow to empty 
    return newString; 
} 

// now eat all of the continuation bytes 
NSUInteger backCount=0; 
while ((byteCount > 0) && ((lastByte & 0xc0) == 0x80)) { 
    backCount++; 
    byteCount--; 
    if (byteCount > 0) // avoid reading data[-1] when every byte is a continuation byte 
        lastByte = data[byteCount-1]; 
} 
// at this point, either we have exhausted byteCount or we have the initial character 
// if we exhaust the byte count we're probably in an illegal sequence, as we should 
// always have the initial character in the receivedData 

if (byteCount<1) { 
    // error! 
    return nil; 
} 

// at this point, you can either use just byteCount, or you can compute the 
// length of the sequence from the lastByte in order 
// to determine if you have exactly the right number of characters to decode UTF-8. 

NSUInteger requiredBytes = 0; 
if ((lastByte & 0xe0) == 0xc0) { // 110xxxxx 
    // 2 byte sequence 
    requiredBytes= 1; 
} else if ((lastByte & 0xf0) == 0xe0) { // 1110xxxx 
    // 3 byte sequence 
    requiredBytes= 2; 
} else if ((lastByte & 0xf8) == 0xf0) { // 11110xxx 
    // 4 byte sequence 
    requiredBytes= 3; 
} else { 
    // shouldn't happen, illegal UTF8 seq: 5 and 6 byte forms (0xf8 and above) 
    // are not valid UTF-8 since RFC 3629 
    return nil; 
} 

// now we know how many characters we need and we know how many 
// (backCount) we have, so either use them, or take the 
// introductory character away. 
if (requiredBytes==backCount) { 
    // we have the right number of bytes 
    byteCount += backCount; 
} else { 
    // we don't have the right number of bytes, so remove the intro character 
    byteCount -= 1; 
} 

NSString *newString = [[NSString alloc] initWithBytes: data length: byteCount 
           encoding: NSUTF8StringEncoding]; 
// verify success 
// remove byteCount bytes from mutable receivedData, or set overflow to the 
// bytes between byteCount and [receivedData length] 
return newString;


2012-06-06 11:40:32 gaige

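Thanks for the answer! (In the meantime I realized this is the solution.) I believe the code itself has a few flaws: (1) I think 'data[byteCount]' is out of bounds, and (2) 'data' does not necessarily contain the start of the sequence. I suggest fixing it, for posterity. – user1071136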

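I fixed the bug, oops. As for the possibility of the data not containing the start of the sequence: that should not happen unless the data is in error (and there is a check for that), or unless you forgot to put the previous leftover bytes at the start of receivedData before calling this function (as the comment at the top says). So the data should never contain the end of a sequence without its start. – gaige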

Agreed, my mistake. Thanks again! – user1071136

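UTF-8 is a very simple encoding to parse. It was designed so that you can easily detect an incomplete sequence and, if you start in the middle of a sequence, find its beginning.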

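Search backward from the end for a byte that is either <= 0x7f or >= 0xc0. If it is <= 0x7f, it is complete on its own. If it is between 0xc0 and 0xdf (inclusive), it needs one following byte to be complete. If it is between 0xe0 and 0xef, it needs two following bytes. If it is >= 0xf0, it needs three following bytes.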
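
A minimal sketch of that backward search in plain C (usable from Objective-C as well) follows. The function name utf8_complete_prefix_length is hypothetical. It only decides where to cut the buffer and leaves real validation to the decoder, and it assumes that any leftover bytes from the previous chunk have already been placed at the front of the buffer.

```c
#include <assert.h>
#include <stddef.h>

/* Returns how many leading bytes of data[0..len) can be decoded now without
   cutting a multi-byte sequence in half. The remaining len - result bytes
   should be kept and put in front of the next chunk.
   Hypothetical helper: it does not validate the rest of the buffer. */
size_t utf8_complete_prefix_length(const unsigned char *data, size_t len)
{
    assert(data != NULL || len == 0);

    /* Walk back over at most three continuation bytes (10xxxxxx). */
    size_t i = len;
    size_t trailing = 0;
    while (i > 0 && trailing < 3 && (data[i - 1] & 0xC0) == 0x80) {
        i--;
        trailing++;
    }
    if (i == 0)
        return len;  /* empty, or no lead byte at all: let the decoder decide */

    unsigned char lead = data[i - 1];
    size_t needed;   /* continuation bytes this lead byte requires */
    if (lead <= 0x7F)      needed = 0;
    else if (lead >= 0xF0) needed = 3;
    else if (lead >= 0xE0) needed = 2;
    else if (lead >= 0xC0) needed = 1;
    else                   return len;  /* fourth continuation byte: malformed */

    /* Incomplete sequence at the end: stop just before its lead byte. */
    if (trailing < needed)
        return i - 1;
    return len;
}
```

With an NSData, one might pass [receivedData bytes] and [receivedData length], decode the returned number of bytes with initWithBytes:length:encoding:, and carry the remainder over to the next connection:didReceiveData: call. Because the search looks at no more than four bytes, it avoids the repeated decode attempts of the truncate-and-retry approach.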


2012-06-06 11:21:13

I had a similar problem, with UTF-8 being only partially decoded. Before:

NSString * adsTopic = [components[2] stringByTrimmingCharactersInSet:[NSCharacterSet whitespaceAndNewlineCharacterSet]]; 
    adsInfo->adsTopic = malloc(sizeof(char) * adsTopic.length + 1); 
    strncpy(adsInfo->adsTopic, [adsTopic UTF8String], adsTopic.length + 1);

After (solved):

NSString *adsTopic = [components[2] stringByTrimmingCharactersInSet:[NSCharacterSet whitespaceAndNewlineCharacterSet]]; 
    NSUInteger byteCount = [adsTopic lengthOfBytesUsingEncoding:NSUTF8StringEncoding]; 
    NSLog(@"number of UTF-8 bytes in the string topic == %lu",(unsigned long)byteCount); 

    adsInfo->adsTopic = malloc(byteCount+1); 
    strncpy(adsInfo->adsTopic, [adsTopic UTF8String], byteCount + 1); 

    NSString *text=[NSString stringWithCString:adsInfo->adsTopic encoding:NSUTF8StringEncoding]; 
       NSLog(@"=== %@", text);


2016-01-26 15:04:17 dip
