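If you want to make sure you don't stop in the middle of a UTF-8 multi-byte sequence, you need to look at the end of the byte array and check the top 2 bits of the last byte.
- If the top bit is 0, it's one of the ASCII-style single-byte UTF-8 codes, and you're done.
- If the top bit is 1 and the second-highest bit is 0, it's a continuation byte of a multi-byte sequence and may be the last byte of that sequence, so you need to set it aside for later and look at the byte before it*
- If the top bit is 1 and the second-highest bit is also 1, it's the start of a multi-byte sequence, and you need to work out how many bytes the sequence has by counting the 1 bits before the first 0 bit.
Take a look at the multi-byte table in the Wikipedia entry: http://en.wikipedia.org/wiki/UTF-8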
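As a rough illustration of the three cases above, here is a minimal sketch in plain C (which also compiles as Objective-C). The names `utf8_kind`, `utf8_classify` and `utf8_sequence_length` are made up for this example, and it assumes modern UTF-8 as restricted by RFC 3629, where a sequence is at most 4 bytes long.

```c
#include <stddef.h>

/* The three kinds of byte distinguished by the top two bits. */
typedef enum {
    UTF8_ASCII,        /* 0xxxxxxx: a complete one-byte character */
    UTF8_CONTINUATION, /* 10xxxxxx: second or later byte of a sequence */
    UTF8_LEAD          /* 11xxxxxx: first byte of a multi-byte sequence */
} utf8_kind;

/* Classify a single byte by looking only at its top two bits. */
static utf8_kind utf8_classify(unsigned char b) {
    if ((b & 0x80) == 0x00) return UTF8_ASCII;
    if ((b & 0xC0) == 0x80) return UTF8_CONTINUATION;
    return UTF8_LEAD;
}

/* Total length in bytes announced by a lead byte, i.e. the number of
   1 bits before the first 0 bit. Returns 0 when the byte is not a lead
   byte that RFC 3629 allows (assumption: 5- and 6-byte forms rejected). */
static size_t utf8_sequence_length(unsigned char lead) {
    if ((lead & 0xE0) == 0xC0) return 2; /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3; /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4; /* 11110xxx */
    return 0;
}
```

Note the parentheses around each mask: in C, `==` binds tighter than `&`, so `b & 0x80 == 0` would not test the top bit.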
// assumes that receivedData contains both the leftovers and the new data
const unsigned char *data = (const unsigned char *)[receivedData bytes];
NSUInteger byteCount = [receivedData length];
if (byteCount < 1)
    return nil; // or @"";
unsigned char lastByte = data[byteCount - 1];
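// note the parentheses: == binds tighter than & in C
if ((lastByte & 0x80) == 0) {
    // the last byte is ASCII, so the buffer ends on a character boundary
    NSString *newString = [[NSString alloc] initWithBytes:data length:byteCount
                                                 encoding:NSUTF8StringEncoding];
    // verify success
    // remove bytes from mutable receivedData, or set overflow to empty
    return newString;
}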
// now eat all of the continuation bytes
NSUInteger backCount = 0;
// test data[byteCount - 1] inside the condition so we never read data[-1]
while ((byteCount > 0) && ((data[byteCount - 1] & 0xc0) == 0x80)) {
    backCount++;
    byteCount--;
}
// at this point, either we have exhausted byteCount or data[byteCount - 1] is
// the initial byte of the sequence
// if we exhaust the byte count we're probably in an illegal sequence, as we should
// always have the initial byte in receivedData
if (byteCount < 1) {
    // error!
    return nil;
}
lastByte = data[byteCount - 1];
// at this point, you can either use just byteCount, or you can compute the
// length of the sequence from lastByte (now the initial byte) in order
// to determine if you have exactly the right number of bytes to decode UTF-8.
// requiredBytes is the number of continuation bytes that must follow it.
NSUInteger requiredBytes = 0;
if ((lastByte & 0xe0) == 0xc0) { // 110xxxxx
    // 2 byte sequence
    requiredBytes = 1;
} else if ((lastByte & 0xf0) == 0xe0) { // 1110xxxx
    // 3 byte sequence
    requiredBytes = 2;
} else if ((lastByte & 0xf8) == 0xf0) { // 11110xxx
    // 4 byte sequence
    requiredBytes = 3;
} else if ((lastByte & 0xfc) == 0xf8) { // 111110xx
    // 5 byte sequence (original UTF-8 design only; not valid under RFC 3629)
    requiredBytes = 4;
} else if ((lastByte & 0xfe) == 0xfc) { // 1111110x
    // 6 byte sequence (original UTF-8 design only; not valid under RFC 3629)
    requiredBytes = 5;
} else {
    // shouldn't happen, illegal UTF8 seq
    return nil;
}
// now we know how many continuation bytes we need and we know how many
// (backCount) we have, so either use them, or take the
// initial byte away.
if (requiredBytes == backCount) {
    // we have the right number of bytes
    byteCount += backCount;
} else {
    // we don't have the right number of bytes, so remove the initial byte
    byteCount -= 1;
}
NSString *newString = [[NSString alloc] initWithBytes:data length:byteCount
                                             encoding:NSUTF8StringEncoding];
// verify success
// remove byteCount bytes from mutable receivedData, or set overflow to the
// bytes between byteCount and [receivedData length]
return newString;
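For anyone who needs the same boundary check outside Foundation, here is a condensed sketch of the idea in plain C. It is a variation rather than a line-for-line port: `utf8_complete_prefix` is a made-up name, it assumes sequences of at most 4 bytes (RFC 3629), and it hands malformed tails through unchanged so that the decoder reports them, instead of returning an error itself.

```c
#include <stddef.h>

/* Returns how many leading bytes of buf[0..len) can be decoded without
   cutting a multi-byte sequence in half. The remaining len - result
   bytes should be kept and prepended to the next chunk. */
size_t utf8_complete_prefix(const unsigned char *buf, size_t len) {
    size_t i = len;        /* buf[i - 1] is the byte being inspected */
    size_t trailing = 0;   /* continuation bytes seen at the tail */
    size_t need;           /* continuation bytes the lead byte asks for */
    unsigned char lead;

    /* Step back over continuation bytes (10xxxxxx), at most 3 of them. */
    while (i > 0 && trailing < 3 && (buf[i - 1] & 0xC0) == 0x80) {
        i--;
        trailing++;
    }
    if (i == 0) {
        return len; /* no lead byte in view: empty or malformed input */
    }

    lead = buf[i - 1];
    if ((lead & 0x80) == 0x00)      need = 0; /* 0xxxxxxx */
    else if ((lead & 0xE0) == 0xC0) need = 1; /* 110xxxxx */
    else if ((lead & 0xF0) == 0xE0) need = 2; /* 1110xxxx */
    else if ((lead & 0xF8) == 0xF0) need = 3; /* 11110xxx */
    else return len;                          /* malformed: pass through */

    if (trailing < need) {
        return i - 1; /* incomplete tail: hold back from the lead byte on */
    }
    return len;
}
```

A caller would decode the first `utf8_complete_prefix(buf, len)` bytes and carry the rest over to the next read.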
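Thanks for your answer! (In the meantime I realised that this is the solution.) I do think the code itself has a couple of flaws: (1) I think `data[byteCount]` is out of bounds, and (2) `data` doesn't necessarily contain the start of the multi-byte sequence. I suggest fixing it, for posterity. – user1071136
I fixed the error, oops. As for the possibility of not containing the start of the sequence, that shouldn't happen unless the data is bad (and there is a check for that); otherwise you are expected to put the earlier data at the beginning of receivedData before calling this function (that's what the comment at the top says). So the data should never contain the end of a sequence without its beginning. – gaige
Agreed, my mistake. Thanks again! – user1071136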