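Removing invalid UTF-8 from a string in C/C++

I have a string fetched with curl from a web page, e.g. "view-source:http://vvs24.com/component/contact/1.html", that contains invalid UTF-8 characters.

In my example, 'advsearch': ' Avansert s�k', where � = ø but in some encoding other than UTF-8.

How can I correct this? I need to insert the string into MySQL (using the C library), and it truncates the string at the invalid character, even after mysql_real_escape_string(), as is documented.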

2013-06-26 Stefan Rogin

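It looks as if the server has mislabelled some other encoding as UTF-8. So you have (at least) two options:

1) Work out what the real encoding is, and re-encode it as UTF-8. I think this particular page is Latin-1 (that is what I get when I fetch it with wget), but not every mislabelled page necessarily is. For what it's worth, the most common mislabelling I see in practice is Windows CP-1252 advertised as Latin-1.

2) Blindly strip every character that has the top bit set (that is, every char value outside the range 0 to 127), or replace it with a character such as ?, on the assumption that the real encoding is some 8-bit code page or other and you don't care which.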
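As a rough illustration of the two options, here is a minimal sketch. It assumes the input really is Latin-1 for option 1; the function names `latin1_to_utf8` and `replace_high_bit` are made up for this example and do not come from any library.

```cpp
#include <string>

// Option 1 sketch: treat every byte as an ISO-8859-1 (Latin-1) code point
// and re-encode it as UTF-8. Only correct if the input really is Latin-1.
std::string latin1_to_utf8(const std::string& in) {
    std::string out;
    out.reserve(in.size() * 2);  // worst case: every byte becomes two bytes
    for (unsigned char c : in) {
        if (c < 0x80) {
            out.push_back(static_cast<char>(c));                 // ASCII is unchanged
        } else {
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));   // lead byte 0xC2 or 0xC3
            out.push_back(static_cast<char>(0x80 | (c & 0x3F))); // continuation byte
        }
    }
    return out;
}

// Option 2 sketch: replace every byte that has the top bit set.
std::string replace_high_bit(const std::string& in, char replacement = '?') {
    std::string out(in);
    for (char& ch : out) {
        if (static_cast<unsigned char>(ch) > 127) ch = replacement;
    }
    return out;
}
```

For the string from the question, `latin1_to_utf8("Avansert s\xF8k")` produces the UTF-8 bytes of "Avansert søk", while `replace_high_bit` produces "Avansert s?k". Note that option 1 as written maps bytes 0x80 to 0x9F to C1 control characters; input that is actually CP-1252 would need a lookup table for that range.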


2013-06-26 09:41:59

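I asked something similar [here](http://stackoverflow.com/questions/13155467/14382394#14382394) about detecting the right charset. I have implemented those procedures, but some pages mix UTF-8 with characters in some other encoding, so I handle them as best I can.

So method 2 would be 'for (i = 0; i < [...]; i++) { if ((int)str[i] > 127 && (int)str[i] < 0) str[i] = '?'; }'?

That will not work. For example, ø in UTF-8 has the codes (-61, -72), so I would strip those as well and end up with 2 question marks.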
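To make the numbers in that comment concrete, here is a small sketch showing how the two UTF-8 bytes of ø (0xC3 0xB8) appear as negative values where `char` is signed, and why a plain "top bit set" test also hits valid UTF-8. The helper names are hypothetical and exist only for this illustration.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Report each byte the way a platform with signed 8-bit char would:
// values 128..255 show up as -128..-1.
std::vector<int> as_signed_bytes(const std::string& s) {
    std::vector<int> out;
    for (unsigned char c : s) {
        out.push_back(c >= 128 ? static_cast<int>(c) - 256 : static_cast<int>(c));
    }
    return out;
}

// Count the bytes that a blind "top bit set" filter would strip or replace.
std::size_t count_high_bit_bytes(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if (c > 127) ++n;
    }
    return n;
}
```

`as_signed_bytes("\xC3\xB8")` gives -61 and -72, and `count_high_bit_bytes("s\xC3\xB8k")` is 2, so a correctly encoded ø would become two question marks under option 2.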

Based on what Steve Jessop said, I ended up writing a function that corrects the most common errors and removes the rest.

#include<string> 
#include<iostream> 
using namespace std; 

string correct_non_utf_8(string *str) 
{ 
    int i,f_size=str->size(); 
    unsigned char c,c2,c3,c4; 
    string to; 
    to.reserve(f_size); 

    for(i=0 ; i<f_size ; i++){ 
     c=(unsigned char)(*str)[i]; 
     if(c<32){//control char 
      if(c==9 || c==10 || c==13){//allow only \t \n \r 
       to.append(1,c); 
      } 
      continue; 
     }else if(c<127){//normal ASCII 
      to.append(1,c); 
      continue; 
     }else if(c<160){//DEL and C1 control range (nothing printable is defined here in ASCII, ISO-8859-1 or UTF-8, so skipping)
      if(c==128){//fix Microsoft mess (CP-1252 0x80), add euro sign as UTF-8
       to.append(1,(unsigned char)226);
       to.append(1,(unsigned char)130);
       to.append(1,(unsigned char)172);
      }
      if(c==133){//fix IBM mess, add NEL = \n\r
       to.append(1,(unsigned char)10);
       to.append(1,(unsigned char)13);
      }
      continue; 
     }else if(c<192){//invalid as a lone byte in UTF-8, converting from ISO-8859-1 (160..191 -> 194,c)
      to.append(1,(unsigned char)194); 
      to.append(1,c); 
      continue; 
     }else if(c<194){//192 and 193 are never valid in UTF-8, converting from ISO-8859-1 (195,c-64)
      to.append(1,(unsigned char)195); 
      to.append(1,c-64); 
      continue; 
     }else if(c<224 && i+1<f_size){//possibly 2byte UTF8 
      c2=(unsigned char)(*str)[i+1]; 
      if(c2>127 && c2<192){//valid 2byte UTF8 
       if(c==194 && c2<160){//control char, skipping 
        ; 
       }else{ 
        to.append(1,c); 
        to.append(1,c2); 
       } 
       i++; 
       continue; 
      } 
     }else if(c<240 && i+2<f_size){//possibly 3byte UTF8 
      c2=(unsigned char)(*str)[i+1]; 
      c3=(unsigned char)(*str)[i+2]; 
      if(c2>127 && c2<192 && c3>127 && c3<192){//valid 3byte UTF8 
       to.append(1,c); 
       to.append(1,c2); 
       to.append(1,c3); 
       i+=2; 
       continue; 
      } 
     }else if(c<245 && i+3<f_size){//possibly 4byte UTF8 
      c2=(unsigned char)(*str)[i+1]; 
      c3=(unsigned char)(*str)[i+2]; 
      c4=(unsigned char)(*str)[i+3]; 
      if(c2>127 && c2<192 && c3>127 && c3<192 && c4>127 && c4<192){//valid 4byte UTF8 
       to.append(1,c); 
       to.append(1,c2); 
       to.append(1,c3); 
       to.append(1,c4); 
       i+=3; 
       continue; 
      } 
     } 
     //invalid UTF-8, converting from ISO-8859-1 (c>=245, string too short for a multi-byte sequence, or bad continuation byte)
     to.append(1,(unsigned char)195); 
     to.append(1,c-64); 
    } 
    return to; 
}

References: Wikipedia, utf8-chartable.de

Update 1

Added bounds checking and made a few small improvements.
Still running some tests (use with caution).
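Since the function is still being tested, one way to check its output is to run it through a strict validator. The following is a minimal sketch of such a check, not part of the original answer; the name `is_valid_utf8` is an assumption made for this example. It rejects stray continuation bytes, truncated sequences, overlong forms, surrogates and code points above U+10FFFF.

```cpp
#include <cstddef>
#include <string>

// Returns true if s is well-formed UTF-8.
bool is_valid_utf8(const std::string& s) {
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        std::size_t len;       // total length of this sequence in bytes
        unsigned long cp;      // code point being decoded
        unsigned long min_cp;  // smallest code point allowed for this length
        if (c < 0x80)      { len = 1; cp = c;        min_cp = 0; }
        else if (c < 0xC0) { return false; }  // stray continuation byte
        else if (c < 0xE0) { len = 2; cp = c & 0x1F; min_cp = 0x80; }
        else if (c < 0xF0) { len = 3; cp = c & 0x0F; min_cp = 0x800; }
        else if (c < 0xF8) { len = 4; cp = c & 0x07; min_cp = 0x10000; }
        else               { return false; }  // 0xF8..0xFF never occur in UTF-8
        if (len > n - i) return false;         // sequence is cut off by end of string
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char cc = static_cast<unsigned char>(s[i + k]);
            if ((cc & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (cc & 0x3F);
        }
        if (cp < min_cp) return false;                   // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // UTF-16 surrogate
        if (cp > 0x10FFFF) return false;                 // beyond the Unicode range
        i += len;
    }
    return true;
}
```

The raw string from the question, "Avansert s\xF8k", fails this check, while its repaired form "Avansert s\xC3\xB8k" passes. The repair function above only tests that continuation bytes fall in the range 128 to 191, so overlong or surrogate sequences in the input would be copied through and still be rejected by a strict check like this one.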


2013-08-20 12:34:03

You need to check the string bounds when reading multiple characters – pogorskiy

@pogorskiy I guess you are referring to '(unsigned char)(*str)[i+1]'; yes, you are right. I will update. Thanks
