Accurate JSON text encoding detection

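RFC 4627 describes a method for identifying the Unicode encoding when no BOM is present. It relies on the first two characters of a JSON text always being ASCII characters. RFC 7159, however, defines a JSON text as "ws value ws", which means a single string value is also valid. The first character would then be the opening quotation mark, but it can be followed by any Unicode character allowed in a string. Given that RFC 7159 also discourages the use of a BOM and no longer describes the procedure for detecting the encoding from the first 4 bytes (octets), how should the encoding be detected? UTF-32 should still work as described in RFC 4627, because the first character is four bytes and should still be ASCII, but what about UTF-16? The second (2-byte) character may not contain a zero byte to help identify the correct encoding.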

2015-10-21 Jannie Gerber

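Interesting question. Determining the Unicode scheme has certainly become more challenging. For example, a single byte whose ASCII value is a digit is also valid JSON: a single digit in UTF-8. How about you propose a solution, and then we can discuss it? – CouchDeveloper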

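I'm afraid I don't have a solid suggestion. Obviously, check for a BOM first, just in case. Next, check for UTF-32: if the first 3 bytes are zero, it is UTF-32BE; otherwise, if the 3 bytes after the first byte are zero, it is UTF-32LE. Up to this point the tests should be reliable, but this is where the problem appears. To test for UTF-16 we still need to look at 4 bytes. If we could assume that the first two characters are always ASCII, we would test whether the first and third bytes are zero for UTF-16BE, and whether the second and fourth bytes are zero for UTF-16LE. But what if the second character can be non-ASCII? –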

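Would it help if I gave you a starting point, an implementation in C++ for detecting the encoding? As far as I can tell you don't need BOM detection, but I have that implemented in C++ as well. – CouchDeveloper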

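After taking a look at an implementation I came up with a few years ago, I can tell that it is possible to unambiguously detect the Unicode scheme from just one character, given the following assumptions:

The input must be Unicode
The first character must be ASCII
There must be no BOM

With this in mind:

Suppose the first character is "[" (0x5B), an ASCII character. Then we get these byte patterns: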

UTF_32LE: 5B 00 00 00 
UTF_32BE: 00 00 00 5B 
UTF_16LE: 5B 00 xx xx 
UTF_16BE: 00 5B xx xx 
UTF_8:  5B xx xx xx

where "xx" is either EOF or any other byte.
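To make the table concrete, here is a minimal sketch that produces the leading bytes of one ASCII character in each encoding form. The helper name `encode_ascii` and the scheme strings are made up for illustration and are not part of the implementation further down.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper (not from the answer's code): returns the bytes of a
// single ASCII character in the named encoding form.
std::vector<std::uint8_t> encode_ascii(char ch, const std::string& scheme)
{
    const std::uint8_t b = static_cast<std::uint8_t>(ch);
    if (scheme == "UTF-8")    return {b};
    if (scheme == "UTF-16LE") return {b, 0x00};
    if (scheme == "UTF-16BE") return {0x00, b};
    if (scheme == "UTF-32LE") return {b, 0x00, 0x00, 0x00};
    if (scheme == "UTF-32BE") return {0x00, 0x00, 0x00, b};
    return {};  // unknown scheme name
}
```

Calling `encode_ascii('[', "UTF-16BE")` gives the two bytes 00 5B, which is the fixed part of the UTF_16BE row; whatever follows belongs to the next character and is what the table marks as xx.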

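We should also note that, according to RFC 7159, the shortest valid JSON text can be just one character. That is, it may be 1, 2, or 4 bytes long, depending on the Unicode scheme.

So, if it helps, here is the implementation in C++: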

#include <cstdint>  // uint8_t, uint32_t
#include <cstdio>   // EOF

namespace json { 

    // 
    // Detect Encoding 
    // 
    // Tries to determine the Unicode encoding of the input starting at 
    // first. A BOM must not be present. You can check for one with 
    // json::unicode::detect_bom(); if a BOM is found, there is no need to 
    // call this function. 
    // 
    // Return values: 
    // 
    // json::unicode::UNICODE_ENCODING_UTF_8 
    // json::unicode::UNICODE_ENCODING_UTF_16LE 
    // json::unicode::UNICODE_ENCODING_UTF_16BE 
    // json::unicode::UNICODE_ENCODING_UTF_32LE 
    // json::unicode::UNICODE_ENCODING_UTF_32BE 
    // 
    // -1:  unexpected EOF 
    // -2:  unknown encoding 
    // 
    // Note: 
    // detect_encoding() needs to read ahead a few bytes in order to 
    // determine the encoding. With InputIterators, the consequence is 
    // that these iterators cannot be reused, for example by a parser. 
    // Usually, this means resetting the istreambuf, that is, using the 
    // member functions pubseekpos() or pubseekoff() to move the get 
    // pointer of the stream buffer back to its initial position. 
    // However, certain istreambuf implementations may not be able to set the 
    // stream position to arbitrary positions. In that case, this method cannot 
    // be used and other educated guesses to determine the encoding may be 
    // needed. 

    template <typename Iterator>  
    inline int 
    detect_encoding(Iterator first, Iterator last) 
    { 
     // Assuming the input is Unicode! 
     // Assuming first character is ASCII! 

     // The first character must be an ASCII character, say a "[" (0x5B) 

     // UTF_32LE: 5B 00 00 00 
     // UTF_32BE: 00 00 00 5B 
     // UTF_16LE: 5B 00 xx xx 
     // UTF_16BE: 00 5B xx xx 
     // UTF_8:  5B xx xx xx 

     uint32_t c = 0xFFFFFF00; 

     while (first != last) { 
      uint32_t ascii; 
      if (static_cast<uint8_t>(*first) == 0) 
       ascii = 0; // zero byte 
      else if (static_cast<uint8_t>(*first) < 0x80) 
       ascii = 0x01; // ascii byte 
      else if (*first == EOF) 
       break; 
      else 
       ascii = 0x02; // non-ascii byte, that is a lead or trail byte 
      c = c << 8 | ascii; 
      switch (c) { 
        // reading first byte 
       case 0xFFFF0000: // first byte was 0 
       case 0xFFFF0001: // first byte was ASCII 
        ++first; 
        continue; 
       case 0xFFFF0002: 
        return -2; // this is bogus 

        // reading second byte 
       case 0xFF000000: // 00 00 
        ++first; 
        continue; 
       case 0xFF000001: // 00 01 
        return json::unicode::UNICODE_ENCODING_UTF_16BE; 
       case 0xFF000100: // 01 00 
        ++first; 
        continue; 
       case 0xFF000101: // 01 01 
       case 0xFF000102: // 01 02: ASCII followed by a UTF-8 lead byte 
        return json::unicode::UNICODE_ENCODING_UTF_8; 

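        // reading third byte:  
       case 0x00000000: // 00 00 00 
       case 0x00010000: // 01 00 00 
        ++first; 
        continue;      
        //case 0x00000100: // 00 01 00 na 
        //case 0x00000101: // 00 01 01 na 
       case 0x00010001: // 01 00 01 
       case 0x00010002: // 01 00 02: low byte of the second code unit is >= 0x80 
        return json::unicode::UNICODE_ENCODING_UTF_16LE; 

        // reading fourth byte  
       case 0x01000000: // 01 00 00 00 
        return json::unicode::UNICODE_ENCODING_UTF_32LE; 
       case 0x01000001: // 01 00 00 01 
       case 0x01000002: // 01 00 00 02: second code unit has a zero low byte 
        return json::unicode::UNICODE_ENCODING_UTF_16LE; 
       case 0x00000001: // 00 00 00 01 (the bogus three-byte pattern 00 00 01 
                        // yields the same state value and also lands here) 
        return json::unicode::UNICODE_ENCODING_UTF_32BE; 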

       default: 
        return -2; // could not determine encoding, that is, 
           // assuming the first byte is an ASCII. 
      } // switch 
     } // while 

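     // Input ended before a decision was made. A lone ASCII byte is a 
     // one-character UTF-8 text; an ASCII byte followed by a single zero 
     // byte is a one-character UTF-16LE text. 
     if (c == 0xFFFF0001) 
      return json::unicode::UNICODE_ENCODING_UTF_8; 
     if (c == 0xFF000100) 
      return json::unicode::UNICODE_ENCODING_UTF_16LE; 

     // premature EOF 
     return -1; 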
    } 
}
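
The note in the comments above mentions resetting the stream buffer after reading ahead. A minimal sketch of that step could look like the following, assuming the stream buffer supports seeking. The helper name `peek_prefix` is made up for illustration and is not part of the code above.

```cpp
#include <cassert>
#include <cstddef>
#include <ios>
#include <sstream>
#include <streambuf>
#include <string>

// Hypothetical helper: copies up to `n` bytes from the current position of
// `buf` and then moves the get pointer back to where it was. Returns an empty
// string if the stream buffer cannot report or restore its position.
std::string peek_prefix(std::streambuf& buf, std::size_t n)
{
    const std::streambuf::pos_type failed(std::streambuf::off_type(-1));

    // Remember the current get position.
    const std::streambuf::pos_type start =
        buf.pubseekoff(0, std::ios_base::cur, std::ios_base::in);
    if (start == failed)
        return {};

    // Read ahead; sgetn() returns how many bytes were actually available.
    std::string prefix(n, '\0');
    const std::streamsize got =
        buf.sgetn(&prefix[0], static_cast<std::streamsize>(n));
    prefix.resize(static_cast<std::size_t>(got));

    // Rewind so that a parser can start from the first byte again.
    if (buf.pubseekpos(start, std::ios_base::in) == failed)
        return {};
    return prefix;
}
```

The returned bytes can then be handed to detect_encoding() as an iterator range, for example `detect_encoding(p.begin(), p.end())`, while the stream itself is still positioned at the first byte.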

2015-10-21 15:39:09 CouchDeveloper

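Thanks for your code. My original code followed the procedure described in RFC 4627, which looks at the first 4 bytes: 00 00 00 xx UTF-32BE, 00 xx 00 xx UTF-16BE, xx 00 00 00 UTF-32LE, xx 00 xx 00 UTF-16LE, xx xx xx xx UTF-8. With RFC 7159 allowing a single string value, UTF-16 can produce something like 00 xx xx 00 or xx 00 00 xx. So I simply changed mine as follows: 00 00 00 xx UTF-32BE, xx 00 00 00 UTF-32LE, 00 xx UTF-16BE, xx 00 UTF-16LE, everything else UTF-8. As far as I can see, your code should work as is. –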
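
For reference, the revised table from this comment could be sketched as follows. This is only one reading of it, assuming no BOM, Unicode input, and an ASCII first character; the names `Encoding` and `detect_revised` are illustrative and not taken from the thread's code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Illustrative names, not from the thread's code.
enum class Encoding { Utf8, Utf16LE, Utf16BE, Utf32LE, Utf32BE };

// Applies the revised table in order: the two UTF-32 patterns first, then
// the two UTF-16 patterns, and UTF-8 for everything else.
Encoding detect_revised(const std::uint8_t* p, std::size_t n)
{
    if (n >= 4) {
        if (p[0] == 0 && p[1] == 0 && p[2] == 0 && p[3] != 0)
            return Encoding::Utf32BE;  // 00 00 00 xx
        if (p[0] != 0 && p[1] == 0 && p[2] == 0 && p[3] == 0)
            return Encoding::Utf32LE;  // xx 00 00 00
    }
    if (n >= 2) {
        if (p[0] == 0 && p[1] != 0)
            return Encoding::Utf16BE;  // 00 xx
        if (p[0] != 0 && p[1] == 0)
            return Encoding::Utf16LE;  // xx 00
    }
    return Encoding::Utf8;             // everything else
}
```

Checking UTF-32 before UTF-16 matters here: xx 00 00 00 read as UTF-16LE would put a raw U+0000 in second position, which valid JSON does not allow unescaped, so that pattern can safely be taken as UTF-32LE. A string starting with a quotation mark and U+00E9 in UTF-16LE (22 00 E9 00) falls through to the xx 00 rule, which is the case the original four-byte table missed.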
