Lua cjson can't decode a specific Unicode character?

When I try to decode a specific Unicode character with Lua cjson, I get the following error:

[email protected]:~/torch-rnn# th train.lua -input_h5 data/aud.h5 -input_json data/aud.json -batch_size 50 -seq_length 100 -rnn_size 256 -max_epochs 50 
Running with CUDA on GPU 0 
/root/torch/install/bin/luajit: train.lua:77: Expected value but found invalid unicode escape code at character 350873 
stack traceback: 
    [C]: in function 'read_json' 
    train.lua:77: in main chunk 
    [C]: in function 'dofile' 
    /root/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk 
    [C]: at 0x00406670

I can see that read_json in train.lua uses cjson behind the scenes.

The problematic Unicode escape is \uda85.

If I go to https://www.branah.com/unicode-converter, it tells me that the escape should decode to a character.

The Unicode escape was generated with Python's unichr(55941) and written to a file by redirecting the output of a Python script, with PYTHONIOENCODING=UTF-8 set.

The following shows how the character is generated:

echo "print unichr(55941)" > test.py 
python test.py 
Traceback (most recent call last): 
    File "test.py", line 1, in <module> 
    print unichr(55941) 
UnicodeEncodeError: 'ascii' codec can't encode character u'\uda85' in position 0: ordinal not in range(128) 

# export PYTHONIOENCODING=UTF-8 
# python test.py 
��� 
# python test.py > tfile 
# cat tfile 
��� 
# python 
Python 2.7.6 (default, Jun 22 2015, 17:58:13) 
[GCC 4.8.2] on linux2 
Type "help", "copyright", "credits" or "license" for more information. 
>>> f=open("tfile",'r') 
>>> s=f.readline() 
>>> s 
'\xed\xaa\x85\n' 
>>> print s 
��� 

>>> s.decode('utf-8') 
u'\uda85\n'

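What I'm trying to do overall is take a set of integers in the range 0-65535, map them to UTF-8 characters using Python, and write them to a file. I then want to use torch-rnn, which is written in Lua, to train an RNN on the character sequence. The error occurs when I run train.lua on the files generated by torch-rnn's preprocess.py Python script.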

Source

2016-09-04 Matt Warren

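'\uda85' is the first code unit of a surrogate pair. The first code unit must be followed by a second one (dc00-dfff) to complete a Unicode character. A first half without the second half is an error.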
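To illustrate the comment above: in UTF-16, a high surrogate and a low surrogate together encode one code point above U+FFFF, which is why either half alone is meaningless. Below is a minimal Python 3 sketch of the standard combination formula. The function name `combine_surrogates` is made up for this example and is not part of cjson or torch-rnn.

```python
def combine_surrogates(high, low):
    """Combine a UTF-16 surrogate pair into a single code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("not a high surrogate: %#06x" % high)
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("not a low surrogate: %#06x" % low)
    # Each half contributes 10 bits; the result is offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# The pair D83D DE00 encodes U+1F600.
print(hex(combine_surrogates(0xD83D, 0xDE00)))  # 0x1f600
```

Called with 0xDA85 as the high half and anything outside DC00-DFFF as the low half, the function raises ValueError, which mirrors the situation a strict JSON decoder is in when it sees \uda85 on its own.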

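Ah, interesting, thanks. Do you know of a list that shows all the surrogate code points? In this application I can simply switch them to different values, so I could hard-code a check for them. - Also, out of interest, how does the decoding site I linked produce a valid character when I give it only this one escape to decode?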

@MattWarren The leading, or "high surrogate", range is D800-DBFF, and the trailing, or "low surrogate", range is DC00-DFFF. See https://en.wikipedia.org/wiki/Universal_Character_Set_characters#Surrogates - Leon
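Given those two ranges, a hard-coded check is short. The following Python 3 sketch treats the input as a list of 16-bit integers (as in the question) and reports the positions of surrogates that are not part of a valid high/low pair. The names `is_surrogate` and `find_lone_surrogates` are hypothetical helpers for illustration only.

```python
HIGH_SURROGATES = range(0xD800, 0xDC00)  # D800-DBFF
LOW_SURROGATES = range(0xDC00, 0xE000)   # DC00-DFFF


def is_surrogate(code):
    """True if the 16-bit value lies anywhere in D800-DFFF."""
    return code in HIGH_SURROGATES or code in LOW_SURROGATES


def find_lone_surrogates(units):
    """Return indices of surrogates that are not part of a high+low pair."""
    lone = []
    i = 0
    while i < len(units):
        unit = units[i]
        if unit in HIGH_SURROGATES:
            if i + 1 < len(units) and units[i + 1] in LOW_SURROGATES:
                i += 2  # valid pair, skip both halves
                continue
            lone.append(i)
        elif unit in LOW_SURROGATES:
            lone.append(i)  # low surrogate with no preceding high one
        i += 1
    return lone


print(find_lone_surrogates([0x41, 0xDA85, 0x42]))  # [1]
```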

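It seems the problem is Unicode surrogates. Knowing this, I can filter them out or switch them to different values. In this use case that isn't a big problem.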
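One way to do the switch, sketched in Python 3 rather than the Python 2.7 used in the question: shift every value at or above 0xD800 up by 0x800, so the whole 0-65535 range lands on code points outside D800-DFFF and the mapping stays reversible. The names `int_to_char` and `char_to_int` are made up for this example, and I have not verified how torch-rnn or cjson treat the resulting characters above U+FFFF.

```python
SURROGATE_START = 0xD800
SURROGATE_COUNT = 0x800  # size of the D800-DFFF block


def int_to_char(n):
    """Map an integer in 0-65535 to a character, skipping the surrogates."""
    if not 0 <= n <= 0xFFFF:
        raise ValueError("value out of range: %r" % (n,))
    if n >= SURROGATE_START:
        n += SURROGATE_COUNT  # 0xD800 -> 0xE000, ..., 0xFFFF -> 0x107FF
    return chr(n)


def char_to_int(ch):
    """Inverse of int_to_char."""
    cp = ord(ch)
    if cp >= SURROGATE_START + SURROGATE_COUNT:
        return cp - SURROGATE_COUNT
    if cp >= SURROGATE_START:
        raise ValueError("surrogate code point: %#06x" % cp)
    return cp


# 55941 (0xDA85) no longer maps to a lone surrogate, so UTF-8 encoding works.
ch = int_to_char(55941)
print(hex(ord(ch)), ch.encode("utf-8"))
```

The shifted values for 0xD800-0xF7FF land in the private use area (U+E000 and up), and the top of the range ends at U+107FF, so the output uses four-byte UTF-8 sequences for some inputs.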

Source

2016-09-04 19:17:01
