Is there a function to decode an encoded Unicode UTF-8 string?

I want to store some data from an HTML form using Rebol CGI. My form looks like this:

<form action="test.cgi" method="post" > 

    Input: 

    <input type="text" name="field"/> 
    <input type="submit" value="Submit" /> 

</form>

But for Unicode characters such as Chinese, I get the data in percent-encoded form, for example %E4%BA%BA.

(This is the Chinese character "人" ... its UTF-8 encoding, written as a Rebol binary literal, is #{E4BABA}.)
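As a quick check of that byte sequence, the round trip can be tried in a console. This is a minimal sketch that assumes Rebol 3 (where string! holds Unicode and conversion to and from binary! uses UTF-8) and a script or console that accepts UTF-8 input:

```rebol
;-- Assumes Rebol 3: string! is Unicode, binary! conversion uses UTF-8
probe to binary! "人"       ;-- the character's UTF-8 bytes
probe to string! #{E4BABA}  ;-- the same three bytes decoded back to a string
```

Each byte of the binary corresponds to one %XX triplet in the percent-encoded form.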

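Is there a system function, or an existing library, that can decode this directly? dehex does not currently seem to cover this case. For now I decode it manually by removing the percent signs and building the corresponding binary, like this: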

data: to-string read system/ports/input 
print data 

;-- this prints "field=%E4%BA%BA" 

k-v: parse data "=" 
print k-v 

;-- this prints ["field" "%E4%BA%BA"] 

v: append insert replace/all k-v/2 "%" "" "#{" "}" 
print v 

;-- This prints "#{E4BABA}" ... a string!, not binary! 
;-- LOAD will help construct the corresponding binary 
;-- then TO-STRING will decode that binary from UTF-8 to character codepoints 

write %test.txt to-string load v

2013-08-20 Wayne Cui

I have a library called AltWebForm that encodes/decodes percent-encoded web form data:

do http://reb4.me/r3/altwebform 
load-webform "field=%E4%BA%BA"

The library is described here: Rebol and Web Forms.

2013-08-20 15:47:40 rgchris

This looks related to ticket #1986, which discusses whether this is a "bug" or whether the Internet has changed out from under its own spec:

Have DEHEX convert UTF-8 sequences from browsers as Unicode.

If you have specific experience with what has become the standard for Chinese, and want to weigh in, that would be valuable.

As an aside, the specific case above can alternatively be handled in PARSE like this:

key-value: {field=%E4%BA%BA} 

utf8-bytes: copy #{} 

either parse key-value [ 
    copy field-name to {=} 
    skip 
    some [
        and {%}
        copy enhexed-byte 3 skip (
            append utf8-bytes dehex enhexed-byte
        )
    ]
] [ 
    print [field-name {is} to string! utf8-bytes] 
] [ 
    print {Malformed input.} 
]

This will output:

field is 人

With some comments included:

key-value: {field=%E4%BA%BA} 

;-- Generate empty binary value by copying an empty binary literal  
utf8-bytes: copy #{} 

either parse key-value [ 

    ;-- grab field-name as the chars right up to the equals sign 
    copy field-name to {=} 

    ;-- skip the equal sign as we went up to it, without moving "past" it 
    skip 

    ;-- apply the enclosed rule SOME (non-zero) number of times 
    some [
        ;-- match a percent sign as the immediate next symbol, without
        ;-- advancing the parse position
        and {%}

        ;-- grab the next three chars, starting with %, into enhexed-byte
        copy enhexed-byte 3 skip (

            ;-- If we get to this point in the match rule, this parenthesized
            ;-- expression lets us evaluate non-dialected Rebol code to
            ;-- append the dehexed byte to our utf8 binary
            append utf8-bytes dehex enhexed-byte
        )
    ]
] [ 
    print [field-name {is} to string! utf8-bytes] 
] [ 
    print {Malformed input.} 
]
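The rule above only accepts a value made up entirely of %XX triplets. As a rough sketch of how the same idea might be generalized to values that mix literal characters with escapes, the helper below collects bytes with debase/base instead of dehex. The name decode-utf8-percent and its local words are hypothetical (they are not part of Rebol or of AltWebForm), the code assumes Rebol 3, and it does not handle the "+" that form encoding uses for spaces:

```rebol
;-- Hypothetical helper, assuming Rebol 3; not a built-in function
decode-utf8-percent: func [
    {Decode %XX escapes in TEXT, treating the resulting bytes as UTF-8.}
    text [string!]
    /local bytes hex-digit literal-char hex plain
][
    bytes: copy #{}
    hex-digit: charset "0123456789ABCDEFabcdef"
    literal-char: complement charset "%"

    either parse text [
        any [
            ;-- a percent sign followed by exactly two hex digits is one raw byte
            "%" copy hex 2 hex-digit (append bytes debase/base hex 16)
            |
            ;-- a run of ordinary characters is appended as its UTF-8 bytes
            copy plain some literal-char (append bytes to binary! plain)
        ]
    ][
        ;-- decode the collected bytes from UTF-8 into a string
        to string! bytes
    ][
        ;-- a stray percent sign without two hex digits: report failure
        none
    ]
]

print decode-utf8-percent "%E4%BA%BA"
```

Collecting everything into one binary before converting matters because a single character may span several escapes, so the bytes cannot be decoded one triplet at a time. The key would still need to be separated from the value at the equals sign before calling this.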

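(Note that "simple parse" is getting the axe in favor of enhancements to SPLIT. So code like parse data "=" can now be written as split data "=" instead, along with other cool variants if you check them out ... there are samples in the ticket.)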
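Applied to the form data from the question, that substitution might look like the sketch below. It assumes a Rebol 3 build whose SPLIT accepts a string delimiter, as described in the ticket:

```rebol
;-- Assumes a Rebol 3 build where SPLIT accepts a string delimiter
data: "field=%E4%BA%BA"

;-- replaces the older "simple parse" form: parse data "="
k-v: split data "="
probe k-v
```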

2013-08-20 17:29:19 HostileFork

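This approach is much clearer. Using 'load v' to construct the binary does not feel natural. The two links to http://curecode.org/ are great, and I will read them more carefully. Is there a small error in your code, or does my version not support it? The code '{%} -1 skip' does not work in my console (Script error: value out of range: -1). I changed it to '{%}' and it works. Finally, thanks a lot for the formatting and reorganization.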

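@WayneTsui No problem, and sorry about the error... I must have copied from a version I tried and thought was working, but wasn't. One problem with using TO is that it advances the parse position up to that rule... so it would accept malformed input like "field=x123\abc%E4BA%BA". I will look into how to skip backward properly, but AND ... – HostileFork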
