python
  • xml
  • unicode
  • utf-8
  • 2017-04-06 64 views 1 likes 
    1

    What is the list of invalid Unicode characters in XML attributes/tags?

    The following Python 3 code illustrates the problem:

    import xml.etree.ElementTree as ET 
    from io import StringIO as sio 
    
    xml_dec = '<?xml version="1.1" encoding="UTF-8"?>' 
    unicode_text = '<root>textº</root>' 
    valid_unicode = '<標籤 屬性="值">文字</標籤>' 
    invalid_unicode_attribute = '<tag attributeº="value">text</tag>' 
    invalid_unicode_tag = '<tagº>text</tagº>' 
    
    ET.parse(sio(xml_dec + unicode_text)) 
    # works 
    
    ET.parse(sio(xml_dec + valid_unicode)) 
    # works 
    
    ET.parse(sio(xml_dec + invalid_unicode_attribute)) 
    # ParseError 
    
    ET.parse(sio(xml_dec + invalid_unicode_tag)) 
    # ParseError 
    

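    The Unicode character º, i.e. U+00BA, can be parsed when it appears in an element's text, but not when it appears in an attribute name or a tag name. Other Unicode characters, such as Chinese characters, can be parsed in attribute names and tag names.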

    I checked the XML <?xml version="1.1" encoding="UTF-8"?><tagº>text</tagº> at https://validator.w3.org/check, and it gave the error:

    Line 1, Column 43: character "º" not allowed in attribute specification list

    However, XML Recommendation 1.1, §2.2 Characters, says the character is allowed:

    Char ::= [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */

    My question is: where can I find the list of Unicode characters that are invalid in XML attribute names and tag names?

    +0

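    Is this about attributes or about tag names? The title and the last sentence talk about attributes, but the examples are only about text and tags. – lenz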

    +2

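    In any case, you only need to scroll a little further in the document you linked yourself. For example, [here](https://www.w3.org/TR/xml11/#NT-NameStartChar) is the definition of which characters you can use in tag names. – lenz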

    Answers

    2

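    On the characters allowed in tag names and attribute names, the W3C recommendation (the one you linked yourself, although you were looking at the definition of what can be used in text nodes) states the following: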

    Almost all characters are permitted in names, except those which either are or reasonably could be used as delimiters.

    Document authors are encouraged to use names which are meaningful words or combinations of words in natural languages, and to avoid symbolic or white space characters in names. Note that COLON, HYPHEN-MINUS, FULL STOP (period), LOW LINE (underscore), and MIDDLE DOT are explicitly permitted.

    The ASCII symbols and punctuation marks, along with a fairly large group of Unicode symbol characters, are excluded from names because they are more useful as delimiters in contexts where XML names are used outside XML documents; providing this group gives those contexts hard guarantees about what cannot be part of an XML name.

    This is followed by a formal definition that lists a number of Unicode ranges:

    NameStartChar ::= ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | 
            [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] | 
            [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | 
            [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | 
            [#x10000-#xEFFFF] 
    NameChar  ::= NameStartChar | "-" | "." | [0-9] | #xB7 | 
            [#x0300-#x036F] | [#x203F-#x2040] 
    Name   ::= NameStartChar (NameChar)* 
    

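    The masculine ordinal indicator º (#xBA) is not among them, for whatever reason (at least some languages use it in abbreviations of common words, so it does not look much like a "delimiter").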
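
    The ranges in the grammar above can be transcribed into a small checker. This is only a sketch: the function names (is_name_start_char, is_name_char, is_xml_name, invalid_name_chars) are made up for illustration, and the tables follow the XML 1.1 productions quoted here, so a particular parser may draw the line slightly differently for some characters.

```python
# Ranges transcribed from the NameStartChar production quoted above.
NAME_START_RANGES = [
    (0x3A, 0x3A),  # ":"
    (0x41, 0x5A),  # A-Z
    (0x5F, 0x5F),  # "_"
    (0x61, 0x7A),  # a-z
    (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0x2FF),
    (0x370, 0x37D), (0x37F, 0x1FFF), (0x200C, 0x200D),
    (0x2070, 0x218F), (0x2C00, 0x2FEF), (0x3001, 0xD7FF),
    (0xF900, 0xFDCF), (0xFDF0, 0xFFFD), (0x10000, 0xEFFFF),
]

# Additional characters that NameChar allows after the first position.
NAME_EXTRA_RANGES = [
    (0x2D, 0x2E),  # "-" and "."
    (0x30, 0x39),  # 0-9
    (0xB7, 0xB7),  # MIDDLE DOT
    (0x0300, 0x036F),
    (0x203F, 0x2040),
]


def _in_ranges(ch, ranges):
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in ranges)


def is_name_start_char(ch):
    return _in_ranges(ch, NAME_START_RANGES)


def is_name_char(ch):
    return is_name_start_char(ch) or _in_ranges(ch, NAME_EXTRA_RANGES)


def is_xml_name(name):
    """Name ::= NameStartChar (NameChar)*"""
    return (bool(name)
            and is_name_start_char(name[0])
            and all(is_name_char(ch) for ch in name[1:]))


def invalid_name_chars(name):
    """List (index, char, code point) for characters not allowed at their position."""
    bad = []
    for i, ch in enumerate(name):
        ok = is_name_start_char(ch) if i == 0 else is_name_char(ch)
        if not ok:
            bad.append((i, ch, "U+%04X" % ord(ch)))
    return bad


for candidate in ["tag", "tagº", "標籤", "attributeº"]:
    print(candidate, is_xml_name(candidate), invalid_name_chars(candidate))
```

    There is no separate "list of invalid characters" in the recommendation; the invalid set is simply everything outside these ranges, which is why the sketch tests membership rather than enumerating exclusions.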

    It is also interesting to see that you can use digits, hyphens and periods in tag names, but not as the first character.
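
    The first-character rule can be checked against the parser used in the question. The helper name parses is hypothetical; the sketch only reports whether xml.etree.ElementTree accepts each document, so the result reflects that parser's behaviour and not the recommendation itself.

```python
import xml.etree.ElementTree as ET


def parses(xml_text):
    """Return True if ElementTree accepts xml_text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Digits, hyphens and periods after the first character, then as the first character.
for name in ["a1", "a-b", "a.b", "1a", "-a", ".a"]:
    print(repr(name), parses("<{0}/>".format(name)))
```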
