2010-05-24 97 views
8

Filtering illegal XML characters in Java

The XML specification defines the subset of Unicode characters that are allowed in an XML document: http://www.w3.org/TR/REC-xml/#charsets

How can I filter these characters out of a string in Java?

A simple test case:

Assert.assertEquals("", filterIllegalXML("" + Character.valueOf((char) 2)));
+0

Why are you getting these "illegal" XML characters? And once you find them, what do you want to do with them? Delete them? Replace them? – 2010-05-24 13:11:59

+0

@RH: Ignoring them is sufficient. The best solution would be to remove them and get some kind of report, so that I can log a warning. – 2010-05-24 13:15:47

+0

In case anyone is wondering, I used Xerces' 'XMLChar', as ZZ Coder suggested. You can find the whole method here: http://pastebin.com/6Vbm1zuC – 2010-05-25 06:15:58

Answers

5

Finding all the invalid characters for XML is not trivial. You need to call, or reimplement, XMLChar.isInvalid() from Xerces:

http://kickjava.com/src/org/apache/xerces/util/XMLChar.java.htm

+0

+1, nice find.. – Bozho 2010-05-24 13:53:04

+0

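That class is pretty involved [read: hard to understand, no thanks to parts of it being machine generated], and it also requires instantiating and pre-populating a 64K CHARS array... – rogerdpack 2014-12-09 21:16:49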

0

Use StringEscapeUtils.escapeXml(xml) from commons-lang. This will escape the characters, not filter them.

+2

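I have already used this method to escape entities (e.g. '<' to '&lt;'), but that is different. The method doesn't seem to filter any illegal characters. My 'test case' fails. – 2010-05-24 13:06:37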

+2

Show the test case. – Bozho 2010-05-24 13:09:25

+0

As mentioned above: 'assertEquals("", StringEscapeUtils.escapeXml("" + Character.valueOf((char) 2)));' – 2010-05-24 13:14:00

1

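This page includes an example of a Java method that strips out invalid XML characters by testing whether each character is within the spec, although it does not check for the highly discouraged characters.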

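By the way, escaping the characters is not a solution, because the XML 1.0 and 1.1 specifications do not allow the invalid characters in escaped form either.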
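As a rough illustration of the per-character approach described above, here is a minimal sketch that uses only the JDK. It assumes the XML 1.0 Char production from the spec linked in the question. The class name `XmlCharFilter`, the helper `isLegalXml10`, and the `removed` list are hypothetical names chosen for this example; they are not taken from the linked page or from Xerces. The `removed` list is one possible way to get the "report" the asker mentions, so that a warning can be logged.

```java
import java.util.ArrayList;
import java.util.List;

final class XmlCharFilter {

    // XML 1.0 Char production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isLegalXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Copies legal code points into the result. Illegal code points are
    // dropped and, if 'removed' is not null, recorded there for reporting.
    static String filterIllegalXML(String in, List<Integer> removed) {
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); ) {
            // codePointAt returns a lone surrogate as-is, which is then rejected
            int cp = in.codePointAt(i);
            if (isLegalXml10(cp)) {
                out.appendCodePoint(cp);
            } else if (removed != null) {
                removed.add(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    // Convenience overload matching the signature used in the question's test case
    static String filterIllegalXML(String in) {
        return filterIllegalXML(in, null);
    }
}
```

Iterating by code point rather than by `char` keeps valid surrogate pairs intact while still rejecting unpaired surrogates. Like the linked example, this sketch does not treat the "discouraged" characters specially.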

+1

The link is dead... it looks like this might be the new URL? http://benjchristensen.com/2008/02/07/how-to-strip-invalid-xml-characters/ – Michael 2012-01-27 15:05:32

+0

Updated the link, thanks – 2012-01-28 01:03:32

0

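Here is a solution that takes care of the raw character as well as the escaped character in the stream, and that works with StAX or SAX. It needs extending for the other invalid characters, but you get the idea.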

import java.io.BufferedReader; 
import java.io.File; 
import java.io.FileInputStream; 
import java.io.FileOutputStream; 
import java.io.IOException; 
import java.io.InputStream; 
import java.io.InputStreamReader; 
import java.io.OutputStreamWriter; 
import java.io.Reader; 
import java.io.UnsupportedEncodingException; 
import java.io.Writer; 

import org.apache.commons.io.IOUtils; 
import org.apache.xerces.util.XMLChar; 

public class IgnoreIllegalCharactersXmlReader extends Reader { 

    private final BufferedReader underlyingReader; 
    private StringBuilder buffer = new StringBuilder(4096); 
    private boolean eos = false; 

    public IgnoreIllegalCharactersXmlReader(final InputStream is) throws UnsupportedEncodingException { 
     underlyingReader = new BufferedReader(new InputStreamReader(is, "UTF-8")); 
    } 

    private void fillBuffer() throws IOException { 
     final String line = underlyingReader.readLine(); 
     if (line == null) { 
      eos = true; 
      return; 
     } 
     buffer.append(line); 
     buffer.append('\n'); 
    } 

    @Override 
    public int read(char[] cbuf, int off, int len) throws IOException { 
     if(buffer.length() == 0 && eos) { 
      return -1; 
     } 
     int satisfied = 0; 
     int currentOffset = off; 
     while (false == eos && buffer.length() < len) { 
      fillBuffer(); 
     } 
     while (satisfied < len && buffer.length() > 0) { 
      char ch = buffer.charAt(0); 
      final char nextCh = buffer.length() > 1 ? buffer.charAt(1) : '\0'; 
      if (ch == '&' && nextCh == '#') { 
       final StringBuilder entity = new StringBuilder();
       // Since we're reading lines it's safe to assume entity is all
       // on one line so next char will/could be the hex char
       int index = 0;
       char entityCh = '\0';
       // Read whole entity, stopping at the end of the buffer if no ';' follows
       while (entityCh != ';' && index < buffer.length()) {
        entityCh = buffer.charAt(index++);
        entity.append(entityCh);
       }
       // if it's bad get rid of it and clean it from the buffer and point to next valid char
       if (entity.toString().equals("&#2;")) {
        buffer.delete(0, entity.length());
        continue;
       }
      } 
      if (XMLChar.isValid(ch)) { 
       satisfied++;
       cbuf[currentOffset++] = ch;
      } 
      buffer.deleteCharAt(0); 
     } 
     return satisfied; 
    } 

    @Override 
    public void close() throws IOException { 
     underlyingReader.close(); 
    } 

    public static void main(final String[] args) { 
     final File file = new File(
    <XML>); 
     final File outFile = new File(file.getParentFile(), file.getName() 
    .replace(".xml", ".cleaned.xml")); 
     Reader r = null; 
     Writer w = null; 
     try { 
      r = new IgnoreIllegalCharactersXmlReader(new FileInputStream(file)); 
      w = new OutputStreamWriter(new FileOutputStream(outFile),"UTF-8"); 
      IOUtils.copyLarge(r, w); 
      w.flush(); 
     } catch (Exception e) { 
      e.printStackTrace(); 
     } finally { 
      IOUtils.closeQuietly(r); 
      IOUtils.closeQuietly(w); 
     } 
    } 
} 
0

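Loosely based on a comment in the link from Stephen C's answer, and on Wikipedia's summary of the XML 1.1 character set, here is a Java method that shows how to detect illegal characters with regular expressions (the same character classes can be used with a regex replace to remove them):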

boolean isAllValidXmlChars(String s) {
    // xml 1.1 spec http://en.wikipedia.org/wiki/Valid_characters_in_XML
    // matches() must cover the whole string, so the class is repeated with '*'
    if (!s.matches("[\\u0001-\\uD7FF\\uE000-\\uFFFD\\x{10000}-\\x{10FFFF}]*")) {
        // at least one character is not in the valid ranges
        return false;
    }
    // find() reports whether any character of the class occurs anywhere in s
    if (java.util.regex.Pattern.compile("[\\u0001-\\u0008\\u000b-\\u000c\\u000E-\\u001F\\u007F-\\u0084\\u0086-\\u009F]").matcher(s).find()) {
        // a control character
        return false;
    }

    // "Characters allowed but discouraged"
    if (java.util.regex.Pattern.compile(
        "[\\uFDD0-\\uFDEF\\x{1FFFE}-\\x{1FFFF}\\x{2FFFE}-\\x{2FFFF}\\x{3FFFE}-\\x{3FFFF}\\x{4FFFE}-\\x{4FFFF}\\x{5FFFE}-\\x{5FFFF}\\x{6FFFE}-\\x{6FFFF}\\x{7FFFE}-\\x{7FFFF}\\x{8FFFE}-\\x{8FFFF}\\x{9FFFE}-\\x{9FFFF}\\x{AFFFE}-\\x{AFFFF}\\x{BFFFE}-\\x{BFFFF}\\x{CFFFE}-\\x{CFFFF}\\x{DFFFE}-\\x{DFFFF}\\x{EFFFE}-\\x{EFFFF}\\x{FFFFE}-\\x{FFFFF}\\x{10FFFE}-\\x{10FFFF}]"
    ).matcher(s).find()) {
        return false;
    }

    return true;
}
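
For actually removing the characters rather than only checking for them, a regex replace along these lines might work. This is a sketch under assumptions: it uses the stricter XML 1.0 Char production from the spec linked in the question (not the XML 1.1 ranges used above), and the class name `XmlRegexStripper` and method `strip` are hypothetical names for this example.

```java
import java.util.regex.Pattern;

final class XmlRegexStripper {

    // Negated character class: matches any code point that is NOT a legal
    // XML 1.0 character (#x9 | #xA | #xD | #x20-#xD7FF | #xE000-#xFFFD | #x10000-#x10FFFF)
    private static final Pattern ILLEGAL_XML10 = Pattern.compile(
            "[^\\u0009\\u000A\\u000D\\u0020-\\uD7FF\\uE000-\\uFFFD\\x{10000}-\\x{10FFFF}]");

    // Returns a copy of s with every illegal character deleted
    static String strip(String s) {
        return ILLEGAL_XML10.matcher(s).replaceAll("");
    }
}
```

Compiling the pattern once and reusing it avoids recompiling the regex on every call, which `String.replaceAll` would do. The `\x{...}` code point syntax requires Java 7 or later.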