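Filtering illegal XML characters in Java
The XML specification defines the subset of Unicode characters that is allowed in XML documents: http://www.w3.org/TR/REC-xml/#charsets.
How can I filter these characters out of a String in Java?
A simple test case:
Assert.equals("", filterIllegalXML(""+Character.valueOf((char) 2)))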
Finding all the invalid characters for XML is not trivial. You need to call, or reimplement, XMLChar.isInvalid() from Xerces:
http://kickjava.com/src/org/apache/xerces/util/XMLChar.java.htm
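If pulling in Xerces is not an option, the Char production of the XML 1.0 spec is short enough to reimplement by hand. What follows is only a minimal sketch under that assumption, not the Xerces code: the names XmlCharFilter, isLegalXml10 and filterIllegalXml are made up for illustration, and only XML 1.0 is covered.

```java
final class XmlCharFilter {

    /** True if the code point matches the XML 1.0 Char production. */
    static boolean isLegalXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns a copy of the input without the code points XML 1.0 forbids. */
    static String filterIllegalXml(String in) {
        if (in == null || in.isEmpty()) {
            return in;
        }
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); ) {
            // An unpaired surrogate is returned as itself and fails the check below
            int cp = in.codePointAt(i);
            if (isLegalXml10(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

Iterating by code point rather than by char keeps valid surrogate pairs (characters above U+FFFF) intact while still dropping lone surrogates.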
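+1, nice find.. – Bozho 2010-05-24 13:53:04
That class is quite involved [read: hard to understand, thanks anyway to the machine-generated part of it], and it requires instantiating and pre-populating a 64K array of CHARS... – rogerdpack 2014-12-09 21:16:49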
Using StringEscapeUtils.escapeXml(xml) from commons-lang will escape, not filter, the characters.
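I already use this method to escape entities (e.g. '<' to '&lt;'), but that is something different. The method does not seem to filter any illegal characters. My 'test case' fails. – 2010-05-24 13:06:37
Show the test case. – Bozho 2010-05-24 13:09:25
As above: 'assertEquals("", StringEscapeUtils.escapeXml("" + Character.valueOf((char) 2)));' – 2010-05-24 13:14:00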
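You can use a regex (regular expression) to do the job; see the comment here.
This page includes an example of a Java method that strips invalid XML characters by testing whether each character is within the spec, although it does not check for the highly discouraged characters.
Incidentally, escaping the characters is not a solution, since the XML 1.0 and 1.1 specs do not allow the invalid characters in escaped form either.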
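To make the regex idea concrete, here is a rough sketch that removes everything outside the XML 1.0 Char production with a single negated character class. The class and method names (XmlRegexFilter, stripIllegalXml) are invented for this example, and the \x{...} syntax assumes Java 7 or later.

```java
import java.util.regex.Pattern;

final class XmlRegexFilter {

    // Negated class: matches any code point that XML 1.0 does not allow
    private static final Pattern ILLEGAL_XML10 = Pattern.compile(
            "[^\\x09\\x0A\\x0D\\x20-\\uD7FF\\uE000-\\uFFFD\\x{10000}-\\x{10FFFF}]");

    /** Returns the input with all characters that are illegal in XML 1.0 removed. */
    static String stripIllegalXml(String in) {
        return ILLEGAL_XML10.matcher(in).replaceAll("");
    }
}
```

Compiling the pattern once and reusing it avoids the recompilation that String.replaceAll would do on every call.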
The link is dead... it looks like this might be the new URL? http://benjchristensen.com/2008/02/07/how-to-strip-invalid-xml-characters/ – Michael 2012-01-27 15:05:32
Link updated - thanks – 2012-01-28 01:03:32
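Here is a solution that takes care of the raw character as well as the escaped character in the stream, and works in front of a StAX or SAX parser. It needs extending for the other invalid characters, but you get the idea: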
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UnsupportedEncodingException;
import java.io.Writer;
import org.apache.commons.io.IOUtils;
import org.apache.xerces.util.XMLChar;
public class IgnoreIllegalCharactersXmlReader extends Reader {
    private final BufferedReader underlyingReader;
    private final StringBuilder buffer = new StringBuilder(4096);
    private boolean eos = false;

    public IgnoreIllegalCharactersXmlReader(final InputStream is) throws UnsupportedEncodingException {
        underlyingReader = new BufferedReader(new InputStreamReader(is, "UTF-8"));
    }

    // Appends the next input line to the buffer, or flags end of stream
    private void fillBuffer() throws IOException {
        final String line = underlyingReader.readLine();
        if (line == null) {
            eos = true;
            return;
        }
        buffer.append(line);
        buffer.append('\n');
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        if (buffer.length() == 0 && eos) {
            return -1;
        }
        int satisfied = 0;
        int currentOffset = off;
        while (!eos && buffer.length() < len) {
            fillBuffer();
        }
        while (satisfied < len && buffer.length() > 0) {
            char ch = buffer.charAt(0);
            final char nextCh = buffer.length() > 1 ? buffer.charAt(1) : '\0';
            if (ch == '&' && nextCh == '#') {
                // Since we're reading lines it's safe to assume the entity is all on one line
                final int end = buffer.indexOf(";");
                // If it's bad, remove it from the buffer and move on to the next char.
                // Only the reference for (char) 2 is handled; the literal was lost in the
                // post and is assumed here to have been "&#2;". Extend for other characters.
                if (end > 0 && buffer.substring(0, end + 1).equals("&#2;")) {
                    buffer.delete(0, end + 1);
                    continue;
                }
            }
            if (XMLChar.isValid(ch)) {
                satisfied++;
                cbuf[currentOffset++] = ch;
            }
            buffer.deleteCharAt(0);
        }
        return satisfied;
    }

    @Override
    public void close() throws IOException {
        underlyingReader.close();
    }

    public static void main(final String[] args) {
        // "<XML>" is the placeholder from the original post: put the path of the input file here
        final File file = new File("<XML>");
        final File outFile = new File(file.getParentFile(), file.getName()
                .replace(".xml", ".cleaned.xml"));
        Reader r = null;
        Writer w = null;
        try {
            r = new IgnoreIllegalCharactersXmlReader(new FileInputStream(file));
            w = new OutputStreamWriter(new FileOutputStream(outFile), "UTF-8");
            IOUtils.copyLarge(r, w);
            w.flush();
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            IOUtils.closeQuietly(r);
            IOUtils.closeQuietly(w);
        }
    }
}
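Loosely based on a comment on the link from Stephen C's answer, and on Wikipedia for the XML 1.1 spec, here is a Java method that uses regular expressions to test whether a string contains only valid characters; the same character classes can be passed to replaceAll to remove the illegal ones: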
// requires: import java.util.regex.Pattern;
boolean isAllValidXmlChars(String s) {
    // xml 1.1 spec http://en.wikipedia.org/wiki/Valid_characters_in_XML
    // matches() tests the whole string, so the class needs '*' to cover every character
    if (!s.matches("[\\u0001-\\uD7FF\\uE000-\\uFFFD\\x{10000}-\\x{10FFFF}]*")) {
        // not in valid ranges
        return false;
    }
    // find() instead of matches(): a single offending character anywhere is enough
    if (Pattern.compile("[\\u0001-\\u0008\\u000b-\\u000c\\u000E-\\u001F\\u007F-\\u0084\\u0086-\\u009F]").matcher(s).find()) {
        // a control character
        return false;
    }
    // "Characters allowed but discouraged"
    if (Pattern.compile(
            "[\\uFDD0-\\uFDEF\\x{1FFFE}-\\x{1FFFF}\\x{2FFFE}-\\x{2FFFF}\\x{3FFFE}-\\x{3FFFF}\\x{4FFFE}-\\x{4FFFF}\\x{5FFFE}-\\x{5FFFF}\\x{6FFFE}-\\x{6FFFF}\\x{7FFFE}-\\x{7FFFF}\\x{8FFFE}-\\x{8FFFF}\\x{9FFFE}-\\x{9FFFF}\\x{AFFFE}-\\x{AFFFF}\\x{BFFFE}-\\x{BFFFF}\\x{CFFFE}-\\x{CFFFF}\\x{DFFFE}-\\x{DFFFF}\\x{EFFFE}-\\x{EFFFF}\\x{FFFFE}-\\x{FFFFF}\\x{10FFFE}-\\x{10FFFF}]"
            ).matcher(s).find()) {
        return false;
    }
    return true;
}
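Why are you getting these "illegal" XML characters? Once you have found them, what do you want to do with them? Remove? Replace? – 2010-05-24 13:11:59
@RH: Ignoring them would be enough. The best solution would be to remove them and get some kind of report, so that I can log a warning. – 2010-05-24 13:15:47
In case anyone is wondering, I used Xerces' 'XMLChar', as ZZ Coder suggested. You can find the whole method here: http://pastebin.com/6Vbm1zuC – 2010-05-25 06:15:58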
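For the "remove them and get some kind of report" variant mentioned in the comments, something along these lines might do. This is a sketch only and not the pastebin method: ReportingXmlFilter, Result and filter are hypothetical names, and the validity check is a hand-written XML 1.0 test rather than Xerces' XMLChar.

```java
import java.util.ArrayList;
import java.util.List;

final class ReportingXmlFilter {

    /** Cleaned text plus one entry per code point that was dropped. */
    static final class Result {
        final String cleaned;
        final List<String> removed;

        Result(String cleaned, List<String> removed) {
            this.cleaned = cleaned;
            this.removed = removed;
        }
    }

    /** True if the code point matches the XML 1.0 Char production. */
    static boolean isLegalXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Drops illegal code points and records where each one was found. */
    static Result filter(String in) {
        StringBuilder out = new StringBuilder(in.length());
        List<String> removed = new ArrayList<>();
        for (int i = 0; i < in.length(); ) {
            int cp = in.codePointAt(i);
            if (isLegalXml10(cp)) {
                out.appendCodePoint(cp);
            } else {
                // Keep enough detail for the caller to log a useful warning
                removed.add(String.format("U+%04X at index %d", cp, i));
            }
            i += Character.charCount(cp);
        }
        return new Result(out.toString(), removed);
    }
}
```

The caller can then log a warning whenever the removed list is non-empty and carry on with the cleaned string.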