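I'm writing a program that parses web pages (one of which I have no access to, so I can't modify it). Parsing that page with javax.xml.parsers.DocumentBuilder fails with a fatal error.
First I connect and use getContent() to get an InputStream for the page. No problems there.
But then, when parsing: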
public static int[] parseMoveGameList(InputStream is) throws ParserConfigurationException, IOException, SAXException {
    DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
    DocumentBuilder builder = dbf.newDocumentBuilder();
    Document doc = builder.parse(is);
    /*...*/
}
Here builder.parse throws:
org.xml.sax.SAXParseException; lineNumber: 3; columnNumber: 64; The system identifier must begin with either a single or double quote character.
at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:253)
at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:288)
at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:121)
at cs.ualberta.lgadapter.LGAdapter.parseMoveGameList(LGAdapter.java:78)
...
The page I'm parsing (but can't change) looks like this:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" >
<html>
<head>
<META http-equiv="Expires" content="0" />
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
<!-- ... -->
</head>
<body>
<!-- ... -->
</body>
</html>
How can I get past this exception?
I don't think it's a good idea to use an XML parser to parse HTML. – Alex 2012-08-10 17:01:47
Then what should I use? – dspyz 2012-08-10 17:04:42
http://stackoverflow.com/questions/9071568/parse-web-site-html-with-java – Alex 2012-08-10 17:07:02
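For context on the exception itself: in XML, a DOCTYPE that uses PUBLIC must give a quoted system identifier after the public identifier, and the HTML 4.01 doctype on this page omits it, which is what the parser reports at line 3, column 64. If switching to an HTML parser is not an option, one possible stopgap is to strip the DOCTYPE before handing the text to DocumentBuilder. The sketch below is only an illustration: the class and method names (DoctypeStrippingParser, parseWithoutDoctype) are hypothetical and not from the original post, it assumes the page is UTF-8 as its meta tag declares, and it only works if the rest of the page happens to be well-formed XML, which real-world HTML often is not.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class, not part of the original code.
final class DoctypeStrippingParser {

    // Reads the whole stream, drops the DOCTYPE declaration, and parses the rest as XML.
    static Document parseWithoutDoctype(InputStream is)
            throws ParserConfigurationException, IOException, SAXException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = is.read(chunk)) != -1) {
            buf.write(chunk, 0, n);
        }

        // Assumption: the page is UTF-8, as its Content-Type meta tag declares.
        String html = new String(buf.toByteArray(), StandardCharsets.UTF_8);

        // Remove the first <!DOCTYPE ...> declaration (case-insensitive).
        // This simple pattern assumes the DOCTYPE has no internal subset containing '>'.
        String stripped = html.replaceFirst("(?i)<!DOCTYPE[^>]*>", "");

        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(stripped)));
    }
}
```

In parseMoveGameList, the call builder.parse(is) could then be replaced by DoctypeStrippingParser.parseWithoutDoctype(is). Any unclosed tag or undeclared entity elsewhere in the page would still make the XML parser fail, so a dedicated HTML parser, as suggested in the comments, is likely the more robust route.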