Javax XML parser hangs when built from an HTTP input stream

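I am trying to open an HTTP connection to a website and parse the HTML into an org.w3c.dom.Document. I can open the HTTP connection and print the page to the console just fine, but if I pass the InputStream object to the XML parser, it hangs for about a minute and then outputs this error: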

[Fatal Error] :108:55: Open quote is expected for attribute "{1}" associated with an element type "onload".

Code:

private static Document getInputStream(String url) throws IOException, SAXException, ParserConfigurationException 
{ 
    System.out.println(url); 
    URL webUrl = new URL(url); 
    URLConnection connection = webUrl.openConnection(); 
    connection.setConnectTimeout(60 * 1000); 
    connection.setReadTimeout(60 * 1000); 

    InputStream stream = connection.getInputStream(); 

    DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance(); 
    domFactory.setNamespaceAware(true); 
    DocumentBuilder builder = domFactory.newDocumentBuilder(); 
    Document doc = builder.parse(stream); // This line is hanging 
    return doc; 
}

Stack trace when the program is paused during the hang:

Thread [main] (Suspended) 
    SocketInputStream.socketRead0(FileDescriptor, byte[], int, int, int) line: not available [native method]  
    SocketInputStream.read(byte[], int, int) line: not available  
    BufferedInputStream.fill() line: not available 
    BufferedInputStream.read1(byte[], int, int) line: not available 
    BufferedInputStream.read(byte[], int, int) line: not available 
    HttpClient.parseHTTPHeader(MessageHeader, ProgressSource, HttpURLConnection) line: not available  
    HttpClient.parseHTTP(MessageHeader, ProgressSource, HttpURLConnection) line: not available 
    HttpURLConnection.getInputStream() line: not available 
    XMLEntityManager.setupCurrentEntity(String, XMLInputSource, boolean, boolean) line: not available 
    XMLEntityManager.startEntity(String, XMLInputSource, boolean, boolean) line: not available 
    XMLEntityManager.startDTDEntity(XMLInputSource) line: not available 
    XMLDTDScannerImpl.setInputSource(XMLInputSource) line: not available  
    XMLDocumentScannerImpl$DTDDriver.dispatch(boolean) line: not available 
    XMLDocumentScannerImpl$DTDDriver.next() line: not available 
    XMLDocumentScannerImpl$PrologDriver.next() line: not available 
    XMLNSDocumentScannerImpl(XMLDocumentScannerImpl).next() line: not available 
    XMLNSDocumentScannerImpl.next() line: not available 
    XMLNSDocumentScannerImpl(XMLDocumentFragmentScannerImpl).scanDocument(boolean) line: not available 
    XIncludeAwareParserConfiguration(XML11Configuration).parse(boolean) line: not available 
    XIncludeAwareParserConfiguration(XML11Configuration).parse(XMLInputSource) line: not available 
    DOMParser(XMLParser).parse(XMLInputSource) line: not available 
    DOMParser.parse(InputSource) line: not available  
    DocumentBuilderImpl.parse(InputSource) line: not available 
    DocumentBuilderImpl(DocumentBuilder).parse(InputStream) line: not available 
    MSCommunicator.getInputStream(String) line: 45 
    MSCommunicator.getGamePageFromForum(int, int, int) line: 70 
    MSCommunicator.getGamePageFromForum(int, int) line: 57 
    Game.<init>(int, int) line: 21 
    MSCommunicator.main(String[]) line: 26


2012-10-15 Akron

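You can't really expect to parse HTML into an XML DOM tree. HTML is not necessarily well-formed XML, so you may need to clean it up first. See the answers to this question: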

Reading HTML file to DOM tree using Java
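
Separately from the well-formedness problem, the stack trace in the question shows XMLEntityManager.startDTDEntity calling HttpURLConnection.getInputStream, which suggests the parser is blocked fetching the external DTD named in the page's DOCTYPE rather than reading the page itself. A minimal sketch, assuming that is the cause of the hang: install an EntityResolver that answers every external entity request with an empty source. The class name NoDtdFetch, the helper parseWithoutExternalDtd, and the sample DOCTYPE are illustrative and not taken from the original posts.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class NoDtdFetch {

    // Hypothetical helper: parses markup without fetching any external DTD or entity.
    static Document parseWithoutExternalDtd(String markup) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            // Answer every external entity request with an empty source,
            // so the parser never opens a network connection of its own.
            builder.setEntityResolver(
                    (publicId, systemId) -> new InputSource(new StringReader("")));
            return builder.parse(
                    new ByteArrayInputStream(markup.getBytes(StandardCharsets.UTF_8)));
        } catch (IOException | SAXException | ParserConfigurationException e) {
            // Wrapped only to keep the sketch short; real code may prefer checked exceptions.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Assumed sample input: a small XHTML page with an external DOCTYPE.
        String page = "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" "
                + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">"
                + "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body><p>ok</p></body></html>";
        Document doc = parseWithoutExternalDtd(page);
        System.out.println(doc.getDocumentElement().getLocalName());
    }
}
```

This only avoids the network wait. Markup that is not well-formed XML will still be rejected, and because the DTD is replaced by an empty one, pages that rely on DTD-declared entities such as &nbsp; will probably fail with an undeclared-entity error.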


2012-10-15 07:53:31 artbristol

Even if the HTML page you retrieve is correct, valid HTML, it may not be well-formed XML. For example, this is valid HTML4:

<p class=myclass>Paragraph<br>Next line</p>

Whereas in XML (XHTML), this is what is considered valid:

<p class="myclass">Paragraph<br/>Next line</p>

Note the self-closing <br/> tag and the quotes around the value of the class attribute on the p tag.
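
The difference between the two snippets above can be checked with the JDK's own parser. The following is a minimal sketch; the class name WellFormedCheck and the helper isWellFormedXml are illustrative names, not part of any library.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

public class WellFormedCheck {

    // Hypothetical helper: true if the markup parses as well-formed XML, false otherwise.
    static boolean isWellFormedXml(String markup) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Replace the default handler, which prints "[Fatal Error] ..." to stderr.
            builder.setErrorHandler(new ErrorHandler() {
                @Override
                public void warning(SAXParseException e) { }

                @Override
                public void error(SAXParseException e) throws SAXException { throw e; }

                @Override
                public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            builder.parse(new ByteArrayInputStream(markup.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (SAXException e) {
            // The parser rejected the markup as not well-formed.
            return false;
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String html4 = "<p class=myclass>Paragraph<br>Next line</p>";
        String xhtml = "<p class=\"myclass\">Paragraph<br/>Next line</p>";
        System.out.println("HTML4 snippet well-formed? " + isWellFormedXml(html4));
        System.out.println("XHTML snippet well-formed? " + isWellFormedXml(xhtml));
    }
}
```

The HTML4 snippet should be rejected at the unquoted attribute value, the same kind of "Open quote is expected" failure reported in the question, while the XHTML snippet should parse.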

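Also, the internet is a wild place, so the content you fetch is unlikely to be perfect. That is why you need to take everything with a grain of salt, even pages that claim to be well-formed, and why you will have to use an HTML tidier such as jTidy or nekoHTML.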


2012-10-15 08:00:59 ppeterka
