2015-09-26 64 views

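I have a text file in which each line is an entire US patent XML document. I am trying to parse it to extract certain features, such as the patent number. I have not used XPath before, so I borrowed some code from Ravi Thapliyal's answer to "Parse XML Simple String using Java XPath". However, the initial !DOCTYPE tag is apparently causing the DocumentBuilder to go looking for an actual DTD document somewhere, and the XPath code throws an IOException.

Here is my first attempt at the code: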

    // Convert the entire file to an ArrayList of strings, one patent document per line
    ArrayList<String> doc = new ArrayList<>();
    while (input.hasNext()) {
        doc.add(input.nextLine().trim());
    }

    int index = 0;
    while (index < doc.size()) {
        String xml = doc.get(index);
        XPathFactory xpathFactory = XPathFactory.newInstance();
        XPath xPath = xpathFactory.newXPath();
        InputSource source = new InputSource(new StringReader(xml));

        // db is presumably a DocumentBuilder created earlier (not shown)
        db.setEntityResolver(new EntityResolver() {
            public InputSource resolveEntity(String publicId, String systemId)
                    throws SAXException, java.io.IOException {
                if (systemId.contains("us-patent-grant-v40-2004-12-02.dtd")) {
                    // Hand the parser an empty DTD instead of the real file
                    return new InputSource(new StringReader(""));
                } else {
                    return null;
                }
            }
        });

        String orgName = "";
        try {
            orgName = (String) xPath.evaluate("/agents/adressbook/orgname", source, XPathConstants.STRING);
        } catch (Exception e) {
            e.printStackTrace();
        }

        System.out.println("Document #" + index + " Company: " + orgName);
        index++; // advance to the next line; without this the loop never terminates
    } // end while loop that goes through each line (patent document) in the file

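Each line of the input file begins with the following DOCTYPE declaration: <!DOCTYPE us-patent-grant SYSTEM "us-patent-grant-v40-2004-12-02.dtd" [ ]>

The line causing the problem (line 91) is: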

orgName = (String) xPath.evaluate("/agents/adressbook/orgname", 
     source,XPathConstants.STRING); 

And the stack trace is:

java.io.FileNotFoundException: C:\Users\Dave\Documents\NetBeansProjects\ParseXML\us-patent-grant-v40-2004-12-02.dtd (The system cannot find the file specified) 
    at java.io.FileInputStream.open(Native Method) 
    at java.io.FileInputStream.<init>(FileInputStream.java:131) 
    at java.io.FileInputStream.<init>(FileInputStream.java:87) 
    at sun.net.www.protocol.file.FileURLConnection.connect(FileURLConnection.java:90) 
    at sun.net.www.protocol.file.FileURLConnection.getInputStream(FileURLConnection.java:188) 
    at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.setupCurrentEntity(XMLEntityManager.java:616) 
Document #0 Company: 
    at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.startEntity(XMLEntityManager.java:1293) 
    at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.startDTDEntity(XMLEntityManager.java:1260) 
    at com.sun.org.apache.xerces.internal.impl.XMLDTDScannerImpl.setInputSource(XMLDTDScannerImpl.java:263) 
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$DTDDriver.dispatch(XMLDocumentScannerImpl.java:1164) 
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$DTDDriver.next(XMLDocumentScannerImpl.java:1050) 
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$PrologDriver.next(XMLDocumentScannerImpl.java:938) 
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:606) 
    at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:117) 
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:510) 
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:848) 
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:777) 
    at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141) 
    at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:243) 
    at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:348) 
    at com.sun.org.apache.xpath.internal.jaxp.XPathImpl.evaluate(XPathImpl.java:466) 
    at Parser.main(Parser.java:102) 
--------------- linked to ------------------ 
javax.xml.xpath.XPathExpressionException: java.io.FileNotFoundException: C:\Users\Dave\Documents\NetBeansProjects\ParseXML\us-patent-grant-v40-2004-12-02.dtd (The system cannot find the file specified) 
    at com.sun.org.apache.xpath.internal.jaxp.XPathImpl.evaluate(XPathImpl.java:473) 
    at Parser.main(Parser.java:102) 
Caused by: java.io.FileNotFoundException: C:\Users\Dave\Documents\NetBeansProjects\ParseXML\us-patent-grant-v40-2004-12-02.dtd (The system cannot find the file specified) 
    at java.io.FileInputStream.open(Native Method) 
    at java.io.FileInputStream.<init>(FileInputStream.java:131) 
    at java.io.FileInputStream.<init>(FileInputStream.java:87) 
    at sun.net.www.protocol.file.FileURLConnection.connect(FileURLConnection.java:90) 
    at sun.net.www.protocol.file.FileURLConnection.getInputStream(FileURLConnection.java:188) 
    at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.setupCurrentEntity(XMLEntityManager.java:616) 
    at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.startEntity(XMLEntityManager.java:1293) 
    at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.startDTDEntity(XMLEntityManager.java:1260) 
    at com.sun.org.apache.xerces.internal.impl.XMLDTDScannerImpl.setInputSource(XMLDTDScannerImpl.java:263) 
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$DTDDriver.dispatch(XMLDocumentScannerImpl.java:1164) 
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$DTDDriver.next(XMLDocumentScannerImpl.java:1050) 
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$PrologDriver.next(XMLDocumentScannerImpl.java:938) 
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:606) 
    at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:117) 
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:510) 
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:848) 
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:777) 
    at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141) 
    at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:243) 
    at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:348) 
    at com.sun.org.apache.xpath.internal.jaxp.XPathImpl.evaluate(XPathImpl.java:466) 

Can someone help me figure out what I should do to parse a document that is held in a string?

Answers

1

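Try setting a parser feature, or supply an empty EntityResolver.

For the feature, you need to find out which parser implementation you are using (features are implementation-specific).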

Make DocumentBuilder.parse ignore DTD references
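
One point the stack trace in the question makes visible: XPathImpl.evaluate calls DocumentBuilderImpl.parse itself, so when an InputSource is passed to xPath.evaluate, the parse is done by a builder the XPath implementation creates internally, and an EntityResolver set on a separate DocumentBuilder is never consulted. A minimal sketch of both suggestions follows, assuming the JDK's built-in Xerces-based parser (as the trace indicates). The class name DtdFreeXPath, the helper methods, and the sample XML are made up for illustration and are not a real patent record.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper class, for illustration only.
public class DtdFreeXPath {

    // Builds a DocumentBuilder that does not try to fetch the external DTD.
    public static DocumentBuilder newBuilder() throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setValidating(false);
        // Xerces-specific feature: skip loading the external DTD when not validating.
        dbf.setFeature(
            "http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
        DocumentBuilder db = dbf.newDocumentBuilder();
        // Second line of defence: answer any external entity request with empty content.
        db.setEntityResolver(
            (publicId, systemId) -> new InputSource(new StringReader("")));
        return db;
    }

    // Parses the string with our own builder, then runs XPath on the resulting DOM.
    public static String evaluate(String xml, String expression) throws Exception {
        Document doc = newBuilder().parse(new InputSource(new StringReader(xml)));
        XPath xPath = XPathFactory.newInstance().newXPath();
        return (String) xPath.evaluate(expression, doc, XPathConstants.STRING);
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample document; only the DOCTYPE mirrors the question.
        String xml = "<?xml version=\"1.0\"?>"
            + "<!DOCTYPE us-patent-grant SYSTEM \"us-patent-grant-v40-2004-12-02.dtd\" [ ]>"
            + "<us-patent-grant><agents><agent><addressbook>"
            + "<orgname>Example LLP</orgname>"
            + "</addressbook></agent></agents></us-patent-grant>";
        System.out.println(evaluate(xml, "//agents/agent/addressbook/orgname"));
    }
}
```

The key design choice is passing the parsed Document, rather than the InputSource, to xPath.evaluate, so that the builder configured here is the one that actually reads the XML.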

+0

I tried that and still get the same error. I have changed the original question to show the new code and stack trace, as requested. Thanks. –

+0

Did you try the builder.setEntityResolver code from the link? – Vovka

+0

Yes, I just tried it and got exactly the same stack trace. –

0

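Have you tried supplying the DTD file it is trying to reference, for example by downloading it from us-patent-application-v40-2004-12-02.dtd?

You could try putting this file in the same folder as the XML, or in the current directory of the parsing process (since you are in a hurry).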
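
If copying the DTD into the working directory is inconvenient, another option is to point the parser at a local copy explicitly. The following is only a sketch under assumptions: the class name LocalDtdResolver, the dtdDir directory, and the tiny sample.dtd used in main are hypothetical stand-ins, and the real DTD would have to be downloaded separately. As with any EntityResolver, it only helps if the DocumentBuilder it is set on is the one that performs the parse.

```java
import java.io.File;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

// Hypothetical resolver that serves DTDs from a local directory.
public class LocalDtdResolver implements EntityResolver {
    private final File dtdDir;

    public LocalDtdResolver(File dtdDir) {
        this.dtdDir = dtdDir;
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        // The parser passes the system identifier as a URI; keep only the file name.
        String name = systemId.substring(systemId.lastIndexOf('/') + 1);
        File local = new File(dtdDir, name);
        if (local.isFile()) {
            return new InputSource(local.toURI().toString());
        }
        return null; // not found locally: fall back to the parser's default lookup
    }

    // Parses an XML string, resolving external DTDs against dtdDir.
    public static Document parse(String xml, File dtdDir) throws Exception {
        DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        db.setEntityResolver(new LocalDtdResolver(dtdDir));
        return db.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        // Stand-in DTD written to a temp directory so the demo is self-contained.
        File dir = Files.createTempDirectory("dtds").toFile();
        String dtd = "<!ELEMENT root (#PCDATA)>\n<!ENTITY co \"Example LLP\">";
        Files.write(new File(dir, "sample.dtd").toPath(),
                    dtd.getBytes(StandardCharsets.UTF_8));
        Document doc = parse(
            "<!DOCTYPE root SYSTEM \"sample.dtd\"><root>&co;</root>", dir);
        System.out.println(doc.getDocumentElement().getTextContent());
    }
}
```

Unlike returning an empty DTD, this keeps any entities and default attribute values the DTD declares, which matters if the documents rely on them.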