2010-08-16 79 views
0

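Extracting XML data from a gzip file using Apache Tika?

I'm working on a project where I need to extract XML (sitemap) data from a gz file using Apache Tika (I'm new to Tika). The file name is something like sitemap01.xml.gz. I can extract data from a plain text file or from HTML, but I don't know how to extract the XML from the gz file and then get the metadata and data out of that XML. I have been searching Google for the past two days.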

Do I need to use the delegateParser in Tika to extract the data from the XML? Please point me to some samples or articles.

Here is what I tried:

public void parseXml() throws IOException {
    Metadata metadata = new Metadata();
    ContentHandler handler = new BodyContentHandler();
    Parser parser = new AutoDetectParser();
    ParseContext context = new ParseContext();
    InputStream stream = this.getClass().getResourceAsStream("sitemap.xml.gz");
    try {
        parser.parse(stream, handler, metadata, context);
        for (int i = 0; i < metadata.names().length; i++) {
            String name = metadata.names()[i];
            System.out.println(name + " : " + metadata.get(name));
        }
        System.out.println(handler.toString());
    } catch (IOException e) {
        e.printStackTrace();
    } catch (SAXException e) {
        e.printStackTrace();
    } catch (TikaException e) {
        e.printStackTrace();
    } finally {
        if (stream != null) {
            stream.close();
        }
    }
}

Answers

1

What you're missing is setting a recursing parser on your ParseContext. You probably want something like this:

Parser parser = new AutoDetectParser(); 
ParseContext context = new ParseContext(); 
context.set(Parser.class, parser); 
parser.parse(....) 

By setting the parser on the ParseContext, you tell Tika what to call when it encounters an embedded document (such as the XML inside your GZip file).
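
If it helps to see the two layers spelled out, here is a minimal plain-JDK sketch of the same idea: unwrap the gzip container first, then hand the inner XML to an XML parser. This is not Tika's implementation and it uses no Tika classes. The names SitemapGzipReader, readLocs and gzip are hypothetical helpers made up for this illustration, and it assumes the sitemap uses unprefixed <loc> elements.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration only; not part of Apache Tika.
class SitemapGzipReader {

    /** Decompresses the gzip stream and collects the text of every <loc> element. */
    static List<String> readLocs(InputStream gzipped)
            throws IOException, SAXException, ParserConfigurationException {
        final List<String> locs = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private StringBuilder text; // non-null only while inside a <loc> element

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                // The default SAX factory is not namespace-aware, so qName is the raw tag name.
                if ("loc".equals(qName)) {
                    text = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (text != null) {
                    text.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("loc".equals(qName) && text != null) {
                    locs.add(text.toString().trim());
                    text = null;
                }
            }
        };
        // Outer layer: the gzip container. Inner layer: the XML document itself.
        try (InputStream xml = new GZIPInputStream(gzipped)) {
            SAXParserFactory.newInstance().newSAXParser().parse(xml, handler);
        }
        return locs;
    }

    /** Gzips a string in memory so the demo needs no file on disk. */
    static byte[] gzip(String s) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
            out.write(s.getBytes(StandardCharsets.UTF_8));
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample sitemap with two entries.
        String sitemap = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">"
                + "<url><loc>http://example.com/a</loc></url>"
                + "<url><loc>http://example.com/b</loc></url>"
                + "</urlset>";
        List<String> locs = readLocs(new ByteArrayInputStream(gzip(sitemap)));
        System.out.println(locs);
    }
}
```

For a real file you would pass a FileInputStream for something like sitemap01.xml.gz to readLocs instead of the in-memory bytes. With Tika, the AutoDetectParser plus the context.set(...) call above is what takes care of this unwrapping for you.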

0

Here is how you can use the XML parser from Apache Tika:

// Open the gzipped sitemap
BodyContentHandler handler = new BodyContentHandler(-1); // -1 disables the write limit
Metadata metadata = new Metadata();
File inFile = new File("sitemap.xml.gz");
System.out.println(inFile.isFile());
// XMLParser expects plain XML, so decompress first (java.util.zip.GZIPInputStream)
InputStream inputstream = new GZIPInputStream(new FileInputStream(inFile));
ParseContext pcontext = new ParseContext();

// XML parser
XMLParser xmlparser = new XMLParser();
xmlparser.parse(inputstream, handler, metadata, pcontext);
System.out.println(pcontext.toString());

// handler now holds all the text content of the XML file, with the tags removed
System.out.println("Contents of the document:" + handler.toString());
System.out.println("Metadata of the document:");
String[] metadataNames = metadata.names();

for (String name : metadataNames) {
    System.out.println(name + ": " + metadata.get(name));
}