2012-09-25 84 views
4

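Lucene indexing of HTML files

Dear users, I am using Apache Lucene for indexing and searching. I have to index HTML files stored on the local computer's disk, covering both the file names and the contents of the files. I am able to store the file names in the Lucene index, but not the HTML file contents. The index should cover not only the text but the whole page, including image links and URLs. I would also like to know how to access the contents from the index. For indexing I am using the following code: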

File indexDir = new File(indexpath);
File dataDir = new File(datapath);
String suffix = ".htm";
IndexWriter indexWriter = new IndexWriter(
    FSDirectory.open(indexDir),
    new SimpleAnalyzer(),
    true,
    IndexWriter.MaxFieldLength.LIMITED);
indexWriter.setUseCompoundFile(false);
indexDirectory(indexWriter, dataDir, suffix);

numIndexed = indexWriter.maxDoc();
indexWriter.optimize();
indexWriter.close();


private void indexDirectory(IndexWriter indexWriter, File dataDir, String suffix) throws IOException { 
    try { 
     for (File f : dataDir.listFiles()) { 
      if (f.isDirectory()) { 
       indexDirectory(indexWriter, f, suffix); 
      } else { 
       indexFileWithIndexWriter(indexWriter, f, suffix); 
      } 
     } 
    } catch (Exception ex) { 
     System.out.println("exception 2 is" + ex); 
    } 
} 

private void indexFileWithIndexWriter(IndexWriter indexWriter, File f, 
    String suffix) throws IOException { 
    try { 
     if (f.isHidden() || f.isDirectory() || !f.canRead() || !f.exists()) { 
      return; 
     } 
     if (suffix != null && !f.getName().endsWith(suffix)) { 
      return; 
     } 
     Document doc = new Document(); 
     doc.add(new Field("contents", new FileReader(f))); 
     // java.io.File has no getFileName(); getName() returns the file name.
     doc.add(new Field("filename", f.getName(), 
       Field.Store.YES, Field.Index.ANALYZED)); 
     indexWriter.addDocument(doc); 
    } catch (Exception ex) { 
     System.out.println("exception 4 is" + ex); 
    } 
} 

Thanks in advance.

Answers

9

This line of code is the reason your content is not being stored:

doc.add(new Field("contents", new FileReader(f))); 

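The Field(String, Reader) constructor used here tokenizes and indexes the text but does not store it, so the contents cannot be read back from the index.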
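One way around this is to read the file into a String first and pass that String to the four-argument Field constructor that the question already uses for "filename". The following is only a minimal sketch: the class name HtmlFileText is hypothetical, the file is assumed to be UTF-8 encoded, and the Lucene call is shown as a comment so that the helper itself needs nothing beyond the JDK (Java 7 or later).

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

/** Hypothetical helper (not part of the original post): reads a whole file into a String. */
final class HtmlFileText {

    private HtmlFileText() {}

    /** Reads every byte of the file and decodes it with the given charset. */
    static String read(File file, Charset charset) throws IOException {
        byte[] bytes = Files.readAllBytes(file.toPath());
        return new String(bytes, charset);
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path of an HTML file; UTF-8 is an assumption about its encoding.
        String text = read(new File(args[0]), StandardCharsets.UTF_8);
        System.out.println(text.length() + " characters read");
        // With the Lucene 3.x API from the question, the text could then be
        // indexed and stored in the same way as the "filename" field:
        //   doc.add(new Field("contents", text, Field.Store.YES, Field.Index.ANALYZED));
    }
}
```

Storing the raw page this way keeps the markup, including image links and URLs, retrievable from the index. The trade-off is that the analyzer also sees the tags, which is one reason to extract the text with an HTML parser before indexing.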

If you are trying to index HTML files, try using JTidy. It will make the process much easier.

Sample code:

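import java.io.InputStream;

import org.apache.lucene.document.Field;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.w3c.dom.Text;
import org.w3c.tidy.Tidy;

// DocumentHandlerException is not part of Lucene or JTidy; it must be defined elsewhere in the project.
public class JTidyHTMLHandler { 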

    public org.apache.lucene.document.Document getDocument(InputStream is) throws DocumentHandlerException { 
     Tidy tidy = new Tidy(); 
     tidy.setQuiet(true); 
     tidy.setShowWarnings(false); 
     org.w3c.dom.Document root = tidy.parseDOM(is, null); 
     Element rawDoc = root.getDocumentElement(); 

     org.apache.lucene.document.Document doc = 
       new org.apache.lucene.document.Document(); 

     String body = getBody(rawDoc); 

     if ((body != null) && (!body.equals(""))) { 
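      // Field.Store.NO indexes the text without storing it; use Field.Store.YES
      // if the contents must be retrievable from the index.
      doc.add(new Field("contents", body, Field.Store.NO, Field.Index.ANALYZED)); 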
     } 

     return doc; 
    } 

    protected String getTitle(Element rawDoc) { 
     if (rawDoc == null) { 
      return null; 
     } 

     String title = ""; 

     NodeList children = rawDoc.getElementsByTagName("title"); 
     if (children.getLength() > 0) { 
      Element titleElement = ((Element) children.item(0)); 
      Text text = (Text) titleElement.getFirstChild(); 
      if (text != null) { 
       title = text.getData(); 
      } 
     } 
     return title; 
    } 

    protected String getBody(Element rawDoc) { 
     if (rawDoc == null) { 
      return null; 
     } 

     String body = ""; 
     NodeList children = rawDoc.getElementsByTagName("body"); 
     if (children.getLength() > 0) { 
      body = getText(children.item(0)); 
     } 
     return body; 
    } 

    protected String getText(Node node) { 
     NodeList children = node.getChildNodes(); 
     StringBuilder sb = new StringBuilder(); 
     for (int i = 0; i < children.getLength(); i++) { 
      Node child = children.item(i); 
      switch (child.getNodeType()) { 
       case Node.ELEMENT_NODE: 
        sb.append(getText(child)); 
        sb.append(" "); 
        break; 
       case Node.TEXT_NODE: 
        sb.append(((Text) child).getData()); 
        break; 
      } 
     } 
     return sb.toString(); 
    } 
} 
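
The question also asks for image links and URLs to be indexed, which the handler above does not cover. A minimal sketch of one possible approach follows; it is not part of the original answer. It walks the same kind of org.w3c.dom tree and collects attribute values such as href and src. The class name LinkExtractor is hypothetical, and in the demo the JDK's XML parser stands in for tidy.parseDOM, so the sample input has to be well-formed. Tag names are matched as given, so their case has to match what the parser produces.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical helper (not in the original answer): collects link targets from a DOM tree. */
final class LinkExtractor {

    private LinkExtractor() {}

    /** Returns the non-empty values of attribute attr on every descendant element named tag. */
    static List<String> attributeValues(Element root, String tag, String attr) {
        List<String> values = new ArrayList<>();
        if (root == null) {
            return values;
        }
        NodeList nodes = root.getElementsByTagName(tag);
        for (int i = 0; i < nodes.getLength(); i++) {
            // getAttribute returns an empty string when the attribute is absent.
            String value = ((Element) nodes.item(i)).getAttribute(attr);
            if (!value.isEmpty()) {
                values.add(value);
            }
        }
        return values;
    }

    public static void main(String[] args) throws Exception {
        // The JDK parser is used here only for the demo; with JTidy the root
        // element would come from tidy.parseDOM(is, null).getDocumentElement().
        String xhtml = "<html><body><a href=\"page2.htm\">next</a>"
                + "<img src=\"logo.png\"/></body></html>";
        Document dom = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        Element root = dom.getDocumentElement();
        System.out.println(attributeValues(root, "a", "href"));
        System.out.println(attributeValues(root, "img", "src"));
    }
}
```

Each collected value could then be added to the Lucene document as its own field, for example under an assumed field name such as "link" with Field.Store.YES and Field.Index.NOT_ANALYZED, so that URLs are stored and matched as whole terms.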

To get an InputStream from a URL:

URL url = new URL(htmlURLlocation);
URLConnection connection = url.openConnection();
InputStream stream = connection.getInputStream();

To get an InputStream from a file:

InputStream stream = new FileInputStream(new File(htmlFile));

+0

Sir, where is the Text class? Tidy is unable to use it. How can I give a file location to the InputStream object? Thanks and regards – adesh

+0

The Text class is org.w3c.dom.Text. It ships with Java.
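
For reference, a small sketch of using org.w3c.dom.Text with nothing but the JDK. The class name TextNodeDemo is hypothetical, and the JDK's own XML parser is used here instead of JTidy. Unlike the direct cast in getTitle above, the helper checks the node type first, so a first child that is not a text node does not cause a ClassCastException.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;

/** Hypothetical demo class: reads character data through org.w3c.dom.Text. */
final class TextNodeDemo {

    private TextNodeDemo() {}

    /** Returns the character data of the node's first child if it is a Text node, else "". */
    static String firstText(Node node) {
        Node first = node.getFirstChild();
        // instanceof is false for null, so an empty element also yields "".
        return (first instanceof Text) ? ((Text) first).getData() : "";
    }

    public static void main(String[] args) throws Exception {
        Document dom = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader("<title>Hello</title>")));
        System.out.println(firstText(dom.getDocumentElement()));
    }
}
```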

+0

Edited the answer to show how to get an InputStream from a file location.