2013-06-20 86 views
1

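Can Lucene return search results with line numbers?

I want to use Lucene to implement a "find in files" feature similar to the one in an IDE. Basically I need to search source code files such as .c, .cpp, .h, .cs and .xml. I tried the demo shown on the Apache website. It returns a list of files, without line numbers or the number of occurrences in each file. I believe there should be some way to get them.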

Is there any way to get these details?

Answers

0

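I asked on many forums and got zero responses. In the end I took an idea from @Luca Mastrostefano's answer to get the line number details.

The Taginfo from the Lucene searcher returns the file name, and I think that is enough to get the line numbers. An index built this way does not store the files' original content, only the indexed terms, so line numbers cannot be read from it directly. I therefore assume the only option is to take that path, read the file, and work out the line numbers.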

    public static void PrintLines(string filepath, string key)
    {
        int counter = 1;
        string line;

        // Read the file line by line and print every line that contains the key,
        // prefixed with its 1-based line number.
        // "using" closes the reader even if reading throws.
        using (var file = new System.IO.StreamReader(filepath))
        {
            while ((line = file.ReadLine()) != null)
            {
                if (line.Contains(key))
                {
                    Console.WriteLine("\t" + counter + ": " + line);
                }
                counter++;
            }
        }
    }

Call this function with the path obtained from the Lucene searcher.
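For anyone doing the same from Java, here is a rough counterpart of the function above that also counts how many times the key occurs on each matching line, since the question asks for occurrence counts as well. This is only a sketch: the `FindInFile` class, the `Match` type, and the choice of UTF-8 are my own assumptions and not part of Lucene, and matching is a plain case-sensitive substring test rather than anything analyzer-aware.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not a Lucene class): rescans one file that a search returned.
final class FindInFile {

    /** One matching line: its 1-based number, occurrences of the key in it, and its text. */
    static final class Match {
        final int lineNumber;
        final int count;
        final String text;

        Match(int lineNumber, int count, String text) {
            this.lineNumber = lineNumber;
            this.count = count;
            this.text = text;
        }
    }

    /** Counts non-overlapping, case-sensitive occurrences of key in line. */
    static int countOccurrences(String line, String key) {
        if (key.isEmpty()) {
            return 0;
        }
        int count = 0;
        int from = 0;
        while ((from = line.indexOf(key, from)) >= 0) {
            count++;
            from += key.length();
        }
        return count;
    }

    /** Returns every line of the file that contains key at least once. */
    static List<Match> find(Path file, String key) throws IOException {
        List<Match> matches = new ArrayList<>();
        int lineNumber = 0;
        // Assumption: the file is UTF-8 text; other encodings need another Charset.
        for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
            lineNumber++;
            int count = countOccurrences(line, key);
            if (count > 0) {
                matches.add(new Match(lineNumber, count, line));
            }
        }
        return matches;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: FindInFile <path> <key>");
            return;
        }
        int total = 0;
        for (Match m : find(Paths.get(args[0]), args[1])) {
            System.out.println("\t" + m.lineNumber + ": " + m.text);
            total += m.count;
        }
        System.out.println("Total occurrences: " + total);
    }
}
```

As with the C# version, you would call `find` once per path returned by the searcher. It reads the whole file into memory, which should be fine for source files but may not suit very large ones.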

1

Could you share the link to the demo shown on the Apache website?

Here I show how to get the frequency of a term in a given document:

public static void main(final String[] args) throws CorruptIndexException, 
      LockObtainFailedException, IOException { 

     // Create the index 
     final Directory directory = new RAMDirectory(); 
     final Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_36); 
     final IndexWriterConfig config = new IndexWriterConfig(
       Version.LUCENE_36, analyzer); 
     final IndexWriter writer = new IndexWriter(directory, config); 

     // addDoc(writer, field, text); 
     addDoc(writer, "title", "foo"); 
     addDoc(writer, "title", "buz qux"); 
     addDoc(writer, "title", "foo foo bar"); 

     // Search 
     final IndexReader reader = IndexReader.open(writer, false); 
     final IndexSearcher searcher = new IndexSearcher(reader); 

     final Term term = new Term("title", "foo"); 
     final Query query = new TermQuery(term); 
     System.out.println("Query: " + query.toString() + "\n"); 

     final int limitShow = 3; 
     final TopDocs td = searcher.search(query, limitShow); 
     final ScoreDoc[] hits = td.scoreDocs; 

     // Take IDs and frequencies. Use hits.length rather than td.totalHits:
     // totalHits counts every match, but scoreDocs holds at most limitShow.
     final int[] docIDs = new int[hits.length]; 
     for (int i = 0; i < hits.length; i++) { 
      docIDs[i] = hits[i].doc; 
     } 
     final Map<Integer, Integer> id2freq = getFrequencies(reader, term, 
       docIDs); 

     // Show results 
     for (int i = 0; i < hits.length; i++) { 
      final int docNum = hits[i].doc; 
      final Document doc = searcher.doc(docNum); 
      System.out.println("\tposition " + i); 
      System.out.println("Title: " + doc.get("title")); 
      final int freq = id2freq.get(docNum); 
      System.out.println("Occurrences of \"" + term.text() + "\" in \"" 
        + term.field() + "\" = " + freq); 
      System.out.println("--------------------------------\n"); 
     } 
     searcher.close(); 
     reader.close(); 
     writer.close(); 
    } 

Here we add documents to the index:

    private static void addDoc(final IndexWriter w, final String field,
            final String text) throws CorruptIndexException, IOException {
        final Document doc = new Document();
        // Note: the field is added twice, so every term in the text is indexed
        // twice per document. That is why the sample output reports 2 and 4.
        doc.add(new Field(field, text, Field.Store.YES, Field.Index.ANALYZED));
        doc.add(new Field(field, text, Field.Store.YES, Field.Index.ANALYZED));
        w.addDocument(doc);
    }

Here is an example of how to get the number of occurrences of a term in a document:

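    public static Map<Integer, Integer> getFrequencies(
            final IndexReader reader, final Term term, final int[] docIDs)
            throws CorruptIndexException, IOException {
        final Map<Integer, Integer> id2freq = new HashMap<Integer, Integer>();
        // TermDocs can only move forward, but hits arrive in score order,
        // so visit the IDs in ascending order (needs java.util.Arrays).
        final int[] sortedIDs = docIDs.clone();
        Arrays.sort(sortedIDs);
        final TermDocs tds = reader.termDocs(term);
        if (tds != null) {
            try {
                for (final int docID : sortedIDs) {
                    // Skip to the first entry whose doc number is >= docID
                    if (!tds.skipTo(docID)) {
                        break; // no further documents contain the term
                    }
                    // Record the term frequency only if we landed on docID itself
                    if (tds.doc() == docID) {
                        id2freq.put(docID, tds.freq());
                    }
                }
            } finally {
                tds.close();
            }
        }
        return id2freq;
    }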

If you put it all together and run it, you get output like this:

Query: title:foo 

    position 0 
Title: foo 
Occurrences of "foo" in "title" = 2 
-------------------------------- 

    position 1 
Title: foo foo bar 
Occurrences of "foo" in "title" = 4 
-------------------------------- 
+0

[Link] http://lucene.apache.org/core/4_3_1/demo/overview-summary.html#overview_description – Ganeshkumar

+0

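So far I have not written any code. I used the provided indexfile binary to create a Lucene index for a directory. Searching that index for a word returns the names of the files that contain it. But I need more information: the number of occurrences in each file and the matching line numbers. – Ganeshkumar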

+0

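The simplest solution is to index each line separately (with a common file_ID and a unique line_number), run the query, and inspect the results to extract the number of occurrences and the lines on which they occur. Otherwise, here [link](http://stackoverflow.com/questions/1311199/finding-the-position-of-search-hits-from-lucene) you can find something similar to what you want. –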
