2013-01-08 32 views
I want to read the index from my indexer files. How can I read and print a Lucene 4.0 index?

The result I want is the TF-IDF of each document and the counts of all terms.

Please recommend some sample code. Thx :)

Seems like you may be looking for something like this: http://stackoverflow.com/questions/2311845/is-it-possible-to-iterate-through-documents-stored-in-lucene-index

Answers

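The first thing is to get a list of documents. An alternative might be to iterate over the indexed terms, but the method IndexReader.terms() seems to have been removed in 4.0 (although it exists in AtomicReader, which is worth a look). The best way I know of to get all documents is to simply loop through the document ids: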

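//where reader is your IndexReader, however you go about opening/managing it
//IndexReader.isDeleted(int) is gone in 4.0; deletions are exposed as live-docs bits
//(needs org.apache.lucene.index.MultiFields and org.apache.lucene.util.Bits)
Bits liveDocs = MultiFields.getLiveDocs(reader); //null when the index has no deletions
for (int i=0; i<reader.maxDoc(); i++) {
    if (liveDocs != null && !liveDocs.get(i))
        continue;
    //operate on the document with id = i ...
}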
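
To make the difference between the id range and the live documents concrete, here is a minimal stand-alone sketch in plain Java. It does not use Lucene at all: a java.util.BitSet stands in for the index's live-document bits, and the names LiveDocsSketch and liveIds are made up for illustration.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Hypothetical stand-in for an index: ids run from 0 to maxDoc-1,
// and a bit set records which ids are still live (not deleted).
final class LiveDocsSketch {
    // Returns the ids of live documents, skipping deleted ones.
    static List<Integer> liveIds(BitSet live, int maxDoc) {
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < maxDoc; i++) {
            if (!live.get(i)) {
                continue; // deleted document: the id is still allocated but unusable
            }
            ids.add(i);
        }
        return ids;
    }

    public static void main(String[] args) {
        int maxDoc = 5;
        BitSet live = new BitSet(maxDoc);
        live.set(0, maxDoc); // all five documents start out live
        live.clear(2);       // simulate deleting document 2
        List<Integer> ids = liveIds(live, maxDoc);
        System.out.println("maxDoc=" + maxDoc + ", numDocs=" + ids.size() + ", live ids=" + ids);
    }
}
```

The loop always covers every id below maxDoc, while the number of ids it keeps corresponds to what numDocs() would report, which is why the two values can differ once documents have been deleted.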

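Then you need a list of all the indexed terms. I'm assuming we have no interest in stored fields, since the data you want doesn't make sense for them. To retrieve the terms, you can use IndexReader.getTermVectors(int). Note that I'm not actually retrieving the document, since we don't need to access it directly. Continuing from where we left off: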

String field;
FieldsEnum fieldsiterator;
TermsEnum termsiterator;
//To simplify, you can rely on DefaultSimilarity to calculate tf and idf for you.
DefaultSimilarity freqcalculator = new DefaultSimilarity();
//numDocs and maxDoc are not the same thing:
int numDocs = reader.numDocs();
int maxDoc = reader.maxDoc();
//Live-docs bits replace isDeleted() in 4.0; null means the index has no deletions.
Bits liveDocs = MultiFields.getLiveDocs(reader);

for (int i=0; i<maxDoc; i++) {
    if (liveDocs != null && !liveDocs.get(i))
        continue;
    Fields vectors = reader.getTermVectors(i);
    if (vectors == null)
        continue; //no term vectors were stored for this document
    fieldsiterator = vectors.iterator();
    while ((field = fieldsiterator.next()) != null) {
        termsiterator = fieldsiterator.terms().iterator(null);
        while (termsiterator.next() != null) {
            //i = document id, field = field name
            //String representation of the current term
            String termtext = termsiterator.term().utf8ToString();
            //Get idf, using docFreq from the terms enum.
            //I haven't tested this, and I'm not quite 100% sure of the context of this method.
            //If it doesn't work, idfalternate below should.
            int docfreq = termsiterator.docFreq();
            float idf = freqcalculator.idf(docfreq, numDocs);
            float idfalternate = freqcalculator.idf(reader.docFreq(new Term(field, termsiterator.term())), numDocs);
        }
    }
}
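
Since the goal is a TF-IDF value per term per document, the arithmetic itself can be sketched without Lucene. The formulas below are, as far as I know, the ones DefaultSimilarity applies (tf = sqrt(freq), idf = 1 + ln(numDocs / (docFreq + 1))); treat that as an assumption and check it against the Javadoc of your Lucene version. The class name TfIdfSketch and the sample numbers are made up for illustration.

```java
// Plain-Java sketch of classic Lucene-style tf and idf; no Lucene dependency.
final class TfIdfSketch {
    // Term frequency factor: square root of the raw count in one document.
    static double tf(long freq) {
        return Math.sqrt(freq);
    }

    // Inverse document frequency: 1 + ln(numDocs / (docFreq + 1)).
    static double idf(long docFreq, long numDocs) {
        return 1.0 + Math.log(numDocs / (double) (docFreq + 1));
    }

    // Combined weight of one term in one document.
    static double tfIdf(long freq, long docFreq, long numDocs) {
        return tf(freq) * idf(docFreq, numDocs);
    }

    public static void main(String[] args) {
        // Hypothetical numbers: the term occurs 4 times in this document,
        // appears in 9 documents overall, and the index holds 10 live documents.
        System.out.println(tfIdf(4, 9, 10));
    }
}
```

With these sample numbers the idf factor is exactly 1, so the result is 2.0. In the Lucene loop, freq would be the term's count inside the current document and docFreq the value returned by reader.docFreq(...) for that field and term.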