2012-08-23 39 views
7

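Term Vector frequency in Lucene 4.0

I am upgrading from Lucene 3.6 to Lucene 4.0-beta. In Lucene 3.x, IndexReader had a method, IndexReader.getTermFreqVectors(), which I could use to extract the frequency of each term in a given document and field.

This method has now been replaced by IndexReader.getTermVectors(), which returns Terms. How can I use this (or some other method) to extract the term frequencies for a document and field?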

+0

Related: http://stackoverflow.com/questions/13537126/term-frequency-in-lucene-4-0?rq=1 and http://stackoverflow.com/questions/1844194/get-cosine-similarity-between-two-documents-in-lucene –

Answers

1

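There is various documentation on how to use the flexible indexing APIs:

Accessing the term vectors for a document's fields and terms uses exactly the same API you use to access the postings lists, because a term vector is really just a miniature inverted index for that one document.

So it is fine to use all of those examples as they are, although you can take some shortcuts because you know this "miniature inverted index" only ever contains one document. For example, if you only want the frequency of a term, you can use the aggregate statistics such as totalTermFreq (see https://builds.apache.org/job/Lucene-Artifacts-4.x/javadoc/core/org/apache/lucene/index/package-summary.html#stats) rather than actually opening a DocsEnum that would enumerate just a single document.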
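To make the "miniature inverted index" idea concrete, here is a minimal plain-Java sketch. It does not use Lucene at all: the class TermVectorSketch, its methods, and the whitespace-and-punctuation tokenizer are hypothetical stand-ins chosen only for illustration. It models one field of one document as a sorted term-to-frequency map and shows why, with a single document, the aggregate total frequency of a term is the same number as its in-document frequency.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java model of a term vector; this is NOT Lucene API.
final class TermVectorSketch {

    // Builds the "miniature inverted index" for one field of one document:
    // each distinct term maps to the number of times it occurs.
    static Map<String, Integer> termVector(String fieldText) {
        Map<String, Integer> vector = new TreeMap<>(); // sorted by term
        for (String token : fieldText.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!token.isEmpty()) {
                vector.merge(token, 1, Integer::sum);
            }
        }
        return vector;
    }

    // With exactly one document, the total frequency of a term across the
    // whole "index" equals its frequency inside that document.
    static int totalTermFreq(Map<String, Integer> vector, String term) {
        return vector.getOrDefault(term, 0);
    }

    public static void main(String[] args) {
        Map<String, Integer> vector = termVector("to be or not to be");
        System.out.println(vector);
        System.out.println(totalTermFreq(vector, "to"));
    }
}
```

In real Lucene code the map is replaced by the Terms and TermsEnum objects returned for the document, but the shortcut is the same: no per-document enumeration is needed to read a frequency when only one document is present.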

3

See this related question, specifically:

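// Fetch the term vector for one field of one document (CONTENT is the field name).
// getTermVector returns null if no term vector was stored for that field.
Terms vector = reader.getTermVector(docId, CONTENT);
TermsEnum termsEnum = vector.iterator(null);
Map<String, Integer> frequencies = new HashMap<>();
BytesRef text = null;
while ((text = termsEnum.next()) != null) {
    String term = text.utf8ToString();
    // The vector covers a single document, so totalTermFreq() is the in-document frequency.
    int freq = (int) termsEnum.totalTermFreq();
    frequencies.put(term, freq);
    // `terms` is not declared in this snippet; presumably a collection of term strings defined elsewhere.
    terms.add(term);
}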
+0

In the last step, what is the variable "terms"? –

0

I have this working against my Lucene 4.2 index. Here is a small test program that works for me.

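try {
    // Open one reader per index directory (test1..test3 hold the index paths).
    directory[0] = new SimpleFSDirectory(new File(test1));
    directory[1] = new SimpleFSDirectory(new File(test2));
    directory[2] = new SimpleFSDirectory(new File(test3));
    directoryReader[0] = DirectoryReader.open(directory[0]);
    directoryReader[1] = DirectoryReader.open(directory[1]);
    directoryReader[2] = DirectoryReader.open(directory[2]);

    // Refresh the third reader if its index has changed since it was opened.
    if (!directoryReader[2].isCurrent()) {
        DirectoryReader changed = DirectoryReader.openIfChanged(directoryReader[2]);
        if (changed != null) { // openIfChanged returns null when nothing changed
            directoryReader[2] = changed;
        }
    }
    // Treat the three indexes as one logical index.
    MultiReader mr = new MultiReader(directoryReader);

    // Ask for the 100 highest-frequency terms; the third argument is a field name.
    TermStats[] stats = null;
    try {
        stats = HighFreqTerms.getHighFreqTerms(mr, 100, "My Term");
    } catch (Exception e1) {
        e1.printStackTrace();
        return;
    }

    for (TermStats termstat : stats) {
        System.out.println("IBI_body: " + termstat.termtext.utf8ToString() +
            ", docFrequency: " + termstat.docFreq);
    }
} catch (IOException e) {
    // Opening a directory or a reader can throw IOException.
    e.printStackTrace();
}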
12

Maybe this will help you:

// get the term vector for one document and one field
Terms terms = reader.getTermVector(docID, "fieldName");

if (terms != null && terms.size() > 0) {
    // access the terms for this field
    TermsEnum termsEnum = terms.iterator(null);
    BytesRef term = null;

    // explore the terms for this field
    while ((term = termsEnum.next()) != null) {
        // enumerate through documents, in this case only one
        DocsEnum docsEnum = termsEnum.docs(null, null);
        int docIdEnum;
        while ((docIdEnum = docsEnum.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
            // get the term frequency in the document
            System.out.println(term.utf8ToString() + " " + docIdEnum + " " + docsEnum.freq());
        }
    }
}
+0

It helped me, at least! Thanks for these lines – lizzie