2013-10-08

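Generating a term-document matrix with Lucene 4.4

I'm trying to create a term-document matrix for a small corpus so that I can experiment further with LSI. However, I can't find a way to do this with Lucene 4.4.

I know how to get the term vector of each document, as follows: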

//create boolean query to search for a specific document (not shown) 
TopDocs hits = searcher.search(query, 1);  
Terms termVector = reader.getTermVector(hits.scoreDocs[0].doc, "contents"); 
System.out.println(termVector.size()); //just testing 

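I thought I could simply union all the term vectors and use them as the columns of the matrix. However, the term vectors of different documents have different sizes, and I don't know how to fill zeros into a term vector for the terms a document doesn't contain. So that approach doesn't work.

  1. Could anyone tell me how to create a term-document matrix with Lucene 4.4? (Please show sample code if possible.)

  2. If Lucene doesn't support this, what other way of doing it would you recommend?

    Many thanks,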

Answer


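I found a solution to my problem here. Mr. Sujit gives a very detailed example, although his code is written for an older version of Lucene, so many things had to change. I'll update the details when I finish my code.

Here is my solution, which works with Lucene 4.4: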
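import java.io.File;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Assumed: Apache Commons Math 3 (the answer does not say which version it uses).
import org.apache.commons.math3.linear.Array2DRowRealMatrix;
import org.apache.commons.math3.linear.RealMatrix;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Fields;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiFields;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRef;

public class BuildTermDocumentMatrix {
    private final IndexReader reader;
    private final IndexSearcher searcher;
    private final File corpus;
    private final Map<String, Integer> termIdMap;

    // Note: computeTfIdfWeight(...) and countDocs(...) are called below
    // but are not shown in this answer.

    public BuildTermDocumentMatrix(File index, File corpus) throws IOException {
        reader = DirectoryReader.open(FSDirectory.open(index));
        searcher = new IndexSearcher(reader);
        this.corpus = corpus;
        termIdMap = computeTermIdMap(reader);
    }

    /**
     * Map each term to a fixed integer so that we can build the matrix later.
     * The integer is the term's row in the term-document matrix.
     */
    private Map<String, Integer> computeTermIdMap(IndexReader reader) throws IOException {
        Map<String, Integer> termIdMap = new HashMap<String, Integer>();
        int id = 0;
        Fields fields = MultiFields.getFields(reader);
        Terms terms = fields.terms("contents");
        TermsEnum itr = terms.iterator(null);
        BytesRef term = null;
        while ((term = itr.next()) != null) {
            String termText = term.utf8ToString();
            if (termIdMap.containsKey(termText)) {
                continue;
            }
            termIdMap.put(termText, id++);
        }
        return termIdMap;
    }

    /**
     * Build the term-document matrix for the given directory.
     */
    public RealMatrix buildTermDocumentMatrix() throws IOException {
        int col = 0;
        int numDocs = countDocs(corpus);   // number of documents (columns)
        int numTerms = termIdMap.size();   // total number of terms (rows)
        // All entries start at 0, so terms absent from a document stay 0.
        RealMatrix tdMatrix = new Array2DRowRealMatrix(numTerms, numDocs);

        // Iterate through the directory to work with each document.
        for (File f : corpus.listFiles()) {
            if (!f.isHidden() && f.canRead()) {
                // I build the term-document matrix for a subset of the corpus,
                // so I need to look up each document by path name.
                // If you build it for the whole corpus, just iterate through all documents.
                String path = f.getPath();
                BooleanQuery pathQuery = new BooleanQuery();
                pathQuery.add(new TermQuery(new Term("path", path)), BooleanClause.Occur.SHOULD);
                TopDocs hits = searcher.search(pathQuery, 1);

                // Get the term vector; it is null if the field was indexed without term vectors.
                Terms termVector = hits.totalHits > 0
                        ? reader.getTermVector(hits.scoreDocs[0].doc, "contents")
                        : null;
                if (termVector != null) {
                    TermsEnum itr = termVector.iterator(null);
                    BytesRef term = null;

                    // Compute the weight of each term in this document.
                    while ((term = itr.next()) != null) {
                        String termText = term.utf8ToString();
                        int row = termIdMap.get(termText);
                        // On a term vector, totalTermFreq() is the frequency within this document.
                        long termFreq = itr.totalTermFreq();
                        // docFreq() on a term-vector enum is always 1, so ask the index instead.
                        long docCount = reader.docFreq(new Term("contents", term));
                        double weight = computeTfIdfWeight(termFreq, docCount, numDocs);
                        tdMatrix.setEntry(row, col, weight);
                    }
                }
                col++;
            }
        }
        return tdMatrix;
    }
}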

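Some functions are missing from your class, such as computeTfIdfWeight and countDocs. Also, BooleanClause does not exist. I'm using Lucene 4.6. Could you extend your answer? I'm very curious to test your code. – Umingo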
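
The two helpers the comment asks about are not shown in the answer, so their real definitions are unknown. As a minimal sketch of what they might look like, assuming a plain tf-idf weight of tf * ln(N / df) and a document count that uses the same file filter as the answer's loop (the class name TfIdfHelpers and both method bodies are hypothetical, not the author's code):

```java
import java.io.File;

/** Hypothetical stand-ins for the helpers the answer calls but does not show. */
final class TfIdfHelpers {

    private TfIdfHelpers() {
    }

    /**
     * Assumed weighting: termFreq * ln(numDocs / docFreq).
     * Returns 0 when any input is non-positive, to avoid dividing by zero.
     */
    static double computeTfIdfWeight(long termFreq, long docFreq, int numDocs) {
        if (termFreq <= 0 || docFreq <= 0 || numDocs <= 0) {
            return 0.0;
        }
        return termFreq * Math.log((double) numDocs / docFreq);
    }

    /**
     * Counts the entries in a directory that are readable and not hidden,
     * which is the same filter the answer's loop applies before adding a column.
     * Returns 0 if the path is not a readable directory.
     */
    static int countDocs(File dir) {
        File[] files = dir.listFiles();
        if (files == null) {
            return 0;
        }
        int count = 0;
        for (File f : files) {
            if (!f.isHidden() && f.canRead()) {
                count++;
            }
        }
        return count;
    }
}
```

To try this with the answer's class, the two method bodies could be pasted in as private methods. Other tf-idf variants (for example log-scaled term frequency or a smoothed idf) would fit the same signature. As for BooleanClause, it lives in the org.apache.lucene.search package in Lucene 4.x, so a missing import is one possible cause of that error.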
