
Suppose a Lucene index has the fields date and content. I want to get the value and frequency of every term that occurs in documents whose date is yesterday. The date field is a keyword field; the content field is analyzed and indexed. Using Lucene 4.3.1, how can I get all the terms that occur in only a subset of the documents?

Please help me with some sample code.


Can you show the code you have already tried? – Alicia


I can extract the complete set of terms... Fields fields = MultiFields.getFields(searcher.getIndexReader()); Terms terms = fields.terms("content"); TermsEnum eachTerm = terms.iterator(null); –
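Expanded into a minimal sketch (reusing the searcher from the question), that fragment walks every term of the content field across the whole index, which is why on its own it cannot be limited to a date range:

    Fields fields = MultiFields.getFields(searcher.getIndexReader()); 
    Terms terms = fields.terms("content"); 
    if (terms != null) { 
        TermsEnum eachTerm = terms.iterator(null); 
        BytesRef term; 
        while ((term = eachTerm.next()) != null) { 
            // docFreq() counts documents containing the term index-wide, not per date range 
            System.out.println(term.utf8ToString() + "\t" + eachTerm.docFreq()); 
        } 
    } 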


My interim solution is: 1. get the doc ids that fall within the date range, 2. analyze each document in my own code and build term frequencies, 3. sort by frequency, 4. take the top-n terms. Is there another solution that uses only the Lucene API? Please let me know! –

Answer


The source code of my solution is below...

import java.io.IOException; 
import java.io.StringReader; 
import java.util.ArrayList; 
import java.util.BitSet; 
import java.util.Collections; 
import java.util.HashMap; 
import java.util.List; 

import org.apache.lucene.analysis.Analyzer; 
import org.apache.lucene.analysis.TokenStream; 
import org.apache.lucene.analysis.standard.StandardAnalyzer; 
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute; 
import org.apache.lucene.document.Document; 
import org.apache.lucene.index.AtomicReaderContext; 
import org.apache.lucene.search.Collector; 
import org.apache.lucene.search.IndexSearcher; 
import org.apache.lucene.search.Query; 
import org.apache.lucene.search.Scorer; 
import org.apache.lucene.search.TermRangeQuery; 
import org.apache.lucene.util.BytesRef; 
import org.apache.lucene.util.Version; 

/** 
 * Collects the documents whose tweetDate lies in [fromDateTime, toDateTime), 
 * counts the terms of their content field and returns the top-10 terms as a 
 * space-separated string. 
 * 
 * @param searcher 
 * @param fromDateTime 
 *   - yyyymmddhhmmss (inclusive) 
 * @param toDateTime 
 *   - yyyymmddhhmmss (exclusive) 
 * @return the top-10 terms, separated by spaces 
 */ 
static public String top10(IndexSearcher searcher, String fromDateTime, 
     String toDateTime) { 
    String top10Query = ""; 
    try { 
     // 1. select the documents whose tweetDate lies in [fromDateTime, toDateTime) 
     Query query = new TermRangeQuery("tweetDate", new BytesRef( 
       fromDateTime), new BytesRef(toDateTime), true, false); 
     // collect the matching doc ids into a BitSet 
     final BitSet bits = new BitSet(searcher.getIndexReader().maxDoc()); 
     searcher.search(query, new Collector() { 

      private int docBase; 

      @Override 
      public void setScorer(Scorer scorer) throws IOException { 
      } 

      @Override 
      public void setNextReader(AtomicReaderContext context) 
        throws IOException { 
       this.docBase = context.docBase; 
      } 

      @Override 
      public void collect(int doc) throws IOException { 
       bits.set(doc + docBase); 
      } 

      @Override 
      public boolean acceptsDocsOutOfOrder() { 
       return false; 
      } 
     }); 

     // 2. analyzer used to re-tokenize the stored content of each hit 
     Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_43, 
       EnglishStopWords.getEnglishStopWords()); 

     // 3. count term frequencies over the matching documents 
     HashMap<String, Long> wordFrequency = new HashMap<>(); 
     for (int wx = 0; wx < bits.length(); ++wx) { 
      if (bits.get(wx)) { 
       Document wd = searcher.doc(wx); 
       // tokenize the stored content field of this document 
       TokenStream tokenStream = analyzer.tokenStream("temp", 
         new StringReader(wd.get("content"))); 
       // OffsetAttribute offsetAttribute = tokenStream 
       // .addAttribute(OffsetAttribute.class); 
       CharTermAttribute charTermAttribute = tokenStream 
         .addAttribute(CharTermAttribute.class); 
       tokenStream.reset(); 
       while (tokenStream.incrementToken()) { 
        // int startOffset = offsetAttribute.startOffset(); 
        // int endOffset = offsetAttribute.endOffset(); 
        String term = charTermAttribute.toString(); 
        // skip one-character tokens 
        if (term.length() < 2) 
         continue; 
        Long wl; 
        if ((wl = wordFrequency.get(term)) == null) 
         wordFrequency.put(term, 1L); 
        else { 
         wl += 1; 
         wordFrequency.put(term, wl); 
        } 
       } 
       tokenStream.end(); 
       tokenStream.close(); 
      } 
     } 
     analyzer.close(); 

     // 4. sort by frequency: the zero-padded count prefix makes 
     // reverse lexicographic order equal descending numeric order 
     List<String> occurterm = new ArrayList<String>(); 
     for (String ws : wordFrequency.keySet()) { 
      occurterm.add(String.format("%06d\t%s", wordFrequency.get(ws), 
        ws)); 
     } 
     Collections.sort(occurterm, Collections.reverseOrder()); 

     // make query string by top 10 words 
     int topCount = 10; 
     for (String ws : occurterm) { 
      if (topCount-- == 0) 
       break; 
      String[] tks = ws.split("\\t"); 
      top10Query += tks[1] + " "; 
     } 
     top10Query = top10Query.trim(); 
    } catch (IOException e) { 
     e.printStackTrace(); 
    } 
    // return top10 word string 
    return top10Query; 
} 
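For reference, calling the method might look like the following minimal sketch; the index path is illustrative, and it assumes tweetDate is indexed as a yyyymmddhhmmss keyword field and content is an analyzed, stored field:

    Directory dir = FSDirectory.open(new File("/path/to/index")); 
    IndexReader reader = DirectoryReader.open(dir); 
    IndexSearcher searcher = new IndexSearcher(reader); 

    // top-10 terms of documents dated 2013-07-21 (upper bound exclusive) 
    System.out.println(top10(searcher, "20130721000000", "20130722000000")); 

    reader.close(); 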