Matched fragments when using Lucene's RegexQuery

I am running queries against a Lucene index with RegexQuery. For example, to get all documents that contain a URL, I use new RegexQuery(new Term("text", "https?://[^\\s]+")) (I know, the regex is oversimplified).

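Now I would like to retrieve the text fragments that actually matched my query, such as http://example.com. Does Lucene offer an efficient way to do this, or do I have to run Java's regex Matcher over the whole text again?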
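
For reference, the fallback mentioned above (running a Java Matcher over the text again) could look roughly like this. It is only a sketch: FragmentExtractor and extract are made-up names rather than Lucene API, and it assumes the text of each hit is stored or otherwise available for re-scanning.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene.
final class FragmentExtractor {

    // Same simplified URL pattern as in the RegexQuery above.
    static final Pattern URL = Pattern.compile("https?://[^\\s]+");

    // Returns every substring of text that matches the pattern, in order of appearance.
    static List<String> extract(Pattern pattern, String text) {
        List<String> fragments = new ArrayList<String>();
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            fragments.add(matcher.group());
        }
        return fragments;
    }

    public static void main(String[] args) {
        String stored = "See http://example.com and https://example.org/docs for details.";
        System.out.println(extract(URL, stored));
    }
}
```

This re-scans the full text of every hit, which is exactly the cost I would like to avoid if Lucene can report the matches itself.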

2011-03-01 qqilihq

I don't think exactly what you want is possible, but here is a different approach that has a similar effect:

Open an IndexReader and enumerate all terms from "http" onward (terms are stored in lexicographic order) until they no longer start with "http://" or "https://":

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermEnum;

    // Pre-4.0 Lucene API (TermEnum); run inside a method that may throw IOException.
    final IndexReader reader = IndexReader.open(IndexHelper.DIRECTORY, true);
    // terms(t) positions the enum at the first term >= t, so term() is valid before next()
    final TermEnum termEnum = reader.terms(new Term("text", "http"));
    final List<Term> terms = new ArrayList<Term>();
    try {
        do {
            final Term foundTerm = termEnum.term();
            // stop at the end of the enum, when leaving the field, or once past the "http" prefix
            if (foundTerm == null || !"text".equals(foundTerm.field()) || !foundTerm.text().startsWith("http")) {
                break;
            }
            // other "http..." terms (e.g. "httpd") sort between the two schemes: skip them, don't stop
            if (foundTerm.text().startsWith("https://") || foundTerm.text().startsWith("http://")) {
                terms.add(foundTerm);
            }
        } while (termEnum.next());
    } finally {
        termEnum.close();
        reader.close();
    }

The resulting URLs then end up in the terms list as Lucene Term objects.
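
To see why seeking to "http" and walking forward works, here is a small stand-alone illustration that needs no Lucene at all. A TreeSet plays the role of the sorted term dictionary of one field; PrefixScanSketch, isUrl and urlTerms are made-up names, and the sample terms are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

// Illustration only: a TreeSet stands in for the sorted term dictionary of one field.
final class PrefixScanSketch {

    static boolean isUrl(String term) {
        return term.startsWith("http://") || term.startsWith("https://");
    }

    // Seeks to the first term >= "http", then walks forward while terms still start with "http".
    static List<String> urlTerms(SortedSet<String> dictionary) {
        List<String> urls = new ArrayList<String>();
        for (String term : dictionary.tailSet("http")) {
            if (!term.startsWith("http")) {
                break; // sorted order: no later term can start with "http"
            }
            if (isUrl(term)) {
                urls.add(term);
            }
            // a term like "httpd" sorts between "http://..." and "https://...", so it is skipped
        }
        return urls;
    }

    public static void main(String[] args) {
        SortedSet<String> dictionary = new TreeSet<String>();
        dictionary.add("apple");
        dictionary.add("hello");
        dictionary.add("http://example.com");
        dictionary.add("httpd");
        dictionary.add("https://example.org");
        dictionary.add("index");
        System.out.println(urlTerms(dictionary));
    }
}
```

The point of the sketch is the ordering: everything before "http" is never visited, and the scan ends at the first term that no longer carries the prefix.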

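The drawback, of course, is that this does not give you the documents these URLs occur in, but you can query for them again using the terms you found.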

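The way I wrote it here is not very flexible (though it probably gets the job done faster), but you can of course fall back to a pattern for more flexibility. You would then replace every foundTerm.text().startsWith("https://") || foundTerm.text().startsWith("http://") with yourPattern.matcher(foundTerm.text()).matches(), where yourPattern is a precompiled java.util.regex.Pattern.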
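
A minimal sketch of that pattern-based variant, working on plain term texts so it runs without an index. TermPatternFilter and filter are made-up names; in the real loop the strings would come from foundTerm.text().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper: keeps only the term texts that match a precompiled pattern.
final class TermPatternFilter {

    static List<String> filter(Pattern pattern, Iterable<String> termTexts) {
        List<String> matching = new ArrayList<String>();
        for (String text : termTexts) {
            // matches() succeeds only if the whole term text matches, unlike find()
            if (pattern.matcher(text).matches()) {
                matching.add(text);
            }
        }
        return matching;
    }

    public static void main(String[] args) {
        Pattern yourPattern = Pattern.compile("https?://[^\\s]+");
        List<String> termTexts = Arrays.asList("http://example.com", "httpd", "see", "https://example.org");
        System.out.println(filter(yourPattern, termTexts));
    }
}
```

Keep in mind that the seek to "http" only pays off when the pattern has a fixed literal prefix. For an arbitrary pattern you would have to walk all terms of the field.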

Sorry for writing so much ^^.

I hope it helps.

2011-03-02 12:07:29 Enduriel

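Thanks for your suggestion, Enduriel, and sorry for my late response. Unfortunately, my example was probably too simplistic: I need to query not only for URLs but also for more complex patterns, so I cannot use the prefix solution and definitely have to use RegexQuery. – qqilihq 2011-03-03 14:24:59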

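Hi qqilihq, and sorry for my late reply as well. In that case I guess my prefix approach won't help. I have only used it once, for autocompletion (where it worked well), but never for anything more complex. So I don't know of anything better, but I hope you find a good solution. – Enduriel 2011-03-07 13:03:00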
