2012-01-11

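Detecting duplicate English names

I'm trying to find an example that demonstrates a Lucene index, or some other type of index, that can check English first and last name combinations for possible duplicates. The duplicate check needs to account for common nicknames (e.g. Bob for Robert and Bill for William) as well as misspellings. Does anyone know of an example?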

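I plan to run the duplicate search during user registration. Each new user record needs to be checked against an index built from the database table that stores user names.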

Answer

I would use a SynonymFilter on firstName at index time, so that the index contains all possible variants (Bob -> Robert, Robert -> Bob, etc.). Index your existing users this way.
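To illustrate what index-time expansion means, here is a minimal plain-Java sketch with no Lucene dependency. The class name `NicknameExpansion`, its methods, and the two nickname groups are hypothetical and chosen only for illustration; a real deployment would load a much larger curated nickname list.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class NicknameExpansion {
    // Hypothetical nickname groups; every name in a group is treated as equivalent.
    static final List<List<String>> GROUPS = Arrays.asList(
            Arrays.asList("Bob", "Bobby", "Robert"),
            Arrays.asList("William", "Will", "Bill", "Billy"));

    // Maps each name to the full set of names in its group (including itself),
    // so the mapping is symmetric: Bob -> Robert and Robert -> Bob.
    static Map<String, Set<String>> buildSynonyms(List<List<String>> groups) {
        Map<String, Set<String>> synonyms = new HashMap<>();
        for (List<String> group : groups) {
            for (String name : group) {
                synonyms.computeIfAbsent(name, k -> new TreeSet<>()).addAll(group);
            }
        }
        return synonyms;
    }

    // Returns the terms that would be indexed for a first name:
    // the name itself plus any known variants.
    static Set<String> expand(Map<String, Set<String>> synonyms, String firstName) {
        Set<String> terms = new TreeSet<>();
        terms.add(firstName);
        terms.addAll(synonyms.getOrDefault(firstName, Collections.<String>emptySet()));
        return terms;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> synonyms = buildSynonyms(GROUPS);
        System.out.println(expand(synonyms, "Robert")); // [Bob, Bobby, Robert]
        System.out.println(expand(synonyms, "Anton"));  // [Anton]
    }
}
```

Because every variant is stored with the document, a query for any one of them (for example Bob) can find a user registered as Robert without expanding the query itself.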

Then use a QueryParser (with no SynonymFilter in its analyzer) to run fuzzy queries against that index.
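As a rough illustration of why a fuzzy query tolerates misspellings, the sketch below computes the Levenshtein edit distance between two strings in plain Java. The class `FuzzyNameMatch` and its `similarity` normalization (1 minus the distance divided by the shorter length) are assumptions made for this example; they approximate the idea behind fuzzy matching and are not Lucene's exact scoring code.

```java
class FuzzyNameMatch {
    // Classic dynamic-programming Levenshtein distance:
    // the minimum number of insertions, deletions, and substitutions.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Assumed normalization for illustration: 1 - distance / length of the shorter string.
    static double similarity(String a, String b) {
        int shorter = Math.min(a.length(), b.length());
        if (shorter == 0) {
            return a.length() == b.length() ? 1.0 : 0.0;
        }
        return 1.0 - (double) editDistance(a, b) / shorter;
    }

    public static void main(String[] args) {
        // Case-sensitive comparison: w/W, o/i, and the trailing s all differ.
        System.out.println(editDistance("wolliam", "Williams"));      // 3
        System.out.println(similarity("wolliam", "Williams") >= 0.5); // true
    }
}
```

Under this assumed normalization, a misspelling such as wolliam still clears a 0.5 threshold against Williams, which is the kind of tolerance a query term like `Wolliam~` relies on.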

Here is the code I came up with:

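import java.io.IOException;
import java.io.Reader;
import java.util.List;

import com.google.common.collect.HashMultimap;
import com.google.common.collect.ImmutableList;
import com.google.common.collect.Multimap;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;

// Written against the Lucene 3.0 API, with Guava and JUnit 4.
// SynonymFilter and SynonymEngine are not Lucene 3.0 core classes; they are
// assumed here to be the helpers from the "Lucene in Action" sample code.
public class NameDuplicateTests {
    private Analyzer analyzer;
    private IndexSearcher searcher;
    private IndexReader reader;
    private QueryParser qp;

    // Maps every first name to all names in its nickname group (including itself).
    private final static Multimap<String, String> firstNameSynonyms;
    static {
        firstNameSynonyms = HashMultimap.create();
        List<String> robertSynonyms = ImmutableList.of("Bob", "Bobby", "Robert");
        for (String name : robertSynonyms) {
            firstNameSynonyms.putAll(name, robertSynonyms);
        }
        List<String> willSynonyms = ImmutableList.of("William", "Will", "Bill", "Billy");
        for (String name : willSynonyms) {
            firstNameSynonyms.putAll(name, willSynonyms);
        }
    }

    // Index-time analyzer: whitespace tokens, plus nickname synonyms on firstName only.
    public static Analyzer createAnalyzer() {
        return new Analyzer() {
            @Override
            public TokenStream tokenStream(String fieldName, Reader reader) {
                TokenStream stream = new WhitespaceTokenizer(reader);
                if (fieldName.equals("firstName")) {
                    stream = new SynonymFilter(stream, new SynonymEngine() {
                        @Override
                        public String[] getSynonyms(String s) throws IOException {
                            return firstNameSynonyms.get(s).toArray(new String[0]);
                        }
                    });
                }
                return stream;
            }
        };
    }

    @Before
    public void setUp() throws Exception {
        Directory dir = new RAMDirectory();
        analyzer = createAnalyzer();

        // Index five sample users; firstNames.get(i) pairs with lastNames.get(i).
        IndexWriter writer = new IndexWriter(dir, analyzer, IndexWriter.MaxFieldLength.UNLIMITED);
        ImmutableList<String> firstNames = ImmutableList.of("William", "Robert", "Bobby", "Will", "Anton");
        ImmutableList<String> lastNames = ImmutableList.of("Robert", "Williams", "Mayor", "Bob", "FunkyMother");

        for (int id = 0; id < firstNames.size(); id++) {
            Document doc = new Document();
            doc.add(new Field("id", String.valueOf(id), Field.Store.YES, Field.Index.NOT_ANALYZED));
            doc.add(new Field("firstName", firstNames.get(id), Field.Store.YES, Field.Index.ANALYZED));
            doc.add(new Field("lastName", lastNames.get(id), Field.Store.YES, Field.Index.NOT_ANALYZED));
            writer.addDocument(doc);
        }
        writer.close();

        // Query-time analyzer deliberately has no SynonymFilter: the synonyms are already in the index.
        qp = new QueryParser(Version.LUCENE_30, "firstName", new WhitespaceAnalyzer());
        searcher = new IndexSearcher(dir);
        reader = searcher.getIndexReader();
    }

    @After
    public void tearDown() throws Exception {
        searcher.close();
    }

    @Test
    public void testNameFilter() throws Exception {
        // Nickname match on first name, exact match on last name.
        search("+firstName:Bob +lastName:Williams");
        // Nickname match on first name, fuzzy match on a misspelled last name.
        search("+firstName:Bob +lastName:Wolliam~");
    }

    // Parses the query, prints it, and prints the names of up to three hits.
    private void search(String query) throws ParseException, IOException {
        Query q = qp.parse(query);
        System.out.println(q);
        TopDocs res = searcher.search(q, 3);
        for (ScoreDoc sd : res.scoreDocs) {
            Document doc = reader.document(sd.doc);
            System.out.println("Found " + doc.get("firstName") + " " + doc.get("lastName"));
        }
    }
}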

This prints:

+firstName:Bob +lastName:Williams 
Found Robert Williams 
+firstName:Bob +lastName:wolliam~0.5 
Found Robert Williams 

Hope this helps!