2012-03-21 51 views
1

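Solr/Lucene: filter to convert "word-numbers" to digits

I'm using Solr as a search front end for a large corpus of music artist/track information.

At index time in Lucene/Solr, is there a filter or some other way to convert "word-numbers" such as "five" into the equivalent digits ("5")?

For example, a search for "Ben Folds Five" should return "Ben Folds 5" as a result.

There is PatternReplaceFilterFactory, but doing this with regular expressions seems like overkill.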

+2

You want to use synonyms. This can be done at index time, at query time, or both. – 2012-03-21 14:01:57

+2

I'd do it at index time with a synonym analyzer. I'm not sure what that maps to in Solr, but someone will know. – Reactormonk 2012-03-21 14:04:53

+0

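@Tass Thanks for the tips, guys; I looked at [SynonymFilter](http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory), but it seems to require a text file with explicit mappings, which is impractical for every possible number. Am I missing something? – Spoom 2012-03-21 14:12:17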
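One way around hand-writing the mapping file is to generate it. The sketch below is an illustration rather than anything posted in this thread: the class `NumberSynonyms` and its methods `toWords` and `synonymLines` are hypothetical names, and it assumes the explicit-mapping syntax (`five => 5`) of Solr's synonym file. It only covers 0 to 99, and multi-word entries such as `twenty one => 21` depend on how your tokenizer splits the input, so treat the output as a starting point to check against your own schema.

```java
import java.util.ArrayList;
import java.util.List;

class NumberSynonyms {
    private static final String[] ONES = {
        "zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen" };

    private static final String[] TENS = {
        "", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety" };

    // Spell out a number from 0 to 99 in lowercase English words.
    static String toWords(int n) {
        if (n < 0 || n > 99) {
            throw new IllegalArgumentException("Only 0..99 is supported: " + n);
        }
        if (n < 20) {
            return ONES[n];
        }
        String tens = TENS[n / 10];
        return (n % 10 == 0) ? tens : tens + " " + ONES[n % 10];
    }

    // Build one explicit mapping line per number, e.g. "five => 5".
    static List<String> synonymLines(int max) {
        List<String> lines = new ArrayList<>();
        for (int n = 0; n <= max; n++) {
            lines.add(toWords(n) + " => " + n);
        }
        return lines;
    }

    public static void main(String[] args) {
        // Redirect this output into the synonyms file used by the filter.
        for (String line : synonymLines(99)) {
            System.out.println(line);
        }
    }
}
```

Generating the file keeps the mapping explicit, which is what the synonym filter expects, while avoiding manual upkeep. Extending the range only requires changing `toWords`.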

Answers

1

Here is some working code (I've used it in the past):

import java.util.*; 

class ConvertWordToNumber { 

    public static String WithSeparator(long number) { 
     if (number < 0) { 
      return "-" + WithSeparator(-number); 
     } 
     if (number/1000L > 0) { 
      return WithSeparator(number/1000L) + "," 
        + String.format("%1$03d", number % 1000L); 
     } else { 
      return String.format("%1$d", number); 
     } 
    } 

    // Number words; values[] below holds the matching value at the same index.
    private static String[] numerals = { "zero", "one", "two", 
      "three", "four", "five", "six", "seven", "eight", "nine", "ten", 
      "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen", 
      "seventeen", "eighteen", "nineteen", "twenty", "thirty", "forty", 
      "fifty", "sixty", "seventy", "eighty", "ninety", "hundred" }; 

    private static long[] values = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 
      13, 14, 15, 16, 17, 18, 19, 20, 30, 40, 50, 60, 70, 80, 90, 100 }; 

    private static ArrayList<String> list = new ArrayList<String>(
      Arrays.asList(numerals)); 

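    // Parses a phrase below one thousand, e.g. "two hundred and five" -> 205.
    public static long parseNumerals(String text) throws Exception { 
     long value = 0; 
     // Split on runs of whitespace so doubled spaces do not yield empty tokens.
     String[] tokens = text.trim().replaceAll(" and ", " ").split("\\s+"); 
     for (String word : tokens) { 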
      if (!list.contains(word)) { 
       throw new Exception("Unknown token : " + word); 
      } 

      long subval = getValueOf(word); 
      if (subval == 100) { 
       if (value == 0) 
        value = 100; 
       else 
        value *= 100; 
      } else 
       value += subval; 
     } 

     return value; 
    } 

    private static long getValueOf(String word) { 
     return values[list.indexOf(word)]; 
    } 

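    // Scale words, largest first, and their multipliers at the same index.
    // parse() splits the text at the first scale word it finds and recurses on the remainder.
    private static String[] words = { "trillion", "billion", "million", "thousand" }; 
    private static long[] digits = { 1000000000000L, 1000000000L, 1000000L, 1000L }; 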

    public static long parse(String text) throws Exception { 
     text = text.toLowerCase().replaceAll("[\\-,]", " ").replaceAll(" and "," "); 
     long totalValue = 0; 
     boolean processed = false; 
     for (int n = 0; n < words.length; n++) { 
      int index = text.indexOf(words[n]); 
      if (index >= 0) { 
       String text1 = text.substring(0, index).trim(); 
       String text2 = text.substring(index + words[n].length()).trim(); 

       if (text1.equals("")) 
        text1 = "one"; 

       if (text2.equals("")) 
        text2 = "zero"; 

       totalValue = parseNumerals(text1) * digits[n] + parse(text2); 
       processed = true; 
       break; 
      } 
     } 

     if (processed) 
      return totalValue; 
     else 
      return parseNumerals(text); 
    } 


    public static void main(String[] args) throws Exception { 
     Scanner in = new Scanner(System.in); 
     System.out.print("Number in words : "); 
     String numberWordsText = in.nextLine(); 
     System.out.println("Value : " + 
       ConvertWordToNumber.WithSeparator(
       ConvertWordToNumber.parse(numberWordsText))); 
    } 
} 

Taken from here.

You can use it to build your own Solr filter.
Here is a decent post about that:

http://robotlibrarian.billdueber.com/building-a-solr-text-filter-for-normalizing-data/
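As a rough idea of what the core of such a filter could do, here is a minimal sketch that works on a plain list of tokens instead of a real Lucene token stream. It is an assumption-laden illustration, not the linked post's code: the class `NumberWordNormalizer` and the method `normalize` are hypothetical names, it only handles 0 to 99, and wiring it into an actual Solr filter class is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class NumberWordNormalizer {
    // Lookup table from number word to value (0..19 and the tens 20..90).
    private static final Map<String, Integer> SMALL = new HashMap<>();
    static {
        String[] ones = { "zero", "one", "two", "three", "four", "five",
            "six", "seven", "eight", "nine", "ten", "eleven", "twelve",
            "thirteen", "fourteen", "fifteen", "sixteen", "seventeen",
            "eighteen", "nineteen" };
        for (int i = 0; i < ones.length; i++) {
            SMALL.put(ones[i], i);
        }
        String[] tens = { "twenty", "thirty", "forty", "fifty",
            "sixty", "seventy", "eighty", "ninety" };
        for (int i = 0; i < tens.length; i++) {
            SMALL.put(tens[i], (i + 2) * 10);
        }
    }

    // Replace number words with digits; other tokens pass through unchanged.
    static List<String> normalize(List<String> tokens) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            Integer first = SMALL.get(tokens.get(i).toLowerCase(Locale.ROOT));
            if (first == null) {
                out.add(tokens.get(i));
                i++;
                continue;
            }
            int value = first;
            i++;
            // Merge a tens word with a following units word: "twenty" "one" -> 21.
            if (first >= 20 && i < tokens.size()) {
                Integer next = SMALL.get(tokens.get(i).toLowerCase(Locale.ROOT));
                if (next != null && next >= 1 && next <= 9) {
                    value += next;
                    i++;
                }
            }
            out.add(Integer.toString(value));
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [Ben, Folds, 5]
        System.out.println(normalize(Arrays.asList("Ben", "Folds", "Five")));
    }
}
```

Note that a blanket rewrite like this also turns an artist name such as "One Direction" into "1 Direction", so applying the same analysis at both index time and query time matters if you go this route.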

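Please contribute it back to the Solr community once you have it working. You can write your own wiki page for it.

To get started, just follow a link like this one: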
http://wiki.apache.org/solr/SolrWordToNumberConverter

+1

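Thanks! I'll deal with this at some point. For now I've worked around it by getting fuzzy search to work with a lower match threshold. If I do build such a filter, I'll be sure to contribute it back. – Spoom 2012-03-22 21:12:50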
