2017-05-02

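How can I guarantee that language analysis is applied to the tokens generated by WordDelimiterTokenFilter?

This question follows from a fix I applied for FEMMES.COM not being tokenized correctly (see "How do I get french text FEMMES.COM to index as language variants of FEMMES").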

After that fix I am facing a new failing test case: #FEMMES2017 should be tokenized as Femmes, Femme, 2017.

My approach of using a MappingCharFilter is incorrect and is really just a band-aid. What is the correct way to get this failing test case to pass?

Current index configuration

"analyzers": [ 
    { 
     "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer", 
     "name": "text_language_search_custom_analyzer", 
     "tokenizer": "text_language_search_custom_analyzer_ms_tokenizer", 
     "tokenFilters": [ 
     "lowercase", 
     "text_synonym_token_filter", 
     "asciifolding", 
     "language_word_delim_token_filter" 
     ], 
     "charFilters": [ 
     "html_strip", 
     "replace_punctuation_with_comma" 
     ] 
    }, 
    { 
     "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer", 
     "name": "text_exact_search_Index_custom_analyzer", 
     "tokenizer": "text_exact_search_Index_custom_analyzer_tokenizer", 
     "tokenFilters": [ 
     "lowercase", 
     "asciifolding" 
     ], 
     "charFilters": [] 
    } 
    ], 
    "tokenizers": [ 
    { 
     "@odata.type": "#Microsoft.Azure.Search.MicrosoftLanguageStemmingTokenizer", 
     "name": "text_language_search_custom_analyzer_ms_tokenizer", 
     "maxTokenLength": 300, 
     "isSearchTokenizer": false, 
     "language": "french" 
    }, 
    { 
     "@odata.type": "#Microsoft.Azure.Search.StandardTokenizerV2", 
     "name": "text_exact_search_Index_custom_analyzer_tokenizer", 
     "maxTokenLength": 300 
    } 
    ], 
    "tokenFilters": [ 
    { 
     "@odata.type": "#Microsoft.Azure.Search.SynonymTokenFilter", 
     "name": "text_synonym_token_filter", 
     "synonyms": [ 
     "ca => ça", 
     "yeux => oeil", 
     "oeufs,oeuf,Œuf,Œufs,œuf,œufs", 
     "etre,ete" 
     ], 
     "ignoreCase": true, 
     "expand": true 
    }, 
    { 
     "@odata.type": "#Microsoft.Azure.Search.WordDelimiterTokenFilter", 
     "name": "language_word_delim_token_filter", 
     "generateWordParts": true, 
     "generateNumberParts": true, 
     "catenateWords": false, 
     "catenateNumbers": false, 
     "catenateAll": false, 
     "splitOnCaseChange": true, 
     "preserveOriginal": false, 
     "splitOnNumerics": true, 
     "stemEnglishPossessive": true, 
     "protectedWords": [] 
    } 
    ], 
    "charFilters": [ 
    { 
     "@odata.type": "#Microsoft.Azure.Search.MappingCharFilter", 
     "name": "replace_punctuation_with_comma", 
     "mappings": [ 
     "#=>,", 
     "$=>,", 
     "€=>,", 
     "£=>,", 
     "%=>,", 
     "&=>,", 
     "+=>,", 
     "/=>,", 
     "==>,", 
     "<=>,", 
     ">=>,", 
     "@=>,", 
     "_=>,", 
     "µ=>,", 
     "§=>,", 
     "¤=>,", 
     "°=>,", 
     "!=>,", 
     "?=>,", 
     "\"=>,", 
     "'=>,", 
     "`=>,", 
     "~=>,", 
     "^=>,", 
     ".=>,", 
     ":=>,", 
     ";=>,", 
     "(=>,", 
     ")=>,", 
     "[=>,", 
     "]=>,", 
     "{=>,", 
     "}=>,", 
     "*=>,", 
     "-=>," 
     ] 
    } 
    ] 

Analyze API call

{ 
    "analyzer": "text_language_search_custom_analyzer", 
    "text": "#femmes2017" 
} 

Analyze API response

{ 
    "@odata.context": "https://one-adscope-search-eu-prod.search.windows.net/$metadata#Microsoft.Azure.Search.V2016_09_01.AnalyzeResult", 
    "tokens": [ 
    { 
     "token": "femmes", 
     "startOffset": 1, 
     "endOffset": 7, 
     "position": 0 
    }, 
    { 
     "token": "2017", 
     "startOffset": 7, 
     "endOffset": 11, 
     "position": 1 
    } 
    ] 
} 

Answer

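Input text is processed by the analyzer's components in this order: char filters -> tokenizer -> token filters. In your case, the tokenizer performs lemmatization before the tokens reach the WordDelimiter token filter, so the parts that the filter splits off afterwards never go through language analysis. Unfortunately, Microsoft's stemmer and lemmatizer are not available as standalone token filters that you could apply after the WordDelimiter token filter. You will need to add another token filter that normalizes the output of the WordDelimiter token filter to suit your requirements. For this case only, you could move the SynonymTokenFilter to the end of the analyzer chain and have it map femmes to femme. This is obviously not a great workaround, because it is very specific to the data you are working with. I hope this information helps you find a more general solution.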
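As a sketch of that workaround only, the language analyzer could list the synonym filter last and add a rule for this one word. The `femmes,femme` rule is an assumption added for illustration; with `expand` set to true it should emit both forms, whereas an explicit `femmes => femme` rule would keep only `femme`. All other names come from the index configuration in the question, and only the changed parts are shown.

```json
{
  "analyzers": [
    {
      "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
      "name": "text_language_search_custom_analyzer",
      "tokenizer": "text_language_search_custom_analyzer_ms_tokenizer",
      "tokenFilters": [
        "lowercase",
        "asciifolding",
        "language_word_delim_token_filter",
        "text_synonym_token_filter"
      ],
      "charFilters": [
        "html_strip",
        "replace_punctuation_with_comma"
      ]
    }
  ],
  "tokenFilters": [
    {
      "@odata.type": "#Microsoft.Azure.Search.SynonymTokenFilter",
      "name": "text_synonym_token_filter",
      "synonyms": [
        "ca => ça",
        "yeux => oeil",
        "oeufs,oeuf,Œuf,Œufs,œuf,œufs",
        "etre,ete",
        "femmes,femme"
      ],
      "ignoreCase": true,
      "expand": true
    }
  ]
}
```

Because the synonym filter now runs after `asciifolding`, rules that contain accented characters, such as `ca => ça`, see already-folded input and may need revisiting.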

This is where the site we are replacing has the advantage at this point. Its SOLR configuration allows this chain.

You can always use a Lucene stemmer token filter after the WordDelimiter token filter, but keep in mind that it will stem all tokens produced by the analyzer. – Yahnoosh
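A minimal sketch of that suggestion, assuming the `StemmerTokenFilter` listed in the custom analyzers documentation with its `french` language option; the filter name `french_stemmer_token_filter` is hypothetical. It is placed after the word delimiter filter so that the split parts are stemmed as well. As noted, it stems every token, and it replaces each token with its stem rather than adding a variant alongside it.

```json
{
  "analyzers": [
    {
      "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
      "name": "text_language_search_custom_analyzer",
      "tokenizer": "text_language_search_custom_analyzer_ms_tokenizer",
      "tokenFilters": [
        "lowercase",
        "text_synonym_token_filter",
        "asciifolding",
        "language_word_delim_token_filter",
        "french_stemmer_token_filter"
      ],
      "charFilters": [
        "html_strip",
        "replace_punctuation_with_comma"
      ]
    }
  ],
  "tokenFilters": [
    {
      "@odata.type": "#Microsoft.Azure.Search.StemmerTokenFilter",
      "name": "french_stemmer_token_filter",
      "language": "french"
    }
  ]
}
```

Running `#femmes2017` through the Analyze API again with this analyzer would show what the stemmer actually emits for the split parts.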

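Do you mean the StemmerTokenFilter on this page? https://docs.microsoft.com/en-us/rest/api/searchservice/custom-analyzers-in-azure-search Its description is "Language specific stemming filter". So it will only stem and will not lemmatize? I take it there is no HunspellStemFilterFactory equivalent that I could simply feed the .dic and .aff files from the old site?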
