2012-09-25 36 views

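Kanji character problem in Solr DataImport

I am having trouble indexing Chinese/Japanese text in Solr 3.4. I import the data using DIH; the connection block is: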

 
<dataSource type="JdbcDataSource" 
    driver="com.mysql.jdbc.Driver" 
    url="jdbc:mysql://localhost/db_development?useUnicode=true&amp;characterEncoding=UTF-8&amp;characterSetResults=UTF-8" 
    user="user" 
    useUnicode="true" 
    characterEncoding="UTF-8" 
    encoding="UTF-8" 
    password="password" 
    zeroDateTimeBehavior="convertToNull" 
    name="app" /> 

The field type definition for this field is:

 
    <fieldType name="text_commongrams" class="solr.TextField"> 
    <analyzer> 
     <charFilter class="solr.HTMLStripCharFilterFactory" /> 
     <tokenizer class="solr.ICUTokenizerFactory" /> 
     <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/> 
     <filter class="solr.ICUFoldingFilterFactory"/> 
     <filter class="solr.ASCIIFoldingFilterFactory"/> 
     <filter class="solr.ICUNormalizer2FilterFactory" name="nfkc_cf" mode="compose"/> 
     <filter class="solr.RemoveDuplicatesTokenFilterFactory" /> 
     <filter class="solr.TrimFilterFactory" /> 
     <filter class="solr.LowerCaseFilterFactory" /> 
     <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/> 
    <filter class="solr.SynonymFilterFactory" 
     synonyms="synonyms.txt" 
     ignoreCase="true" 
     expand="true" /> 
    <filter class="solr.CommonGramsFilterFactory" 
     words="stopwords_en.txt" 
     ignoreCase="true" /> 
    <filter class="solr.StopFilterFactory" 
     words="stopwords_en.txt" 
     ignoreCase="true" /> 
    <filter class="solr.WordDelimiterFilterFactory" 
     generateWordParts="1" 
     splitOnNumerics="0" 
     generateNumberParts="1" 
     catenateWords="1" 
     catenateNumbers="1" 
     catenateAll="0" 
     preserveOriginal="1" /> 
    </analyzer> 
</fieldType> 

The MySQL character encoding settings are:

 
+--------------------------+-----------------------------------------+
| Variable_name            | Value                                   |
+--------------------------+-----------------------------------------+
| character_set_client     | latin1                                  |
| character_set_connection | latin1                                  |
| character_set_database   | latin1                                  |
| character_set_filesystem | binary                                  |
| character_set_results    | latin1                                  |
| character_set_server     | utf8                                    |
| character_set_system     | utf8                                    |
| character_sets_dir       | /opt/local/share/mysql5/mysql/charsets/ |
+--------------------------+-----------------------------------------+

I start Solr with the Java parameter -Dfile.encoding=UTF-8

The input text is JavaOne Tokyo 2012での発表スライド. When I import it into Solr and query the document by its ID, the text comes back as JavaOne Tokyo 2012ã§ã®ç™ºè¡¨ã‚¹ãƒ©ã‚¤ãƒ‰

Can anyone tell me where I am going wrong?

Answer


So in the end I had to change my MySQL tables to store the strings as UTF8. For details on how to convert an existing table from latin1 to utf8, see here
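
For anyone who hits the same symptom: the garbled output in the question is what you would expect when the UTF-8 bytes of the text are decoded as latin1 somewhere between MySQL and Solr. The Python sketch below is only an illustration and is not part of the original setup. The function names are hypothetical, and it assumes latin1 behaves like Windows-1252, with a plain Latin-1 fallback for the few bytes that cp1252 leaves undefined.

```python
# Sketch: reproduce and reverse the "UTF-8 read as latin1" garbling.
# Assumption: latin1 is treated like cp1252, with Latin-1 as a fallback
# for the bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D) that cp1252 does not define.

def _decode_byte(b: int) -> str:
    """Decode one byte as cp1252, falling back to Latin-1."""
    raw = bytes([b])
    try:
        return raw.decode("cp1252")
    except UnicodeDecodeError:
        return raw.decode("latin-1")


def _encode_char(ch: str) -> bytes:
    """Inverse of _decode_byte: encode one character back to a single byte."""
    try:
        return ch.encode("cp1252")
    except UnicodeEncodeError:
        return ch.encode("latin-1")


def to_mojibake(text: str) -> str:
    """Encode text as UTF-8, then misread those bytes as latin1."""
    return "".join(_decode_byte(b) for b in text.encode("utf-8"))


def repair_mojibake(garbled: str) -> str:
    """Undo to_mojibake: recover the original bytes and decode them as UTF-8."""
    return b"".join(_encode_char(ch) for ch in garbled).decode("utf-8")


original = "JavaOne Tokyo 2012での発表スライド"
garbled = to_mojibake(original)

# The ASCII prefix survives; only the multi-byte characters are garbled.
print(garbled.startswith("JavaOne Tokyo 2012"))
# The garbling is reversible as long as no bytes were dropped on the way.
print(repair_mojibake(garbled) == original)
```

If the round trip succeeds on a value read back from Solr, the stored bytes are probably still valid UTF-8 and only their labelling as latin1 is wrong. That is worth checking before converting a table, because converting such mislabelled data with an ordinary character set conversion can encode it a second time.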