在LUCENE-5472中,如果術語太長,Lucene會更改爲出錯,而不僅僅是記錄消息。這個錯誤指出SOLR不接受令牌超過了32766如何處理SOLR中的「文檔包含至少一個巨大的術語」?
Caused by: java.lang.IllegalArgumentException: Document contains at least one immense term in field="text" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '[10, 10, 70, 111, 117, 110, 100, 32, 116, 104, 105, 115, 32, 111, 110, 32, 116, 104, 101, 32, 119, 101, 98, 32, 104, 111, 112, 101, 32, 116]...', original message: bytes can be at most 32766 in length; got 43225
at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:671)
at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:344)
at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:300)
at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:234)
at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:450)
at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1475)
at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:239)
at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:163)
... 54 more
Caused by: org.apache.lucene.util.BytesRefHash$MaxBytesLengthExceededException: bytes can be at most 32766 in length; got 43225
at org.apache.lucene.util.BytesRefHash.add(BytesRefHash.java:284)
試圖解決這一問題時,我在架構中增加了兩個濾波器(大膽):
<field name="text" type="text_en_splitting" termPositions="true" termOffsets="true" termVectors="true" indexed="true" required="false" stored="true"/>
<fieldType name="text_en_splitting" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
<fieldType name="text_en_splitting" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<!-- in this example, we will only use synonyms at query time
<filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
-->
<!-- Case insensitive stop word removal.
-->
<filter class="solr.StopFilterFactory"
ignoreCase="true"
words="lang/stopwords_en.txt"
/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.PorterStemFilterFactory"/>
**<filter class="solr.TruncateTokenFilterFactory" prefixLength="32700"/>
<filter class="solr.LengthFilterFactory" min="2" max="32700" />**
</analyzer>
</fieldType>
由於錯誤仍然是相同的(這使我認爲過濾器沒有正確設置,也許?) 重新啓動服務器是關鍵感謝Bashetti先生
問題是哪一個更好:LengthFilterFactory
或TruncateTokenFilterFactory
?假設一個字節是一個字符(因爲過濾器應該刪除'不尋常'字符?)是正確的嗎?) 謝謝!
爲什麼你LengthFilterFactory因爲它要過濾掉這些標記*不*通過最大包長度爲分鐘。 –
,因爲我想忽略或截斷大於SOLR允許的最大值的元素(或其他) – salvob
更改後是否重新啓動solr服務器? –