2016-06-21

Indexing Nutch crawl data into Solr: "correct the analyzer to not produce such terms". I want to index the data from my Nutch crawl, so I run:

bin/nutch index -D solr.server.url="http://localhost:8983/solr/carerate" crawl/crawldb -linkdb crawl/linkdb crawl/segments/2016* 

At first it worked perfectly fine: I indexed my data, sent some queries, and got good results. But then I ran the crawl again so it would fetch more pages, and now when I run the Nutch index command I get:

java.io.IOException: Job failed!

Here is my Hadoop log:

java.lang.Exception: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Exception writing document id http://www.cs.toronto.edu/~frank/About_Me/about_me.html to the index; possible analysis error: Document contains at least one immense term in field="content" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '[70, 114, 97, 110, 107, 32, 82, 117, 100, 122, 105, 99, 122, 32, 45, 32, 65, 98, 111, 117, 116, 32, 77, 101, 32, 97, 98, 111, 117, 116]...', original message: bytes can be at most 32766 in length; got 40063. Perhaps the document has an indexed string field (solr.StrField) which is too large
    at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
Caused by: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Exception writing document id http://www.cs.toronto.edu/~frank/About_Me/about_me.html to the index; possible analysis error: Document contains at least one immense term in field="content" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '[70, 114, 97, 110, 107, 32, 82, 117, 100, 122, 105, 99, 122, 32, 45, 32, 65, 98, 111, 117, 116, 32, 77, 101, 32, 97, 98, 111, 117, 116]...', original message: bytes can be at most 32766 in length; got 40063. Perhaps the document has an indexed string field (solr.StrField) which is too large
    at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:552)
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
    at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
    at org.apache.nutch.indexwriter.solr.SolrIndexWriter.close(SolrIndexWriter.java:153)
    at org.apache.nutch.indexer.IndexWriters.close(IndexWriters.java:115)
    at org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.java:44)
    at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.close(ReduceTask.java:502)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:456)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
    at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
2016-06-21 13:27:37,994 ERROR indexer.IndexingJob - Indexer: java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:145)
    at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:222)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:231)

I realize the page mentioned in the error must contain a very long term. So in schema.xml (in Nutch) and in the managed-schema (in Solr), I changed the types of "id", "content" and "text" from "string" to "text_general", but that did not solve the problem.
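For reference, the change described above would look roughly like this in the schema; the attribute values below are illustrative, not copied from the actual Nutch or Solr schema files:

```xml
<!-- sketch only: before, these fields used type="string" (solr.StrField),
     which indexes the whole value as a single term and hits the 32766-byte limit -->
<field name="id" type="text_general" stored="true" indexed="true"/>
<field name="content" type="text_general" stored="false" indexed="true"/>
<field name="text" type="text_general" stored="false" indexed="true" multiValued="true"/>
```

Note that text_general tokenizes the value, so individual tokens are usually short; the error persisting suggests a single token (not the whole field) still exceeds the limit.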

I'm not an expert, so I don't know how to correct the analyzer without messing up something else. I've read that I can:

1. use a LengthFilterFactory (in the index analyzer) to filter out tokens that fall outside the requested length range, or
2. use a TruncateTokenFilterFactory (in the index analyzer) to cap the maximum length of indexed tokens.

But there are so many analyzers in the schema. Should I change the definition of the text_general analyzer? If so, since "content" and other fields are of type text_general, won't it affect them all as well?

Does anyone know how I can fix this? I would really appreciate any help.

By the way, I'm using Nutch 1.11 and Solr 6.0.0.

Answer


Assuming that you're using the schema.xml bundled with Nutch as the base schema for your Solr install, you basically just need to add either one of those filters (LengthFilterFactory or TruncateTokenFilterFactory) to the text_general field type.

Starting from the initial definition of the text_general fieldType (https://github.com/apache/nutch/blob/master/conf/schema.xml#L108-L123), you need to add the following to the <analyzer type="index"> section:

... 
<analyzer type="index"> 
    <tokenizer class="solr.StandardTokenizerFactory"/> 
    <!-- remove long tokens --> 
    <filter class="solr.LengthFilterFactory" min="3" max="7"/> 
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" /> 
    <filter class="solr.LowerCaseFilterFactory"/> 
</analyzer> 
... 

The same can also be applied to the query analyzer, using identical syntax. If you want to use the TruncateTokenFilterFactory filter instead, just swap the added line with:

<filter class="solr.TruncateTokenFilterFactory" prefixLength="5"/> 

Also, don't forget to adjust each filter's parameters to your needs: min and max for LengthFilterFactory, and prefixLength for TruncateTokenFilterFactory.
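For this particular error (a 40063-byte term against Lucene's 32766-byte limit), one caveat when picking values: LengthFilterFactory's max counts characters, while the limit counts UTF-8 bytes, and a single character can encode to up to 4 bytes. So leaving headroom is prudent. A hedged example, with illustrative values rather than recommendations:

```xml
<!-- 8000 characters is at most 32000 bytes in UTF-8, safely under the 32766-byte term limit -->
<filter class="solr.LengthFilterFactory" min="1" max="8000"/>
```

A tight range like min="3" max="7" (as in the snippet above) would also silence the error, but it discards every token outside that range, which is far more aggressive than needed to avoid the byte limit.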

To answer your other question: yes, this will affect all fields of type text_general, but that is not really a problem, because if another immense term showed up in any other field, the same error would be thrown. If you still want to isolate this change to the content field, just create a new fieldType with a new name (e.g. truncated_text_general; copy and paste the whole fieldType section and change the name attribute), then change the type of the content field (https://github.com/apache/nutch/blob/master/conf/schema.xml#L339) to match your newly created fieldType.
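A sketch of that isolated approach, taking the stock text_general definition as the starting point; the name truncated_text_general and the attribute values are examples only, and your schema's filter list may differ:

```xml
<fieldType name="truncated_text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- drop over-long tokens before they reach the index -->
    <filter class="solr.LengthFilterFactory" min="1" max="8000"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- then point only the content field at the new type -->
<field name="content" type="truncated_text_general" stored="false" indexed="true"/>
```

This leaves every other text_general field untouched, at the cost of maintaining two nearly identical fieldType definitions.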

That said, just choose sane values for the filters so that you don't end up missing a lot of terms from your index.


Thanks for your reply, Jorge. Although you explained how it works very well, as I mentioned in the question body I did try this, but unfortunately it did not solve my problem.