Indexing Nutch crawl data into Solr: correcting the analyzer so it does not produce immense terms
bin/nutch index -D solr.server.url="http://localhost:8983/solr/carerate" crawl/crawldb -linkdb crawl/linkdb crawl/segments/2016*
At first it worked perfectly: I indexed my data, sent some queries, and got good results. But then I ran the crawl again so that it fetched more pages, and now when I run the Nutch index command I get:
java.io.IOException: Job failed!
Here is my Hadoop log:
java.lang.Exception: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Exception writing document id http://www.cs.toronto.edu/~frank/About_Me/about_me.html to the index; possible analysis error: Document contains at least one immense term in field="content" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '[70, 114, 97, 110, 107, 32, 82, 117, 100, 122, 105, 99, 122, 32, 45, 32, 65, 98, 111, 117, 116, 32, 77, 101, 32, 97, 98, 111, 117, 116]...', original message: bytes can be at most 32766 in length; got 40063. Perhaps the document has an indexed string field (solr.StrField) which is too large
    at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
Caused by: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Exception writing document id http://www.cs.toronto.edu/~frank/About_Me/about_me.html to the index; possible analysis error: Document contains at least one immense term in field="content" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '[70, 114, 97, 110, 107, 32, 82, 117, 100, 122, 105, 99, 122, 32, 45, 32, 65, 98, 111, 117, 116, 32, 77, 101, 32, 97, 98, 111, 117, 116]...', original message: bytes can be at most 32766 in length; got 40063. Perhaps the document has an indexed string field (solr.StrField) which is too large
    at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:552)
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
    at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
    at org.apache.nutch.indexwriter.solr.SolrIndexWriter.close(SolrIndexWriter.java:153)
    at org.apache.nutch.indexer.IndexWriters.close(IndexWriters.java:115)
    at org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.java:44)
    at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.close(ReduceTask.java:502)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:456)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
    at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
2016-06-21 13:27:37,994 ERROR indexer.IndexingJob - Indexer: java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:145)
    at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:222)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:231)
I realized the page mentioned in the error must contain a very long term. So in schema.xml (in Nutch) and in managed-schema (in Solr), I changed the type of "id", "content", and "text" from "strings" to "text_general". But that did not solve the problem.
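The change I made looked roughly like this (a sketch from memory, not my exact schema; the other attributes on these fields may differ in yours):

```xml
<!-- before: the field type indexes each value as a single untokenized string,
     so one huge value becomes one huge term -->
<field name="content" type="strings" indexed="true" stored="true"/>

<!-- after: text_general runs a tokenizer, so the value is split into words -->
<field name="content" type="text_general" indexed="true" stored="true"/>
```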
I am not an expert, so I don't know how to correct the analyzer without messing something else up. I have read that I could:
1. Use a LengthFilterFactory (in the index analyzer) to filter out tokens that do not fall within a requested length range.
2. Use a TruncateTokenFilterFactory (in the index analyzer) to cap the maximum length of indexed tokens.
But there are so many analyzers in the schema. Should I change the analyzer's definition? If so, since "content" and other fields share the text_general type, won't that affect them as well?
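From what I've read, the two options would look roughly like this inside the index analyzer of the text_general field type (just my sketch: the max and prefixLength values of 1024 are guesses, and the real default analyzer chain has more filters than shown here):

```xml
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Option 1: drop any token outside the given character-length range -->
    <filter class="solr.LengthFilterFactory" min="1" max="1024"/>
    <!-- Option 2 (alternative to option 1): keep long tokens but truncate
         them to the first prefixLength characters -->
    <!-- <filter class="solr.TruncateTokenFilterFactory" prefixLength="1024"/> -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```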
Does anyone know how I can fix this? I would really appreciate any help.
By the way, I am using Nutch 1.11 and Solr 6.0.0.
Thanks for your reply, Jorge. You explained how it works very well, but as I mentioned in the question body, I did try exactly that, and unfortunately it did not solve my problem.