2011-08-05

I am using Nutch and Solr to index a file share. The Solr index is empty after running the nutch solrindex command.

First I run: bin/nutch crawl urls

This gives me:

solrUrl is not set, indexing will be skipped... 
crawl started in: crawl-20110804191414 
rootUrlDir = urls 
threads = 10 
depth = 5 
solrUrl=null 
Injector: starting at 2011-08-04 19:14:14 
Injector: crawlDb: crawl-20110804191414/crawldb 
Injector: urlDir: urls 
Injector: Converting injected urls to crawl db entries. 
Injector: Merging injected urls into crawl db. 
Injector: finished at 2011-08-04 19:14:16, elapsed: 00:00:02 
Generator: starting at 2011-08-04 19:14:16 
Generator: Selecting best-scoring urls due for fetch. 
Generator: filtering: true 
Generator: normalizing: true 
Generator: jobtracker is 'local', generating exactly one partition. 
Generator: Partitioning selected urls for politeness. 
Generator: segment: crawl-20110804191414/segments/20110804191418 
Generator: finished at 2011-08-04 19:14:20, elapsed: 00:00:03 
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property. 
Fetcher: starting at 2011-08-04 19:14:20 
Fetcher: segment: crawl-20110804191414/segments/20110804191418 
Fetcher: threads: 10 
QueueFeeder finished: total 1 records + hit by time limit :0 
-finishing thread FetcherThread, activeThreads=9 
-finishing thread FetcherThread, activeThreads=8 
-finishing thread FetcherThread, activeThreads=7 
-finishing thread FetcherThread, activeThreads=6 
-finishing thread FetcherThread, activeThreads=5 
-finishing thread FetcherThread, activeThreads=4 
-finishing thread FetcherThread, activeThreads=3 
-finishing thread FetcherThread, activeThreads=2 
-finishing thread FetcherThread, activeThreads=1 
fetching file:///mnt/public/Personal/Reminder Building Security.htm 
-finishing thread FetcherThread, activeThreads=0 
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0 
-activeThreads=0 
Fetcher: finished at 2011-08-04 19:14:22, elapsed: 00:00:02 
ParseSegment: starting at 2011-08-04 19:14:22 
ParseSegment: segment: crawl-20110804191414/segments/20110804191418 
ParseSegment: finished at 2011-08-04 19:14:23, elapsed: 00:00:01 
CrawlDb update: starting at 2011-08-04 19:14:23 
CrawlDb update: db: crawl-20110804191414/crawldb 
CrawlDb update: segments: [crawl-20110804191414/segments/20110804191418] 
CrawlDb update: additions allowed: true 
CrawlDb update: URL normalizing: true 
CrawlDb update: URL filtering: true 
CrawlDb update: Merging segment data into db. 
CrawlDb update: finished at 2011-08-04 19:14:24, elapsed: 00:00:01 
Generator: starting at 2011-08-04 19:14:24 
Generator: Selecting best-scoring urls due for fetch. 
Generator: filtering: true 
Generator: normalizing: true 
Generator: jobtracker is 'local', generating exactly one partition. 
Generator: 0 records selected for fetching, exiting ... 
Stopping at depth=1 - no more URLs to fetch. 
LinkDb: starting at 2011-08-04 19:14:25 
LinkDb: linkdb: crawl-20110804191414/linkdb 
LinkDb: URL normalize: true 
LinkDb: URL filter: true 
LinkDb: adding segment: file:/home/nutch/nutch-1.3/runtime/local/crawl-20110804191414/segments/20110804191418 
LinkDb: finished at 2011-08-04 19:14:26, elapsed: 00:00:01 
crawl finished: crawl-20110804191414 
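The log above starts with "solrUrl is not set, indexing will be skipped...". In Nutch 1.3 the crawl command accepts a -solr option, so crawling and Solr indexing can be done in one step instead of running solrindex separately. A sketch, assuming the same Solr URL and the depth/thread values reported in the log:

```shell
# One-shot crawl that also indexes into Solr; -solr sets solrUrl so the
# final indexing step is not skipped.
bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 5 -threads 10
```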

Then I run: bin/nutch solrindex http://localhost:8983/solr/ crawl-20110804191414/crawldb crawl-20110804191414/linkdb crawl-20110804191414/segments/*

This gives me:

SolrIndexer: starting at 2011-08-04 19:17:07 
SolrIndexer: finished at 2011-08-04 19:17:08, elapsed: 00:00:01 

When I run a query on Solr I get:

<response> 
    <lst name="responseHeader"> 
      <int name="status">0</int> 
      <int name="QTime">2</int> 
      <lst name="params"> 
       <str name="indent">on</str> 
       <str name="start">0</str> 
       <str name="q">*:*</str> 
       <str name="version">2.2</str> 
       <str name="rows">10</str> 
      </lst> 
    </lst> 
    <result name="response" numFound="0" start="0"/> 
</response> 
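For reference, a response like this can be reproduced against Solr's select handler; a sketch using the same parameters shown in the responseHeader above:

```shell
# Match-all query (q=*:*); numFound="0" in the result means the index is empty.
curl 'http://localhost:8983/solr/select?indent=on&q=*:*&start=0&rows=10'
```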

:(

Note that this works fine when I crawl a website using the http protocol, but it does not work when I crawl the file system using the file protocol.

--- EDIT --- After trying this again today, I noticed that files with spaces in their names were causing 404 errors, and the share I am indexing has a lot of those. However, the thumbs.db file made it through successfully. This tells me the problem is not what I thought it was.
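One way to sidestep 404s on names with spaces is to percent-encode the path before putting the file: URL into the seed list. A minimal sketch in bash, using the path from the fetch log above:

```shell
# Replace spaces with %20 in a local path before using it as a file: seed URL.
raw="/mnt/public/Personal/Reminder Building Security.htm"
encoded="file://${raw// /%20}"
echo "$encoded"
# prints file:///mnt/public/Personal/Reminder%20Building%20Security.htm
```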


I also did a segment dump and found that the PDF text content is being indexed, which is great, because that is exactly what I need. I don't understand why Solr is not being updated with all of that data.


I also tried indexing a single PDF file renamed to a single word. The segment data is there and the text is parsed out, but after running bin/nutch solrindex, no search results show up in Solr...


Still unable to solve this. I have filed an issue with Apache about this problem, and it appears to have been assigned to at least one developer: https://issues.apache.org/jira/browse/NUTCH-1076

Answer


I have spent much of today retracing your steps. I eventually resorted to printf debugging in /opt/nutch/src/java/org/apache/nutch/indexer/IndexerMapReduce.java, which showed me that every URL I tried to index showed up twice: once beginning with file:///var/www/Engineering/, as I had originally specified, and once beginning with file:/u/u60/Engineering/. On this system, /var/www/Engineering is a symlink to /u/u60/Engineering. Furthermore, the /var/www/Engineering URLs were rejected because no parseText field was supplied, and the /u/u60/Engineering URLs were rejected because no fetchDatum field was supplied. Specifying the original URLs in the /u/u60/Engineering form solved my problem. Hope that helps the next person in this situation.
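Based on the symlink behavior described above, it may help to canonicalize seed paths before writing them into the urls directory, so the fetch and index steps agree on a single URL form. A sketch (the /var/www/Engineering path is specific to this answer's system):

```shell
# Emit a file: URL for a path with all symlinks resolved (readlink -f),
# so the same file always yields one canonical URL in the crawldb.
canonical_file_url() {
    printf 'file://%s\n' "$(readlink -f "$1")"
}

# On the system described above, this would turn /var/www/Engineering
# into the file:///u/u60/Engineering form:
# canonical_file_url /var/www/Engineering
```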