Nutch - not crawling, says "Stopping at depth=1 - no more URLs to fetch"

I have been trying to crawl with Nutch, but it does not seem to run. I am building a search site with Solr and using Nutch to do the crawling and indexing into Solr.

Initially there were some permission problems, but those have been fixed. The URL I am trying to crawl is http://172.30.162.202:10200/, which is not publicly accessible; it is an internal URL reachable from the Solr server. I have verified that I can browse it with Lynx.
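For reference, the seed file passed to Nutch should contain one URL per line. A minimal sketch of what url1.txt presumably contains (its actual contents are not shown here):

    $ cat /home/abgu01/urls/url1.txt
    http://172.30.162.202:10200/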

Below is the output of the Nutch command:

[[email protected] local]$ ./bin/nutch crawl /home/abgu01/urls/url1.txt -dir /home/abgu01/crawl -depth 5 -topN 100 
log4j:ERROR setFile(null,true) call failed. 
java.io.FileNotFoundException: /opt/apache-nutch-1.4-bin/runtime/local/logs/hadoop.log (No such file or directory) 
     at java.io.FileOutputStream.open(Native Method) 
     at java.io.FileOutputStream.<init>(FileOutputStream.java:212) 
     at java.io.FileOutputStream.<init>(FileOutputStream.java:136) 
     at org.apache.log4j.FileAppender.setFile(FileAppender.java:290) 
     at org.apache.log4j.FileAppender.activateOptions(FileAppender.java:164) 
     at org.apache.log4j.DailyRollingFileAppender.activateOptions(DailyRollingFileAppender.java:216) 
     at org.apache.log4j.config.PropertySetter.activate(PropertySetter.java:257) 
     at org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:133) 
     at org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:97) 
     at org.apache.log4j.PropertyConfigurator.parseAppender(PropertyConfigurator.java:689) 
     at org.apache.log4j.PropertyConfigurator.parseCategory(PropertyConfigurator.java:647) 
     at org.apache.log4j.PropertyConfigurator.configureRootCategory(PropertyConfigurator.java:544) 
     at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:440) 
     at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:476) 
     at org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.java:471) 
     at org.apache.log4j.LogManager.<clinit>(LogManager.java:125) 
     at org.slf4j.impl.Log4jLoggerFactory.getLogger(Log4jLoggerFactory.java:73) 
     at org.slf4j.LoggerFactory.getLogger(LoggerFactory.java:242) 
     at org.slf4j.LoggerFactory.getLogger(LoggerFactory.java:254) 
     at org.apache.nutch.crawl.Crawl.<clinit>(Crawl.java:43) 
log4j:ERROR Either File or DatePattern options are not set for appender [DRFA]. 
solrUrl is not set, indexing will be skipped... 
crawl started in: /home/abgu01/crawl 
rootUrlDir = /home/abgu01/urls/url1.txt 
threads = 10 
depth = 5 
solrUrl=null 
topN = 100 
Injector: starting at 2012-07-27 15:47:00 
Injector: crawlDb: /home/abgu01/crawl/crawldb 
Injector: urlDir: /home/abgu01/urls/url1.txt 
Injector: Converting injected urls to crawl db entries. 
Injector: Merging injected urls into crawl db. 
Injector: finished at 2012-07-27 15:47:03, elapsed: 00:00:02 
Generator: starting at 2012-07-27 15:47:03 
Generator: Selecting best-scoring urls due for fetch. 
Generator: filtering: true 
Generator: normalizing: true 
Generator: topN: 100 
Generator: jobtracker is 'local', generating exactly one partition. 
Generator: Partitioning selected urls for politeness. 
Generator: segment: /home/abgu01/crawl/segments/20120727154705 
Generator: finished at 2012-07-27 15:47:06, elapsed: 00:00:03 
Fetcher: starting at 2012-07-27 15:47:06 
Fetcher: segment: /home/abgu01/crawl/segments/20120727154705 
Using queue mode : byHost 
Fetcher: threads: 10 
Fetcher: time-out divisor: 2 
QueueFeeder finished: total 1 records + hit by time limit :0 
Using queue mode : byHost 
Using queue mode : byHost 
fetching http://172.30.162.202:10200/ 
-finishing thread FetcherThread, activeThreads=1 
Using queue mode : byHost 
-finishing thread FetcherThread, activeThreads=1 
Using queue mode : byHost 
-finishing thread FetcherThread, activeThreads=1 
Using queue mode : byHost 
-finishing thread FetcherThread, activeThreads=1 
Using queue mode : byHost 
-finishing thread FetcherThread, activeThreads=1 
Using queue mode : byHost 
-finishing thread FetcherThread, activeThreads=1 
Using queue mode : byHost 
-finishing thread FetcherThread, activeThreads=1 
Using queue mode : byHost 
-finishing thread FetcherThread, activeThreads=1 
Using queue mode : byHost 
Fetcher: throughput threshold: -1 
-finishing thread FetcherThread, activeThreads=1 
Fetcher: throughput threshold retries: 5 
-finishing thread FetcherThread, activeThreads=0 
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0 
-activeThreads=0 
Fetcher: finished at 2012-07-27 15:47:08, elapsed: 00:00:02 
ParseSegment: starting at 2012-07-27 15:47:08 
ParseSegment: segment: /home/abgu01/crawl/segments/20120727154705 
ParseSegment: finished at 2012-07-27 15:47:09, elapsed: 00:00:01 
CrawlDb update: starting at 2012-07-27 15:47:09 
CrawlDb update: db: /home/abgu01/crawl/crawldb 
CrawlDb update: segments: [/home/abgu01/crawl/segments/20120727154705] 
CrawlDb update: additions allowed: true 
CrawlDb update: URL normalizing: true 
CrawlDb update: URL filtering: true 
CrawlDb update: 404 purging: false 
CrawlDb update: Merging segment data into db. 
CrawlDb update: finished at 2012-07-27 15:47:10, elapsed: 00:00:01 
Generator: starting at 2012-07-27 15:47:10 
Generator: Selecting best-scoring urls due for fetch. 
Generator: filtering: true 
Generator: normalizing: true 
Generator: topN: 100 
Generator: jobtracker is 'local', generating exactly one partition. 
Generator: 0 records selected for fetching, exiting ... 
Stopping at depth=1 - no more URLs to fetch. 
LinkDb: starting at 2012-07-27 15:47:11 
LinkDb: linkdb: /home/abgu01/crawl/linkdb 
LinkDb: URL normalize: true 
LinkDb: URL filter: true 
LinkDb: adding segment: file:/home/abgu01/crawl/segments/20120727154705 
Exception in thread "main" java.io.IOException: Job failed! 
     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252) 
     at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175) 
     at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:149) 
     at org.apache.nutch.crawl.Crawl.run(Crawl.java:143) 
     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) 
     at org.apache.nutch.crawl.Crawl.main(Crawl.java:55) 

Can anyone suggest why the crawl is not running? Regardless of the values of the depth and topN parameters, it always ends with "Stopping at depth=1 - no more URLs to fetch". Looking at the output above, I think the cause is that the Fetcher is not getting any content back from the URL.
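One way to test that theory is to dump the fetched segment and inspect the fetch status and the outlinks parsed out of the page. A sketch using Nutch's standard readseg tool, with the segment path taken from the log above and /tmp/segdump as a hypothetical output directory:

    ./bin/nutch readseg -dump /home/abgu01/crawl/segments/20120727154705 /tmp/segdump
    less /tmp/segdump/dump

If the dump shows a successful fetch status but no outlinks, the page either contains no links or the parser/URL filters are discarding them (check conf/regex-urlfilter.txt).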

Any input is appreciated!

Answer


A site can block crawling via robots.txt and/or a robots meta tag (name="robots" content="noindex"). Please check for both.
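A quick way to check both from the machine running Nutch (a sketch, assuming curl is available; lynx -source would work just as well):

    curl -s http://172.30.162.202:10200/robots.txt
    curl -s http://172.30.162.202:10200/ | grep -i '<meta name="robots"'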

PS. Two lines in your log also need attention:

1. java.io.FileNotFoundException: /opt/apache-nutch-1.4-bin/runtime/local/logs/hadoop.log (No such file or directory)
2. solrUrl is not set, indexing will be skipped...
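Both are fixable independently of the crawl problem. A sketch, with paths taken from the stack trace above and the Solr URL as a placeholder for your actual instance:

    # 1. log4j cannot create hadoop.log because the logs directory is missing:
    mkdir -p /opt/apache-nutch-1.4-bin/runtime/local/logs
    # 2. pass a Solr endpoint via the crawl command's -solr option so indexing is not skipped:
    ./bin/nutch crawl /home/abgu01/urls/url1.txt -dir /home/abgu01/crawl -solr http://localhost:8983/solr/ -depth 5 -topN 100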