Does Solr do web crawling?

9

Solr itself does not have web crawling functionality.

Nutch is the "de facto" crawler for Solr (and then some).

Source

2009-11-23 05:30:13 mjv

20

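Solr 5+ can in fact do web crawling now! http://lucene.apache.org/solr/

Older Solr versions do not do web crawling on their own, because historically Solr has been a search server that provides full-text search capabilities. It is built on top of Lucene.

If you need to crawl web pages using another project alongside Solr, you have a number of options, including:

Nutch - http://lucene.apache.org/nutch/
Websphinx - http://www.cs.cmu.edu/~rcm/websphinx/
JSpider - http://j-spider.sourceforge.net/
Heritrix - http://crawler.archive.org/

If you want to use the search facilities provided by Lucene or Solr, you will need to build an index from the web crawl results.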
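To make "build an index from the crawl results" a bit more concrete, here is a minimal Python sketch of turning crawled pages into documents for Solr's JSON update handler. The field names (`id`, `title`, `content`), the `crawled` sample data, and the `gettingstarted` collection are assumptions for illustration; your schema decides the real field names. The sketch only builds the request and does not send it.

```python
import json
import urllib.request


def to_solr_docs(pages):
    """Turn crawled pages into Solr documents (field names are assumptions)."""
    docs = []
    for url, page in pages.items():
        docs.append({
            "id": url,  # the URL doubles as the unique key
            "title": page["title"],
            "content": page["text"],
        })
    return docs


def build_update_request(solr_url, collection, docs):
    """Build (but do not send) a POST to the collection's JSON update handler."""
    endpoint = f"{solr_url.rstrip('/')}/{collection}/update?commit=true"
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical crawl output: URL -> extracted title and text.
crawled = {
    "http://example.com/": {"title": "Home", "text": "Welcome to the example site."},
    "http://example.com/about": {"title": "About", "text": "About this site."},
}

request = build_update_request(
    "http://localhost:8983/solr", "gettingstarted", to_solr_docs(crawled)
)
print(request.full_url)
# With a Solr instance running, the documents could then be sent with:
# urllib.request.urlopen(request)
```

Keeping the request construction separate from the network call makes it easy to inspect what would be posted before pointing it at a real server.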

See also:

Lucene crawler (it needs to build lucene index)

Source

2009-11-23 05:35:59 Jon

+5

Can you elaborate on «Solr 5+ can in fact do web crawling now»? I did not see any crawling feature anywhere in the documentation. – taharqa

0

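Definitely Nutch! Nutch also has a basic web front end that lets you query your search results. Depending on your requirements, you might not even need to bother with Solr. If you do use a Nutch/Solr combination, you should be able to take advantage of the recent work done to integrate Solr and Nutch ... http://issues.apache.org/jira/browse/NUTCH-442

Source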

2009-11-23 05:45:59 wmitchell

1

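I have been using Nutch with Solr on my latest project, and it seems to work quite well.

If you are on a Windows machine, then I strongly recommend following the 'No cygwin' instructions given by Jason Riffel!

Source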

2010-12-31 09:44:00

1

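Yes, I agree with the other posts here: use Apache Nutch.

bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5

Note, though, that your Solr version has to match the right version of Nutch, because older versions of Solr store their indexes in a different format.

Its tutorial: http://wiki.apache.org/nutch/NutchTutorial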
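For anyone wondering what `-depth 3 -topN 5` mean in that command: as I understand the tutorial, `-depth` is the link depth from the seed pages and `-topN` caps how many pages are fetched at each level. Here is a rough Python sketch of that idea over an in-memory link graph. It is not Nutch's code: Nutch picks the top-scoring URLs, while this sketch just takes them in discovery order and drops the rest, and `link_graph` is made-up sample data.

```python
def crawl_rounds(link_graph, seeds, depth, top_n):
    """Return the URLs fetched in each round of a depth-limited crawl."""
    seen = set(seeds)
    frontier = list(seeds)
    rounds = []
    for _ in range(depth):
        batch = frontier[:top_n]  # fetch at most top_n URLs per round
        if not batch:
            break
        rounds.append(batch)
        next_frontier = []
        for url in batch:
            for link in link_graph.get(url, []):
                if link not in seen:  # only queue URLs not seen before
                    seen.add(link)
                    next_frontier.append(link)
        frontier = next_frontier
    return rounds


# Hypothetical site structure: page -> pages it links to.
link_graph = {
    "seed": ["a", "b", "c"],
    "a": ["d", "e"],
    "b": ["e", "f"],
    "c": ["g"],
}

for number, batch in enumerate(crawl_rounds(link_graph, ["seed"], depth=3, top_n=2)):
    print(number, batch)
```

With `top_n=2`, page `c` is never fetched, so `g` is never discovered. That is the trade-off a small `-topN` makes on a real crawl too.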

Source

2011-09-30 14:23:00 Joyce

2

You might also want to take a look at

http://www.crawl-anywhere.com/

A very powerful crawler that is compatible with Solr.

Source

2011-10-02 15:05:43

1

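I know it's been a while, but in case someone else is searching for a Solr crawler like I was, there is an open-source crawler called Norconex HTTP Collector.

Source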

2015-05-14 17:27:48 Loransian

3

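Solr 5 started supporting simple web crawling (Java Doc). If you want search, Solr is the tool; if you want to crawl, Nutch/Scrapy is better :)

To get it up and running, you can take a look here. However, this is how to get it up and running in a single command: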

# -Dc=gettingstarted   collection: gettingstarted
# -Ddata=web           web crawling and indexing
# -Drecursive=3        go 3 levels deep
# -Ddelay=0            for the impatient; use 10+ for production
# last argument        the site to crawl (a testing WordPress blog)
java -classpath <pathtosolr>/dist/solr-core-5.4.1.jar \
  -Dauto=yes \
  -Dc=gettingstarted \
  -Ddata=web \
  -Drecursive=3 \
  -Ddelay=0 \
  org.apache.solr.util.SimplePostTool \
  http://datafireball.com/

The crawler here is very "naive"; you can find all the code in this Apache Solr GitHub repo.
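To give a feel for what a "naive" crawler of this kind does, here is a small Python sketch: fetch a page, pull out the links, and walk them level by level with an optional delay. It is an illustration under my own assumptions, not a port of SimplePostTool. `LinkExtractor`, `naive_crawl`, the `fetch` callback, and the `pages` sample are all made up for the example, and the sketch ignores robots.txt, content types, and host restrictions that a real crawl should care about.

```python
import time
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute link targets from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links and drop any #fragment part.
                absolute, _fragment = urldefrag(urljoin(self.base_url, href))
                self.links.append(absolute)


def naive_crawl(start_url, fetch, max_depth, delay_seconds=0):
    """Crawl level by level; `fetch(url)` must return the page's HTML text."""
    seen = {start_url}
    level = [start_url]
    visited = []
    for depth in range(max_depth + 1):  # levels 0..max_depth inclusive
        if not level:
            break
        next_level = []
        for url in level:
            visited.append((url, depth))
            parser = LinkExtractor(url)
            parser.feed(fetch(url))
            for link in parser.links:
                if link not in seen:
                    seen.add(link)
                    next_level.append(link)
            time.sleep(delay_seconds)  # be polite between requests
        level = next_level
    return visited


# Hypothetical in-memory "site" so the sketch runs without a network.
pages = {
    "http://site.test/": '<a href="/a">A</a> <a href="b#top">B</a>',
    "http://site.test/a": '<a href="/">home</a> <a href="/c">C</a>',
    "http://site.test/b": "",
    "http://site.test/c": '<a href="/d">D</a>',
}


def fetch(url):
    return pages.get(url, "")


for url, depth in naive_crawl("http://site.test/", fetch, max_depth=1):
    print(depth, url)
```

Passing `fetch` in as a function keeps the crawl logic testable offline; for a live crawl it could wrap `urllib.request.urlopen`, with a delay well above zero.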

Here is what the output looks like:

SimplePostTool version 5.0.0 
Posting web pages to Solr url http://localhost:8983/solr/gettingstarted/update/extract 
Entering auto mode. Indexing pages with content-types corresponding to file endings xml,json,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log 
SimplePostTool: WARNING: Never crawl an external web site faster than every 10 seconds, your IP will probably be blocked 
Entering recursive mode, depth=3, delay=0s 
Entering crawl at level 0 (1 links total, 1 new) 
POSTed web resource http://datafireball.com (depth: 0) 
Entering crawl at level 1 (52 links total, 51 new) 
POSTed web resource http://datafireball.com/2015/06 (depth: 1) 
... 
Entering crawl at level 2 (266 links total, 215 new) 
... 
POSTed web resource http://datafireball.com/2015/08/18/a-few-functions-about-python-path (depth: 2) 
... 
Entering crawl at level 3 (846 links total, 656 new) 
POSTed web resource http://datafireball.com/2014/09/06/node-js-web-scraping-using-cheerio (depth: 3) 
SimplePostTool: WARNING: The URL http://datafireball.com/2014/09/06/r-lattice-trellis-another-framework-for-data-visualization/?share=twitter returned a HTTP result status of 302 
423 web pages indexed. 
COMMITting Solr index changes to http://localhost:8983/solr/gettingstarted/update/extract... 
Time spent: 0:05:55.059

In the end, you can see that all the data is indexed properly.
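One way to check that for yourself is a match-all query against the collection's `select` handler and a look at `numFound`. The Python sketch below builds such a URL and reads the count from a JSON response. The `sample_response` string is hand-written for illustration (not captured from a real server), and the local Solr URL and `gettingstarted` collection are assumed from the command above.

```python
import json
from urllib.parse import urlencode


def count_query_url(solr_url, collection):
    """URL for a match-all query that returns only the hit count."""
    params = urlencode({"q": "*:*", "rows": 0, "wt": "json"})
    return f"{solr_url.rstrip('/')}/{collection}/select?{params}"


def num_found(response_text):
    """Extract the total hit count from a Solr JSON select response."""
    return json.loads(response_text)["response"]["numFound"]


url = count_query_url("http://localhost:8983/solr", "gettingstarted")
print(url)

# Hand-written sample of the response shape, for illustration only.
sample_response = (
    '{"responseHeader": {"status": 0},'
    ' "response": {"numFound": 423, "start": 0, "docs": []}}'
)
print(num_found(sample_response))
```

If the count roughly matches the "web pages indexed" line printed by the tool, the crawl results made it into the index.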

Source

2016-02-20 16:44:35
