I installed Apache Nutch on a CentOS 6.7 virtual machine and configured it to save its output to MongoDB, but the crawler is not fetching the correct URLs.

The problem is that it is not crawling the right URLs, or rather it is not returning the right URLs. Could this be because of the websites' security?

My conf/regex-urlfilter.txt has the following entry:

# accept anything else 
+^http://*.* 

seed.txt (just for testing purposes) contains:

$ cat urls/seed.txt 
http://www.sears.com/ 

I follow the steps inject -> generate -> fetch -> parse -> updatedb, running each step by hand.
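For reference, the same sequence can usually be chained with the bundled crawl script (a sketch only; the exact arguments, such as an optional Solr URL, vary between Nutch 2.x releases):

# one round of inject/generate/fetch/parse/updatedb via the wrapper script 
$ bin/crawl urls/ testCrawl 1 

The individual steps and their output: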

$ bin/nutch inject urls/ 
InjectorJob: starting at 2017-05-23 18:26:08 
InjectorJob: Injecting urlDir: urls 
InjectorJob: Using class org.apache.gora.mongodb.store.MongoStore as the Gora storage class. 
InjectorJob: total number of urls rejected by filters: 0 
InjectorJob: total number of urls injected after normalization and filtering: 1 
Injector: finished at 2017-05-23 18:26:11, elapsed: 00:00:02 
$ bin/nutch generate -topN 80 
GeneratorJob: starting at 2017-05-23 18:26:17 
GeneratorJob: Selecting best-scoring urls due for fetch. 
GeneratorJob: starting 
GeneratorJob: filtering: true 
GeneratorJob: normalizing: true 
GeneratorJob: topN: 80 
GeneratorJob: finished at 2017-05-23 18:26:21, time elapsed: 00:00:03 
GeneratorJob: generated batch id: 1495581977-876634391 containing 1 URLs 
$ bin/nutch fetch -all 
FetcherJob: starting at 2017-05-23 18:26:32 
FetcherJob: fetching all 
FetcherJob: threads: 10 
FetcherJob: parsing: false 
FetcherJob: resuming: false 
FetcherJob : timelimit set for : -1 
Using queue mode : byHost 
Fetcher: threads: 10 
fetching https://www.facebook.com/LinioEcuador/ (queue crawl delay=5000ms) 
fetching https://www.clubpremier.com/mx/conocenos/niveles/ (queue crawl delay=5000ms) 
fetching https://twitter.com/LinioEcuador/ (queue crawl delay=5000ms) 
fetching https://www.instagram.com/clubpremier/ (queue crawl delay=5000ms) 
fetching https://reservaciones.clubpremier.com/profiles/itineraries.cfm (queue crawl delay=5000ms) 
fetching https://s3.amazonaws.com/club_premier/logo-cp.svg (queue crawl delay=5000ms) 
Fetcher: throughput threshold: -1 
Fetcher: throughput threshold sequence: 5 
QueueFeeder finished: total 49 records. Hit by time limit :0 
fetching https://www.facebook.com/clubpremiermexico (queue crawl delay=5000ms) 
fetching https://s3.amazonaws.com/club_premier/clubpremier-components-info/images/logo-cp.svg (queue crawl delay=5000ms) 
fetching https://twitter.com/clubpremier_mx (queue crawl delay=1000ms) 
10/10 spinwaiting/active, 4 pages, 0 errors, 0.8 1 pages/s, 1151 1151 kb/s, 40 URLs in 2 queues 
fetching https://www.clubpremier.com/mx/acumula/compra/multiplica-puntos-premier (queue crawl delay=5000ms) 
fetching https://reservaciones.clubpremier.com/travel/arc.cfm (queue crawl delay=5000ms) 
10/10 spinwaiting/active, 6 pages, 0 errors, 0.6 0 pages/s, 798 445 kb/s, 38 URLs in 1 queues 
fetching https://www.clubpremier.com/mx/acumula/compra/adquiere-puntos-premier/ (queue crawl delay=5000ms) 
10/10 spinwaiting/active, 7 pages, 0 errors, 0.5 0 pages/s, 606 223 kb/s, 37 URLs in 1 queues 
fetching https://www.clubpremier.com/mx/acumula/aerolineas/skyteam/ (queue crawl delay=5000ms) 
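To double-check what actually lands in the store, the webpage collection can be queried from the mongo shell (a sketch; the database name comes from conf/gora.properties and the collection and field names from conf/gora-mongodb-mapping.xml, so yours may differ):

$ mongo nutch 
> db.webpage.find({}, {baseUrl: 1}).limit(10) 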

As you can see, the generated URLs above are not at all related to the website I want to crawl. Please help me fix this issue.

Thanks, Shilpa

Answers

It looks like the URL filter is configured to accept every page on the web. If the intent is to restrict the crawl to pages in the sears.com domain, the rules could look like:

# allow pages in the domain sears.com 
+^https?://([a-z0-9]+\.)*sears\.com 
# skip anything else 
-.* 
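To verify the rules, Nutch's URLFilterChecker tool can be fed URLs on stdin; it prints each URL back prefixed with + (accepted) or - (rejected). A quick check, assuming a standard Nutch build:

# should come back with a leading '+' under the rules above 
$ echo "http://www.sears.com/" | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined 
# should come back with a leading '-' 
$ echo "https://twitter.com/LinioEcuador/" | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined 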

Also have a look at the following configuration properties:

<property> 
    <name>db.ignore.external.links</name> 
    <value>false</value> 
    <description>If true, outlinks leading from a page to external hosts or domain 
    will be ignored. This is an effective way to limit the crawl to include 
    only initially injected hosts, without creating complex URLFilters. 
    See 'db.ignore.external.links.mode'. 
    </description> 
</property> 

<property> 
    <name>db.ignore.external.links.mode</name> 
    <value>byHost</value> 
    <description>Alternative value is byDomain</description> 
</property>
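If you go this route, override the property in conf/nutch-site.xml rather than editing nutch-default.xml; a minimal sketch that restricts outlinks to the seed host:

<property> 
    <!-- drop outlinks pointing to hosts other than the injected seeds --> 
    <name>db.ignore.external.links</name> 
    <value>true</value> 
</property>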