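I ran some experiments with Nutch, crawling websites that make no AJAX calls, and I got all the data. However, Nutch 2.x does not crawl sites like flipkart and jabong.

I run the following steps to fetch the data.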

user@localhost:~/sample/nutch/runtime/local/bin$ ./nutch inject /path/to/the/seed.txt
$: ./nutch generate -batchId 321
$: ./nutch fetch 321
$: ./nutch parse 321
$: ./nutch updatedb

I use HBase as the storage backend, with its files stored on HDFS. If I run these 5 steps with the URL http://www.naaptol.com/brands/nokia/mobile-phones.html, I get all the data. But if I change the URL to http://www.flipkart.com/mens-footwear/shoes/sports-shoes/pr?sid=osp,cil,nit,1cu&otracker=hp_nmenu_sub_men_0_Sports%20Shoes, I get nothing.

My nutch-site.xml file:

<?xml version="1.0"?> 
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?> 

<!-- Put site-specific property overrides in this file. --> 

<configuration> 
     <property> 
       <name>storage.data.store.class</name> 
       <value>org.apache.gora.hbase.store.HBaseStore</value> 
       <description>Default class for storing data</description> 
     </property> 
     <property> 
       <name>http.agent.name</name> 
       <value>com.datametica.agent</value> 
       <description>this is just an agent name</description> 
     </property> 
     <property> 
       <name>http.robots.agents</name> 
       <value>datametica_robot</value> 
       <description>this is just a robot</description> 
     </property> 
     <property> 
       <name>plugin.folders</name> 
       <value>/home/sachin/source_codes/svn/nutch/nutch_2.x/build/plugins</value> 
     </property> 
</configuration>

Source

2014-07-13 saching

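The regex-urlfilter blocks URLs that have query string parameters:

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

Modify that file so that URLs with query string parameters are crawled:

# skip URLs containing certain characters as probable queries, etc.
-[*!@]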
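
To see why this change matters for the two URLs in the question, here is a minimal Python sketch of how the filter rules are evaluated. It is an illustration, not Nutch's own code: it assumes rules are checked in order, that the first rule whose pattern is found anywhere in the URL decides the outcome, and that a URL matching no rule is dropped. The rule lists are trimmed to the rule discussed here plus the catch-all `+.` rule, and the function name `accepts` is made up for this example.

```python
import re

# Rules in the style of regex-urlfilter.txt: a leading '-' rejects,
# a leading '+' accepts. Trimmed to the rules relevant here (assumption:
# the real file contains more rules, e.g. for protocols and file suffixes).
DEFAULT_RULES = ["-[?*!@=]", "+."]
MODIFIED_RULES = ["-[*!@]", "+."]


def accepts(url, rules):
    """Return True if the first rule whose pattern occurs in the URL is a '+' rule."""
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False  # assumed behaviour: a URL that matches no rule is dropped


NAAPTOL = "http://www.naaptol.com/brands/nokia/mobile-phones.html"
FLIPKART = (
    "http://www.flipkart.com/mens-footwear/shoes/sports-shoes/pr"
    "?sid=osp,cil,nit,1cu&otracker=hp_nmenu_sub_men_0_Sports%20Shoes"
)

for label, rules in (("default", DEFAULT_RULES), ("modified", MODIFIED_RULES)):
    for url in (NAAPTOL, FLIPKART):
        verdict = "accepted" if accepts(url, rules) else "rejected"
        print(label, verdict, url[:45])
```

Under these assumptions the naaptol URL passes both rule sets because it contains none of the listed characters, while the flipkart URL contains `?` and `=`, so it only passes once those two characters are removed from the character class.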

Nutch may lack support for crawling AJAX pages. See this

You might want to look at https://issues.apache.org/jira/browse/NUTCH-1323

Source

2014-07-13 08:36:34

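Thanks man, it works, but I have one more problem. When I run Nutch I now get data, but for http://www.flipkart.com/mens-footwear/shoes/sports-shoes/pr?sid=osp,cil,nit,1cu&otracker=hp_nmenu_sub_men_0_Sports%20Shoes the content is not HTML, whereas for http://www.naaptol.com/brands/nokia/mobile-phones.html I get the HTML content. Let me know if you know anything about this. – saching

Sachin, you should accept the answers people give you. You can ask as many questions as you need, but it would be good to give some credit to the people who have answered them. –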
