Nutch crawler taking a very long time

I just want Nutch to give me a list of the URLs it crawled and the status of each link. I don't need the entire page content or any other fluff. Is there any way to do this? Crawling a seed list of 991 URLs at a depth of 3 takes more than 3 hours to fetch and parse, and I'm hoping to speed that up.
Nutch's nutch-default.xml file contains the following:
<property>
<name>file.content.limit</name>
<value>65536</value>
<description>The length limit for downloaded content using the file
protocol, in bytes. If this value is nonnegative (>=0), content longer
than it will be truncated; otherwise, no truncation at all. Do not
confuse this setting with the http.content.limit setting.
</description>
</property>
<property>
<name>file.content.ignored</name>
<value>true</value>
<description>If true, no file content will be saved during fetch.
And it is probably what we want to set most of time, since file:// URLs
are meant to be local and we can always use them directly at parsing
and indexing stages. Otherwise file contents will be saved.
!! NO IMPLEMENTED YET !!
</description>
</property>
<property>
<name>http.content.limit</name>
<value>65536</value>
<description>The length limit for downloaded content using the http
protocol, in bytes. If this value is nonnegative (>=0), content longer
than it will be truncated; otherwise, no truncation at all. Do not
confuse this setting with the file.content.limit setting.
</description>
</property>
<property>
<name>ftp.content.limit</name>
<value>65536</value>
<description>The length limit for downloaded content, in bytes.
If this value is nonnegative (>=0), content longer than it will be truncated;
otherwise, no truncation at all.
Caution: classical ftp RFCs never defines partial transfer and, in fact,
some ftp servers out there do not handle client side forced close-down very
well. Our implementation tries its best to handle such situations smoothly.
</description>
</property>
These are the properties I think might be relevant, but I'm not sure. Can someone give me some help and clarification? Also, I'm getting a lot of URLs with a status code of 38, and I can't find that status code anywhere in this file. Thanks for the help!
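A minimal sketch of what an override might look like, assuming the usual conf/nutch-site.xml override mechanism (the value below is an arbitrary example that truncates each fetched page to 1 KB rather than skipping content entirely):

<property>
<name>http.content.limit</name>
<!-- example value only: truncate downloaded HTTP content to 1024 bytes -->
<value>1024</value>
</property>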
Wow, I don't know why I didn't think of converting the status ID from hexadecimal to decimal; thanks for that. I've been able to speed things up significantly by increasing the number of threads I use, which has cut the same crawl down to 6 minutes. However, the extra "fluff" is still there. I can't figure out the parsing step you describe here. In my database I only want to see 2 fields: an id (the tested URL itself) and a status (the status of that URL after the fetch). – itsNino91
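To spell out the conversion: the status IDs in Nutch's CrawlDatum.java are declared as hex constants, so the decimal 38 reported in the stats output is 0x26 (2 × 16 + 6 = 38), which, if I am reading those constants correctly, corresponds to STATUS_FETCH_NOTMODIFIED.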
The 'bin/nutch readdb crawldb -stats' command only shows aggregate statistics, broken down by status ID with a count of URLs for each. That's not my end goal, but it's still informative. – itsNino91
If you want the status broken down per URL, use bin/nutch readdb crawldb -stats -sort –
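For a literal url-plus-status listing, a sketch that should also work is to dump the whole crawldb and post-process it with standard text tools (the output directory name below is made up):

bin/nutch readdb crawldb -dump dump_dir
# each dumped record pairs a URL with its CrawlDatum fields, including a line
# such as "Status: 2 (db_fetched)", so grep/awk can reduce it to id + status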