Nutch crawler takes a long time

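I only want Nutch to give me a list of the URLs it crawled and the status of each link. I don't need the full page content or any other fluff. Is there a way to do this? Crawling a seed list of 991 URLs to a depth of 3 takes more than 3 hours to fetch and parse. I'm hoping this will speed things up.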

The nutch-default.xml file contains:

<property> 
    <name>file.content.limit</name> 
    <value>65536</value> 
    <description>The length limit for downloaded content using the file 
    protocol, in bytes. If this value is nonnegative (>=0), content longer 
    than it will be truncated; otherwise, no truncation at all. Do not 
    confuse this setting with the http.content.limit setting. 
    </description> 
</property> 

<property> 
    <name>file.content.ignored</name> 
    <value>true</value> 
    <description>If true, no file content will be saved during fetch. 
    And it is probably what we want to set most of time, since file:// URLs 
    are meant to be local and we can always use them directly at parsing 
    and indexing stages. Otherwise file contents will be saved. 
    !! NO IMPLEMENTED YET !! 
    </description> 
</property> 

<property> 
    <name>http.content.limit</name> 
    <value>65536</value> 
    <description>The length limit for downloaded content using the http 
    protocol, in bytes. If this value is nonnegative (>=0), content longer 
    than it will be truncated; otherwise, no truncation at all. Do not 
    confuse this setting with the file.content.limit setting. 
    </description> 
</property> 

<property> 
    <name>ftp.content.limit</name> 
    <value>65536</value> 
    <description>The length limit for downloaded content, in bytes. 
    If this value is nonnegative (>=0), content longer than it will be truncated; 
    otherwise, no truncation at all. 
    Caution: classical ftp RFCs never defines partial transfer and, in fact, 
    some ftp servers out there do not handle client side forced close-down very 
    well. Our implementation tries its best to handle such situations smoothly. 
    </description> 
</property>

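These are the properties I think might be relevant, but I'm not sure. Can someone give me some help and clarification? Also, I'm getting a lot of URLs with a status code of 38. I can't find that status code in this file. Thanks for your help!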
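For readers unsure what the `*.content.limit` descriptions above mean in practice, here is a minimal Python sketch of the documented rule only (a nonnegative limit truncates, a negative one does not). The function name `apply_content_limit` and the sample page are hypothetical; this is not Nutch's own code.

```python
def apply_content_limit(content: bytes, limit: int) -> bytes:
    """Mimic the documented semantics of the *.content.limit properties."""
    if limit >= 0:
        # Nonnegative limit: keep at most `limit` bytes.
        return content[:limit]
    # Negative limit: no truncation at all.
    return content


page = b"x" * 100_000  # hypothetical 100,000-byte response body

print(len(apply_content_limit(page, 65536)))  # 65536 (the default shown above)
print(len(apply_content_limit(page, -1)))     # 100000 (truncation disabled)
```

Note that a smaller limit also cuts off any links that appear after the truncation point, so it would likely reduce the number of URLs discovered as well as the amount of content stored.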


2015-05-13 itsNino91

After fetching a URL, Nutch parses it to extract all of the outlinks from the fetched page. Those outlinks are used as the new fetch list for the next round.

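If you skip parsing, no new URLs are generated, so nothing further is fetched. One approach I can think of is to configure the parse plugins to include only the type of content you need parsed (in your case, the outlinks). Here is an example: https://wiki.apache.org/nutch/IndexMetatags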
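The round-by-round behaviour described above can be sketched with a toy in-memory link graph. Everything here (the `LINKS` table, the `crawl` function, the example URLs) is hypothetical and only illustrates why skipping the parse step stops the crawl after the seed round; it does not use any Nutch API.

```python
from typing import Dict, List, Set

# Hypothetical link graph standing in for the web: page -> outlinks found by parsing.
LINKS: Dict[str, List[str]] = {
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://d.example/"],
    "http://c.example/": [],
    "http://d.example/": ["http://a.example/"],
}


def crawl(seeds: List[str], depth: int, parse: bool = True) -> List[Set[str]]:
    """Return the set of URLs fetched in each round."""
    seen: Set[str] = set()
    frontier: Set[str] = set(seeds)
    rounds: List[Set[str]] = []
    for _ in range(depth):
        if not frontier:
            break  # nothing left to fetch
        rounds.append(frontier)
        seen |= frontier
        next_frontier: Set[str] = set()
        if parse:
            # Parsing extracts outlinks, which become the next round's fetch list.
            for url in frontier:
                next_frontier.update(LINKS.get(url, []))
        # Do not refetch URLs that were already visited.
        frontier = next_frontier - seen
    return rounds


print([len(r) for r in crawl(["http://a.example/"], 3)])               # [1, 2, 1]
print([len(r) for r in crawl(["http://a.example/"], 3, parse=False)])  # [1]
```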

This link describes the features of the parser: https://wiki.apache.org/nutch/Features

Now, to get only the list of fetched URLs with their statuses, you can use the command:

$ bin/nutch readdb crawldb -stats

As for status code 38: looking at the file you linked, the URL status appears to be public static final byte STATUS_FETCH_NOTMODIFIED = 0x26

because hexadecimal 26 corresponds to decimal 38.
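The hexadecimal-to-decimal lookup can be automated. The sketch below assumes you paste the status declarations from your own Nutch version into `JAVA_CONSTANTS`; only the single line quoted in this answer is included here, and the helper name `status_table` is made up for illustration.

```python
import re
from typing import Dict

# Only the declaration quoted in the answer; extend with lines copied from
# the status-constants file of your own Nutch version.
JAVA_CONSTANTS = """
public static final byte STATUS_FETCH_NOTMODIFIED = 0x26
"""

# Matches "byte STATUS_NAME = 0x.." and captures the name and the hex literal.
PATTERN = re.compile(r"byte\s+(STATUS_\w+)\s*=\s*(0x[0-9A-Fa-f]+)")


def status_table(java_source: str) -> Dict[int, str]:
    """Map decimal status codes to their constant names."""
    table: Dict[int, str] = {}
    for name, literal in PATTERN.findall(java_source):
        # int(..., 16) accepts the 0x prefix, so "0x26" becomes 38.
        table[int(literal, 16)] = name
    return table


print(int("26", 16))                      # 38
print(status_table(JAVA_CONSTANTS)[38])   # STATUS_FETCH_NOTMODIFIED
```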

I hope this answer gives you some direction :)


2015-05-15 12:35:42

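Wow, I don't know why I didn't think of converting the status ID from hexadecimal to decimal, thanks for that. I've been able to speed things up significantly by increasing the number of threads I use. That cut the time for the same crawl down to 6 minutes. The extra "fluff" is still there, though. I can't figure out the parsing approach you describe here. In my database I only want to see 2 fields: an id (the URL being tested) and a status (the URL's status after fetching). – itsNino91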

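The 'bin/nutch readdb crawldb -stats' command only shows overall statistics, broken down by status ID with the number of URLs for each. That isn't my end goal, but it's still informative. – itsNino91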
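To make the distinction concrete, here is a small Python sketch contrasting an aggregate per-status count with the per-URL listing being asked for. The `RESULTS` data is entirely hypothetical (made-up URLs; apart from 38, the status codes are arbitrary sample values), and nothing here reads an actual crawldb.

```python
from collections import Counter
from typing import Dict

# Hypothetical per-URL results: URL -> decimal status code.
RESULTS: Dict[str, int] = {
    "http://a.example/": 38,
    "http://b.example/": 33,  # arbitrary sample value
    "http://c.example/": 38,
}


def summarize(results: Dict[str, int]) -> Counter:
    """Aggregate view: how many URLs ended up with each status code."""
    return Counter(results.values())


# Aggregate statistics: one line per status code.
for status, count in sorted(summarize(RESULTS).items()):
    print(f"status {status}: {count} URLs")

# Per-URL breakdown: the two fields wanted above (URL and status).
for url, status in sorted(RESULTS.items()):
    print(f"{url}\t{status}")
```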

If you want the status broken down per URL, use bin/nutch readdb crawldb -stats -sort –
