Configuring Nutch regex-normalize.xml

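I am using the Java-based Nutch web search software. To prevent duplicate (URL) results from being returned in my search query results, I am trying to remove (a.k.a. normalize) the 'jsessionid' expressions from the URLs being indexed when the Nutch crawler indexes my intranet. However, my modifications to $NUTCH_HOME/conf/regex-normalize.xml (made before running my crawl) do not seem to have any effect.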

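How can I ensure that my regex-normalize.xml configuration is actually being used for my crawl? And,
Which regular expression will successfully remove/normalize the 'jsessionid' expressions from the URL during crawling/indexing?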

Here is the current content of my regex-normalize.xml:

<?xml version="1.0"?> 
<regex-normalize> 
<regex> 
<pattern>(.*);jsessionid=(.*)$</pattern> 
<substitution>$1</substitution> 
</regex> 
<regex> 
<pattern>(.*);jsessionid=(.*)(\&amp;|\&amp;amp;)</pattern> 
<substitution>$1$3</substitution> 
</regex> 
<regex> 
<pattern>;jsessionid=(.*)</pattern> 
<substitution></substitution> 
</regex> 
</regex-normalize>

Here is the command I issue to run my (test) crawl:

bin/nutch crawl urls -dir /tmp/test/crawl_test -depth 3 -topN 500

2009-11-17 Anand Krishnan

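Which version of Nutch are you using? I'm not familiar with Nutch, but the default download of Nutch 1.0 already contains a rule in regex-normalize.xml that seems to handle this problem.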

<!-- removes session ids from urls (such as jsessionid and PHPSESSID) --> 
<regex> 
    <pattern>([;_]?((?i)l|j|bv_)?((?i)sid|phpsessid|sessionid)=.*?)(\?|&amp;|#|$)</pattern> 
    <substitution>$4</substitution> 
</regex>
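If your regex-normalize.xml does not contain this rule yet, a complete file holding only that rule could be sketched as below. It assumes the same `<regex-normalize>` layout shown in the question. I have not verified that the regex engine in older Nutch releases accepts every construct in the pattern (for example the inline `(?i)` flag), so treat it as a starting point.

```xml
<?xml version="1.0"?>
<!-- Sketch: a regex-normalize.xml containing only the session-id rule quoted above -->
<regex-normalize>
  <!-- removes session ids from urls (such as jsessionid and PHPSESSID) -->
  <regex>
    <!-- group 4 is the delimiter that follows the session id (?, &, # or end of URL) -->
    <pattern>([;_]?((?i)l|j|bv_)?((?i)sid|phpsessid|sessionid)=.*?)(\?|&amp;|#|$)</pattern>
    <!-- keep only that delimiter, so any query string after the session id survives -->
    <substitution>$4</substitution>
  </regex>
</regex-normalize>
```

Unlike a rule that substitutes everything from `;jsessionid=` to the end of the URL, this one stops at the first `?`, `&` or `#`, so query parameters after the session id are kept.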

Btw, regex-urlfilter.txt seems to contain something relevant too:

# skip URLs containing certain characters as probable queries, etc. 
-[?*!@=]

Then there are some settings in nutch-default.xml that you might want to look at:

urlnormalizer.order 
urlnormalizer.regex.file 
plugin.includes
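For orientation, overriding the first two of these properties in conf/nutch-site.xml might look roughly like the sketch below. The class names and the file value are what I recall from the Nutch 1.0 defaults and are assumptions here, so check them against your own nutch-default.xml before relying on them.

```xml
<?xml version="1.0"?>
<!-- Sketch of conf/nutch-site.xml overrides; values are assumed Nutch 1.0 defaults -->
<configuration>
  <property>
    <name>urlnormalizer.order</name>
    <!-- normalizers run in this order; the regex normalizer must be listed to take effect -->
    <value>org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer</value>
  </property>
  <property>
    <name>urlnormalizer.regex.file</name>
    <!-- name of the rules file, looked up on the classpath (normally conf/) -->
    <value>regex-normalize.xml</value>
  </property>
</configuration>
```

The regex normalizer also has to be enabled through plugin.includes, otherwise the rules file is never read.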

If none of those settings help, maybe this does: How can I force fetcher to use custom nutch-config?

2009-11-17 23:19:15 jitter

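I'm using Nutch version 0.8.1. This version has the following setting in nutch-default.xml: urlnormalizer.class ... instead of urlnormalizer.order. I changed its value from org.apache.nutch.net.BasicUrlNormalizer to org.apache.nutch.net.RegexUrlNormalizer. That is what caused the regex-normalize.xml file to actually take part in the crawl. In addition, I added the following to the plugin.includes value: urlnormalizer-(pass|regex|basic), which is not included by default in version 0.8.1. Thanks a lot for pointing me in the right direction! – 2009-11-20 20:42:47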

No problem. Now consider upvoting and accepting my answer – jitter 2009-11-20 22:02:33
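The Nutch 0.8.1 fix described in the comment above could be written as overrides in conf/nutch-site.xml, roughly as sketched here. The property names and values are taken from that comment. Placing them in nutch-site.xml inside a `<configuration>` element is an assumption based on the usual Nutch override convention, and the `...` stands for whatever plugin.includes value your installation already has.

```xml
<?xml version="1.0"?>
<!-- Sketch of conf/nutch-site.xml for Nutch 0.8.1, based on the comment above -->
<configuration>
  <property>
    <name>urlnormalizer.class</name>
    <!-- switch from BasicUrlNormalizer so that regex-normalize.xml is actually applied -->
    <value>org.apache.nutch.net.RegexUrlNormalizer</value>
  </property>
  <property>
    <name>plugin.includes</name>
    <!-- "..." is a placeholder for your existing plugin list; append the urlnormalizer entry -->
    <value>...|urlnormalizer-(pass|regex|basic)</value>
  </property>
</configuration>
```

After changing these, rerun the crawl into a fresh directory so URLs stored by the earlier run do not hide the effect of the new rules.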
