2014-08-27

Nutch gurus: which Nutch command do I need to invoke from the command line if I update the URL filter file?

If I change a file such as robots.txt, regex-urlfilter.txt, or any similar resource, which command do I need to invoke?

I can't tell from the Nutch usage text. My guess is that this is the parser's job, but I'm not sure.

Karthik

From the usage text:

    echo " crawl one-step crawler for intranets" 
    echo " inject  inject new urls into the database" 
    echo " hostinject  creates or updates an existing host table from a text file" 
    echo " generate generate new batches to fetch from crawl db" 
    echo " fetch  fetch URLs marked during generate" 
    echo " parse  parse URLs marked during fetch" 
    echo " updatedb update web table after parsing" 
    echo " updatehostdb update host table after parsing" 
    echo " readdb  read/dump records from page database" 
    echo " readhostdb  display entries from the hostDB" 
    echo " elasticindex run the elasticsearch indexer" 
    echo " solrindex run the solr indexer on parsed batches" 
    echo " solrdedup remove duplicates from solr" 
    echo " parsechecker check the parser for a given url" 
    echo " indexchecker check the indexing filters for a given url" 
    echo " plugin  load a plugin and run one of its classes main()" 
    echo " nutchserver run a (local) Nutch server on a user defined port" 
    echo " junit   runs the given JUnit test" 
    echo " or" 
    echo " CLASSNAME run the class named CLASSNAME" 
    echo "Most commands print help when invoked w/o parameters." 

Answer

If you change the regex-urlfilter.txt file, you need to update the Nutch job file. This can be done as follows:

jar -uvf /usr/local/nutch-1.2/nutch-1.2.job <path to regex-urlfilter.txt>