How to crawl .pdf links using Apache Nutch

I was given a website that contains links to some PDF files. I want Nutch to crawl those links and dump them as .pdf files. I am using Apache Nutch 1.6, and I am trying this from Java as follows:

ToolRunner.run(NutchConfiguration.create(), new Crawl(), 
           tokenize(crawlArg)); 
SegmentReader.main(tokenize(dumpArg));

This is what I am trying. Can someone help me with this?

Source

2013-07-03 sudheer

-1

You can write your own plugin for the PDF MIME type,
or use the embedded Apache Tika parser, which can extract text from PDFs.

Source

2013-10-10 06:41:22 olzhas

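If you want Nutch to crawl and index your PDF documents, you have to enable document crawling and the Tika plugin:

Document crawling

1.1 Edit regex-urlfilter.txt and remove any occurrence of "pdf"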

# skip image and other suffixes we can't yet parse 
# for a more extensive coverage use the urlfilter-suffix plugin 
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

1.2 Edit suffix-urlfilter.txt and remove any occurrence of "pdf"

1.3 Edit nutch-site.xml and add "parse-tika" and "parse-html" to the plugin.includes section

<property> 
    <name>plugin.includes</name> 
    <value>protocol-http|urlfilter-regex|parse-(html|tika|text)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value> 
    <description>Regular expression naming plugin directory names to 
    include. Any plugin not matching this expression is excluded. 
    In any case you need at least include the nutch-extensionpoints plugin. By 
    default Nutch includes crawling just HTML and plain text via HTTP, 
    and basic indexing and search plugins. In order to use HTTPS please enable 
    protocol-httpclient, but be aware of possible intermittent problems with the 
    underlying commons-httpclient library. 
    </description> 
</property>
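As a quick sanity check of steps 1.1 and 1.3, the two regular expressions above can be tried out in plain Java before starting a crawl. This is only a sketch: the class and method names are made up for illustration, and it assumes that Nutch treats a URL as rejected when the "-" rule is found anywhere in it, and that it enables a plugin only when the whole plugin directory name matches plugin.includes.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: it only replays the two regexes from the config above.
class NutchPdfConfigCheck {
    // The skip rule from regex-urlfilter.txt with "pdf" removed (the leading "-" is dropped).
    static final Pattern SKIPPED_SUFFIXES = Pattern.compile(
        "\\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP"
        + "|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$");

    // The plugin.includes value from nutch-site.xml.
    static final Pattern PLUGIN_INCLUDES = Pattern.compile(
        "protocol-http|urlfilter-regex|parse-(html|tika|text)|index-(basic|anchor)"
        + "|scoring-opic|urlnormalizer-(pass|regex|basic)");

    // Assumption: a "-" rule rejects a URL if the pattern is found anywhere in it.
    static boolean isSkippedBySuffixRule(String url) {
        return SKIPPED_SUFFIXES.matcher(url).find();
    }

    // Assumption: a plugin is enabled only if its whole directory name matches.
    static boolean isPluginIncluded(String pluginId) {
        return PLUGIN_INCLUDES.matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        System.out.println(isSkippedBySuffixRule("http://example.com/report.pdf")); // false
        System.out.println(isSkippedBySuffixRule("http://example.com/logo.png"));   // true
        System.out.println(isPluginIncluded("parse-tika"));                         // true
        System.out.println(isPluginIncluded("urlfilter-suffix"));                   // false
    }
}
```

With the plugin.includes value shown above, urlfilter-suffix does not match, so step 1.2 probably only matters if that plugin is enabled in your setup.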

If what you really want is to download all the PDF files from a page, you can use something like Teleport on Windows or Wget on *nix.
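If you would rather stay in Java than use one of those tools, the first half of the job (finding the PDF links on a page) might look roughly like the sketch below. It is not how Wget or Teleport work internally. The class name, method name and sample HTML are invented for illustration, and a regex over raw HTML is a deliberate simplification: it misses unquoted attributes and links with query strings such as file.pdf?x=1.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: collects href values that end in ".pdf" from an HTML string.
class PdfLinkExtractor {
    // Matches href="..." or href='...' whose value ends in .pdf (case-insensitive).
    private static final Pattern PDF_HREF = Pattern.compile(
        "href\\s*=\\s*[\"']([^\"']+\\.pdf)[\"']", Pattern.CASE_INSENSITIVE);

    static List<String> pdfLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = PDF_HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1)); // the link exactly as written in the page, possibly relative
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"docs/a.pdf\">A</a> <a href='b.html'>B</a> "
                    + "<a href=\"http://example.com/C.PDF\">C</a>";
        System.out.println(pdfLinks(html)); // [docs/a.pdf, http://example.com/C.PDF]
    }
}
```

Relative links would still have to be resolved against the page URL and then downloaded, which is the part the tools mentioned above already handle for you.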

Source

2013-10-12 15:06:07 nimeshjm
