How to get HTML content from Nutch

-2

Yes, there is a way. Take a look at cache.jsp and see how it displays the cached data.

2011-03-08 17:19:09 millebii

1

Try this:

public ParseResult filter(Content content, ParseResult parseResult,
        HTMLMetaTags metaTags, DocumentFragment doc) {
    // Look up the parse for this URL and log its extracted text.
    Parse parse = parseResult.get(content.getUrl());
    LOG.info("parse.getText: " + parse.getText());
    return parseResult;
}

Then check the contents of hadoop.log.


2012-01-25 10:44:14

8

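Yes, you can actually export the content of the crawled segments. It is not straightforward, but it works well for me. First, create a Java project with the following code: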

import org.apache.hadoop.conf.Configuration; 
import org.apache.hadoop.fs.FileSystem; 
import org.apache.hadoop.fs.Path; 
import org.apache.hadoop.io.SequenceFile; 
import org.apache.hadoop.io.Text; 
import org.apache.nutch.protocol.Content; 
import org.apache.nutch.util.NutchConfiguration; 

import java.io.File; 
import java.io.FileOutputStream; 

public class NutchSegmentOutputParser { 

public static void main(String[] args) { 

    if (args.length != 2) { 
     System.out.println("usage: segmentdir outputdir"); 
     return; 
    } 

    try { 
     Configuration conf = NutchConfiguration.create(); 
     FileSystem fs = FileSystem.get(conf); 


     String segment = args[0]; 

     File outDir = new File(args[1]); 
     if (!outDir.exists()) { 
      if (outDir.mkdir()) { 
       System.out.println("Creating output dir " + outDir.getAbsolutePath()); 
      } 
     } 

     Path file = new Path(segment, Content.DIR_NAME + "/part-00000/data"); 
     SequenceFile.Reader reader = new SequenceFile.Reader(fs, file, conf); 


     Text key = new Text(); 
     Content content = new Content(); 

     while (reader.next(key, content)) { 
      String filename = key.toString().replaceFirst("http://", "").replaceAll("/", "___").trim(); 

      File f = new File(outDir.getCanonicalPath() + "/" + filename); 
      FileOutputStream fos = new FileOutputStream(f); 
      fos.write(content.getContent()); 
      fos.close(); 
      System.out.println(f.getAbsolutePath()); 
     } 
     reader.close(); 
     fs.close(); 
    } catch (Exception e) { 
     e.printStackTrace(); 
    } 

}

}

I recommend using Maven; add the following dependencies:

 <dependency> 
     <groupId>org.apache.nutch</groupId> 
     <artifactId>nutch</artifactId> 
     <version>1.5.1</version> 
    </dependency> 

    <dependency> 
     <groupId>org.apache.hadoop</groupId> 
     <artifactId>hadoop-common</artifactId> 
     <version>0.23.1</version> 
    </dependency>

Then build a jar package (i.e. NutchSegmentOutputParser.jar).

You need Hadoop installed on your machine. Then run:

$/hadoop-dir/bin/hadoop --config \ 
NutchSegmentOutputParser.jar:~/.m2/repository/org/apache/nutch/nutch/1.5.1/nutch-1.5.1.jar \ 
NutchSegmentOutputParser nutch-crawled-dir/2012xxxxxxxxx/ outdir

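Here nutch-crawled-dir/2012xxxxxxxxx/ is the crawled directory you want to extract content from (it contains the "segment" subdirectory) and outdir is the output directory. The output file names are generated from the URI, with each slash replaced by "___".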
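As a rough illustration of the naming rule described above, the sketch below applies the same replace chain as the loop in NutchSegmentOutputParser to two sample URLs. The class name SegmentFileNames, the method toFileName, and the example.com URLs are made up for this illustration and are not part of Nutch. Note that only a leading "http://" is stripped, so an https URL keeps its scheme in the file name.

```java
// Hypothetical helper, not part of Nutch: it isolates the file-name rule
// used inside the while loop of NutchSegmentOutputParser.
final class SegmentFileNames {
    private SegmentFileNames() {}

    // Drop the first "http://" and turn every "/" into "___".
    static String toFileName(String url) {
        return url.replaceFirst("http://", "").replaceAll("/", "___").trim();
    }

    public static void main(String[] args) {
        // prints example.com___docs___index.html
        System.out.println(toFileName("http://example.com/docs/index.html"));
        // prints https:______example.com___a (the https scheme is not stripped)
        System.out.println(toFileName("https://example.com/a"));
    }
}
```

If https pages are part of the crawl, the pattern could be widened to cover both schemes, but that would be a change to the answer's code rather than a description of it.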

Hope it helps.


2012-10-24 06:56:36 Habi

+0

Great answer, with Javadoc. I created a SequenceFileReader for Spring Batch, – 2015-05-06 10:52:41

0

It's super basic.

public ParseResult getParse(Content content) { 
    LOG.info("getContent: " + new String(content.getContent()));

The Content object has a method getContent(), which returns a byte array. Just have Java create a new String() from that byte array, and you have the raw HTML that Nutch fetched.
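One caveat worth sketching: new String(byte[]) decodes with the platform default charset, so the result can vary between machines. The example below passes an explicit charset instead. It assumes the page is UTF-8, which is only an assumption (a real page may declare a different encoding), and the class name RawHtmlDecoding and the sample bytes are hypothetical stand-ins for what content.getContent() would return.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Nutch: decodes raw page bytes into a String.
final class RawHtmlDecoding {
    private RawHtmlDecoding() {}

    // Assumption: the bytes are UTF-8 encoded HTML.
    static String decode(byte[] rawBytes) {
        return new String(rawBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Stand-in for the byte array returned by content.getContent().
        byte[] rawBytes = "<html><body>caf\u00e9</body></html>"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(decode(rawBytes));
    }
}
```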

I am using Nutch 1.9.

Here is the Javadoc for org.apache.nutch.protocol.Content: https://nutch.apache.org/apidocs/apidocs-1.2/org/apache/nutch/protocol/Content.html#getContent()


2015-03-24 17:36:16
