
Answers


One option is to use the Hadoop HDFS Java API. Assuming you are using Maven, you would include hadoop-common in your pom.xml:

<dependency> 
    <groupId>org.apache.hadoop</groupId> 
    <artifactId>hadoop-common</artifactId> 
    <version>2.6.0.2.2.0.0-2041</version> 
</dependency> 

Then, in your spout implementation, you would use an HDFS FileSystem object. For example, here is some pseudocode for emitting each line of the file as a string:

import java.io.BufferedReader; 
import java.io.InputStreamReader; 

import org.apache.hadoop.conf.Configuration; 
import org.apache.hadoop.fs.FileSystem; 
import org.apache.hadoop.fs.Path; 

import backtype.storm.tuple.Values; 

@Override 
public void nextTuple() { 
    try { 
     // location of the file to stream; adjust host, port, and path for your cluster 
     Path pt = new Path("hdfs://servername:8020/user/hdfs/file.txt"); 
     FileSystem fs = FileSystem.get(new Configuration()); 
     // try-with-resources ensures the reader is closed even if emitting fails 
     try (BufferedReader br = new BufferedReader(new InputStreamReader(fs.open(pt)))) { 
      String line = br.readLine(); 
      while (line != null) { 
       // emit the line which was read from the HDFS file; 
       // _collector is a private member variable of type SpoutOutputCollector set in the open method 
       _collector.emit(new Values(line)); 
       line = br.readLine(); 
      } 
     } 
    } catch (Exception e) { 
     _collector.reportError(e); 
     LOG.error("HDFS spout error", e); 
    } 
} 
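
For context, here is a minimal sketch of the surrounding spout class, showing where _collector gets set. It assumes the pre-1.0 backtype.storm packages that match the Hadoop version above; the class name HdfsLineSpout and the output field name "line" are illustrative.

import java.util.Map; 

import org.slf4j.Logger; 
import org.slf4j.LoggerFactory; 

import backtype.storm.spout.SpoutOutputCollector; 
import backtype.storm.task.TopologyContext; 
import backtype.storm.topology.OutputFieldsDeclarer; 
import backtype.storm.topology.base.BaseRichSpout; 
import backtype.storm.tuple.Fields; 

public class HdfsLineSpout extends BaseRichSpout { 
    private static final Logger LOG = LoggerFactory.getLogger(HdfsLineSpout.class); 
    private SpoutOutputCollector _collector; 

    @Override 
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) { 
     // Storm calls open once per task; keep the collector for use in nextTuple 
     _collector = collector; 
    } 

    @Override 
    public void declareOutputFields(OutputFieldsDeclarer declarer) { 
     // one output field, matching the single-value Values(line) emitted above 
     declarer.declare(new Fields("line")); 
    } 

    // ... plus the nextTuple() method shown above ... 
} 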

Thanks Kit! This is indeed the solution for streaming tuples one by one from a single file. What about a spout that emits tuples in batches (i.e., Storm Trident)? – florins


@florins I haven't tried Trident myself, but it looks like you would implement [IBatchSpout](https://nathanmarz.github.io/storm/doc/storm/trident/spout/IBatchSpout.html), and your code would then go in emitBatch instead of nextTuple. –
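
For anyone landing here later: a minimal, untested sketch of what that could look like against the pre-1.0 storm.trident.spout.IBatchSpout interface. The class name, BATCH_SIZE, and the file path are illustrative assumptions, not part of the original answer.

import java.io.BufferedReader; 
import java.io.InputStreamReader; 
import java.util.Map; 

import org.apache.hadoop.conf.Configuration; 
import org.apache.hadoop.fs.FileSystem; 
import org.apache.hadoop.fs.Path; 

import backtype.storm.task.TopologyContext; 
import backtype.storm.tuple.Fields; 
import backtype.storm.tuple.Values; 
import storm.trident.operation.TridentCollector; 
import storm.trident.spout.IBatchSpout; 

public class HdfsBatchSpout implements IBatchSpout { 
    private static final int BATCH_SIZE = 100; // tuples per batch; tune as needed 
    private transient BufferedReader reader; 

    @Override 
    public void open(Map conf, TopologyContext context) { 
     try { 
      Path pt = new Path("hdfs://servername:8020/user/hdfs/file.txt"); 
      FileSystem fs = FileSystem.get(new Configuration()); 
      reader = new BufferedReader(new InputStreamReader(fs.open(pt))); 
     } catch (Exception e) { 
      throw new RuntimeException("Failed to open HDFS file", e); 
     } 
    } 

    @Override 
    public void emitBatch(long batchId, TridentCollector collector) { 
     try { 
      String line; 
      // emit up to BATCH_SIZE lines per batch; an empty batch means end of file 
      for (int i = 0; i < BATCH_SIZE && (line = reader.readLine()) != null; i++) { 
       collector.emit(new Values(line)); 
      } 
     } catch (Exception e) { 
      collector.reportError(e); 
     } 
    } 

    @Override 
    public void ack(long batchId) { 
     // a replayable spout would keep each batch until acked; omitted in this sketch 
    } 

    @Override 
    public void close() { 
     try { if (reader != null) reader.close(); } catch (Exception ignored) {} 
    } 

    @Override 
    public Map getComponentConfiguration() { 
     return null; 
    } 

    @Override 
    public Fields getOutputFields() { 
     return new Fields("line"); 
    } 
} 

Note that a proper Trident spout should be able to re-emit the same tuples when a batchId is replayed; this sketch just reads forward and is only meant to show where the HDFS reading code goes.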