One option is to use the Hadoop HDFS Java API. Assuming you are using Maven, you would include hadoop-common in your pom.xml:
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>2.6.0.2.2.0.0-2041</version>
</dependency>
Then, in your spout implementation, you would use an HDFS FileSystem object. For example, here is some pseudocode for emitting each line of the file as a string:
@Override
public void nextTuple() {
    // Requires org.apache.hadoop.conf.Configuration, org.apache.hadoop.fs.FileSystem,
    // org.apache.hadoop.fs.Path, java.io.BufferedReader and java.io.InputStreamReader
    try {
        Path pt = new Path("hdfs://servername:8020/user/hdfs/file.txt");
        FileSystem fs = FileSystem.get(new Configuration());
        BufferedReader br = new BufferedReader(new InputStreamReader(fs.open(pt)));
        String line = br.readLine();
        while (line != null) {
            // emit the line which was read from the HDFS file;
            // _collector is a private member variable of type SpoutOutputCollector set in the open method
            _collector.emit(new Values(line));
            line = br.readLine();
        }
        br.close();
    } catch (Exception e) {
        _collector.reportError(e);
        LOG.error("HDFS spout error", e);
    }
}
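Note that, as written, every call to nextTuple() reopens the file and reads it from the start, so Storm would emit the same lines repeatedly; a real spout would open the reader once in open() and keep its position across calls. For completeness, here is a minimal sketch of the spout class this method would live in (a sketch only, assuming the pre-1.0 backtype.storm package names that match a Hadoop release of this vintage; the class name is illustrative):

import java.util.Map;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;

// Illustrative class name; only the Storm lifecycle wiring is shown here.
public class HdfsLineSpout extends BaseRichSpout {
    private static final Logger LOG = LoggerFactory.getLogger(HdfsLineSpout.class);
    private SpoutOutputCollector _collector;

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        // Storm calls open() once per task; keep the collector for nextTuple()
        _collector = collector;
    }

    @Override
    public void nextTuple() {
        // the nextTuple() body from the answer above goes here
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // one output field per emitted line
        declarer.declare(new Fields("line"));
    }
}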
Thanks Kit! This is indeed the solution for streaming tuples one at a time from a single file. What about a spout for batched tuples (i.e. Storm Trident)? – florins
@florins I haven't tried Trident myself, but it looks like you would implement [IBatchSpout](https://nathanmarz.github.io/storm/doc/storm/trident/spout/IBatchSpout.html), and your code would then go in emitBatch instead of nextTuple. –
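To make that suggestion concrete, here is a rough, untested sketch of an IBatchSpout over the same HDFS file (pre-1.0 storm.trident package names; the class name and batch size are assumptions). For proper replay semantics a real implementation would remember each batch until it is acked and re-emit it when emitBatch is called again with the same batchId; this sketch skips that:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import backtype.storm.task.TopologyContext;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;
import storm.trident.operation.TridentCollector;
import storm.trident.spout.IBatchSpout;

// Illustrative: emits the HDFS file in fixed-size batches of lines.
public class HdfsBatchSpout implements IBatchSpout {
    private static final int BATCH_SIZE = 100; // assumed batch size
    private transient BufferedReader br;

    @Override
    public void open(Map conf, TopologyContext context) {
        try {
            // open the reader once per task, not once per batch
            Path pt = new Path("hdfs://servername:8020/user/hdfs/file.txt");
            FileSystem fs = FileSystem.get(new Configuration());
            br = new BufferedReader(new InputStreamReader(fs.open(pt)));
        } catch (Exception e) {
            throw new RuntimeException("failed to open HDFS file", e);
        }
    }

    @Override
    public void emitBatch(long batchId, TridentCollector collector) {
        // emit up to BATCH_SIZE lines per batch; an empty batch means end of file
        try {
            String line;
            for (int i = 0; i < BATCH_SIZE && (line = br.readLine()) != null; i++) {
                collector.emit(new Values(line));
            }
        } catch (Exception e) {
            collector.reportError(e);
        }
    }

    @Override
    public void ack(long batchId) {
        // nothing buffered for replay in this sketch
    }

    @Override
    public void close() {
        try {
            if (br != null) br.close();
        } catch (Exception ignored) {
        }
    }

    @Override
    public Map getComponentConfiguration() {
        return null;
    }

    @Override
    public Fields getOutputFields() {
        return new Fields("line");
    }
}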