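Use Spark to list all files in a Hadoop HDFS directory?

I want to loop through all text files in a Hadoop directory and count all occurrences of the word "error". Is there a way to do the equivalent of hadoop fs -ls /users/ubuntu/ and list all the files in a directory with the Apache Spark Scala API?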

From the given first example, the Spark context seems to access files only one at a time, through something like:

val file = spark.textFile("hdfs://target_load_file.txt")

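In my case, I know neither how many files are in the HDFS folder nor their names beforehand. I looked at the Spark context docs but could not find this kind of functionality.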

2014-04-28 poliu2s

You can use a wildcard:

val errorCount = sc.textFile("hdfs://some-directory/*") 
        .flatMap(_.split(" ")).filter(_ == "error").count
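
To check which files a wildcard actually expands to before counting, the Hadoop FileSystem API can be queried directly. This is only a sketch: the object GlobListing and the helper listMatches are names made up for illustration, and it assumes hadoop-common is on the classpath (it is inside spark-shell).

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

object GlobListing {
  // Hypothetical helper: returns the paths that a glob pattern expands to.
  def listMatches(pattern: String, conf: Configuration): Seq[String] = {
    val path = new Path(pattern)
    // Resolve the file system from the path itself so its scheme (hdfs://, file://, ...) is honoured.
    val fs = path.getFileSystem(conf)
    // globStatus may return null when nothing matches, so guard with Option.
    Option(fs.globStatus(path))
      .map(_.toSeq)
      .getOrElse(Seq.empty)
      .map(_.getPath.toString)
  }
}
```

In spark-shell this could be called as GlobListing.listMatches("hdfs://some-directory/*", sc.hadoopConfiguration).foreach(println), where sc is the existing SparkContext.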

2014-04-30 12:48:52

What if I want to report the names of the files in which the error occurred?

Use 'sc.wholeTextFiles'. See http://stackoverflow.com/questions/29521665/how-to-map-filenames-to-rdd-using-sc-textfiles3n-bucket-csv, which is almost the same question.
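
A rough sketch of how sc.wholeTextFiles might be used for this follows. The object ErrorReport and both helpers are hypothetical names, and matching on whitespace-separated tokens is an assumption about what counts as an occurrence. Note that wholeTextFiles reads each file into memory as a single string, so it suits many small files better than a few very large ones.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object ErrorReport {
  // Hypothetical helper: counts whitespace-separated tokens exactly equal to "error".
  def countErrors(content: String): Int =
    content.split("\\s+").count(_ == "error")

  // Hypothetical helper: pairs each file path with its count and
  // keeps only the files that contain at least one match.
  def filesWithErrors(sc: SparkContext, dir: String): RDD[(String, Int)] =
    sc.wholeTextFiles(dir) // RDD of (file path, entire file content)
      .map { case (path, content) => (path, countErrors(content)) }
      .filter { case (_, n) => n > 0 }
}
```

With an existing SparkContext sc, ErrorReport.filesWithErrors(sc, "hdfs://some-directory/*").collect().foreach(println) would print one (path, count) pair per affected file.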

import org.apache.hadoop.fs.{FileSystem, Path}
import scala.collection.mutable.{ListBuffer, Stack}

// File system handle built from the Hadoop configuration of the SparkContext sc
val fs = FileSystem.get(sc.hadoopConfiguration)
val dirs = Stack[String]()            // directories still to visit
val files = ListBuffer.empty[String]  // files found so far

dirs.push("/user/username/")

// Depth-first walk: pop a directory, queue its subdirectories, record its files
while (dirs.nonEmpty) {
  val status = fs.listStatus(new Path(dirs.pop()))
  status.foreach { x =>
    if (x.isDirectory) dirs.push(x.getPath.toString)
    else files += x.getPath.toString
  }
}
files.foreach(println)
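
The manual stack walk above could probably be shortened with FileSystem.listFiles, which takes a recursive flag and returns an iterator over files only. The sketch below assumes a Hadoop 2.x or later client; RecursiveListing and listAllFiles are hypothetical names used for illustration.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import scala.collection.mutable.ListBuffer

object RecursiveListing {
  // Hypothetical helper: collects every file under root, descending into subdirectories.
  def listAllFiles(root: String, conf: Configuration): List[String] = {
    val path = new Path(root)
    val fs = path.getFileSystem(conf)
    // The second argument (true) asks Hadoop to recurse; directories themselves are not returned.
    val it = fs.listFiles(path, true)
    val files = ListBuffer.empty[String]
    while (it.hasNext) files += it.next().getPath.toString
    files.toList
  }
}
```

As far as I know, sc.textFile also accepts a comma-separated list of paths, so the result could be fed back to Spark with sc.textFile(RecursiveListing.listAllFiles("/user/username/", sc.hadoopConfiguration).mkString(",")).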

2017-05-17 18:39:49
