Reading multiple JSON files from Spark

I have a list of JSON files that I want to load in parallel.

I can't use read.json("*") because the files are not in the same folder and there is no specific pattern I could use to match them.

I tried sc.parallelize(fileList).select(hiveContext.read.json) but, as expected, the Hive context does not exist on the executors.

Any ideas?

2016-04-25 Roman Kagan

It looks like I found a solution:

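// textFile accepts a comma-separated list of paths
val text = sc.textFile("file1,file2....")
// parse each line of text as one JSON record
val df = sqlContext.read.json(text)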
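
If the paths are already held in a Scala collection, as in the question, the comma-separated argument can be built with mkString. This is only a minimal sketch, assuming the Spark 1.x API used in this thread (an existing SparkContext and SQLContext); the object name ReadJsonAsText and the helpers commaSeparated and readJsonFiles are hypothetical names made up for illustration:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.{DataFrame, SQLContext}

object ReadJsonAsText {
  // Hypothetical helper: join the paths into the comma-separated form
  // that sc.textFile accepts. Paths that contain commas would not work.
  def commaSeparated(paths: Seq[String]): String = paths.mkString(",")

  // Read all files as one RDD of text lines, then parse each line as a JSON record.
  def readJsonFiles(sc: SparkContext, sqlContext: SQLContext, paths: Seq[String]): DataFrame = {
    val text = sc.textFile(commaSeparated(paths))
    sqlContext.read.json(text)
  }
}
```

Because textFile splits its input into lines, this approach assumes each file holds one JSON object per line, as in the 1.json and 2.json examples in this thread.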


2016-04-25 08:58:12

In addition, you can pass a directory as the argument:

cat 1.json 
{"x": 1.0, "y": 2.0} 
{"x": 1.5, "y": 1.0} 
sudo -u hdfs hdfs dfs -put 1.json /tmp/test 

cat 2.json 
{"x": 3.0, "y": 4.0} 
{"x": 1.8, "y": 7.0} 
sudo -u hdfs hdfs dfs -put 2.json /tmp/test 

sqlContext.read.json("/tmp/test").show() 
+---+---+ 
|  x|  y|
+---+---+ 
|1.0|2.0| 
|1.5|1.0| 
|3.0|4.0| 
|1.8|7.0| 
+---+---+


2016-04-25 10:25:08

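Certainly, but that only works when all the files are in the same directory. If we have multiple files in different directories, the only way to read them in parallel is as text files.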
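
The thread uses the Spark 1.x sqlContext API. On Spark 2.0 or later, DataFrameReader.json also accepts several paths at once, which may make the text-file detour unnecessary. The following is a sketch under that assumption, not something proposed in the thread; the object name ReadJsonFromManyDirs and the method readAll are hypothetical:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ReadJsonFromManyDirs {
  // Assumes Spark 2.0+, where read.json takes a variable number of paths.
  // The paths may point to files in unrelated directories.
  def readAll(spark: SparkSession, paths: Seq[String]): DataFrame =
    spark.read.json(paths: _*)
}
```

A call such as ReadJsonFromManyDirs.readAll(spark, fileList) would then return a single DataFrame covering every listed file, with the schema inferred across all of them.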
