Merging large datasets with SparkR

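I am wondering whether SparkR makes it easier to merge large datasets than "regular R". I have 12 CSV files, each roughly 500,000 rows by 40 columns. The files are monthly data for 2014, and I want to produce a single file for the year. All files have the same column labels, and I want to merge by the first column (year). However, some files have more rows than others.

When I run the code below: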

setwd("C:\\Users\\Anonymous\\Desktop\\Data 2014")

file_list <- list.files()

for (file in file_list) {

  # if the merged dataset doesn't exist, create it
  if (!exists("dataset")) {
    dataset <- read.table(file, header=TRUE, sep="\t")
  }

  # if the merged dataset does exist, append to it
  if (exists("dataset")) {
    temp_dataset <- read.table(file, header=TRUE, sep="\t")
    dataset <- rbind(dataset, temp_dataset)
    rm(temp_dataset)
  }

}

R crashes.

When I run this code:

library(SparkR) 
library(magrittr) 
# setwd("C:\\Users\\Anonymous\\Desktop\\Data 2014\\Jan2014.csv") 
sc <- sparkR.init(master = "local") 
sqlContext <- sparkRSQL.init(sc) 

Jan2014_file_path <- file.path('Jan2014.csv') 

system.time(
housing_a_df <- read.df(sqlContext, 
         "C:\\Users\\Anonymous\\Desktop\\Data  2014\\Jan2014.csv", 
         header='true', 
         inferSchema='false') 
)

I get the following error:

Error in invokeJava(isStatic = TRUE, className, methodName, ...) : 
    org.apache.spark.SparkException: Job aborted due to stage failure: Task 0  in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost):

So what would be a simple way to merge these files in SparkR?

Source

2016-01-12 user21478

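Have you read [this](http://stackoverflow.com/questions/23169645/r-3-0-3-rbind-multiple-csv-files)? In the first snippet, are all the files in 'file_list' CSV files? –

You say you want to "merge by the first column", but in your example code you concatenate the rows from the different files. The answer below (at the time of writing) treats merge as a join, not as concatenation. – kasterma

Does any of the answers below answer your question? If so, please accept it. That may help other developers. – sag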
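The distinction in this comment can be illustrated with a minimal base R sketch. The two data frames and their `year` and `value` columns are made-up stand-ins for the monthly files, not the asker's data:

```r
# Two tiny stand-ins for monthly files (hypothetical columns and values).
jan <- data.frame(year = c(2014, 2014), value = c(10, 20))
feb <- data.frame(year = c(2014, 2014, 2014), value = c(30, 40, 50))

# Concatenation: stack the rows of both files; the columns stay the same.
stacked <- rbind(jan, feb)
dim(stacked)  # 5 rows, 2 columns

# Join on year: every January row is paired with every February row
# that has the same year, and the non-key columns sit side by side.
joined <- merge(jan, feb, by = "year", suffixes = c("_jan", "_feb"))
dim(joined)   # 6 rows, 3 columns
```

If every row of every file carries the same year, a join on `year` pairs each row of one file with each row of the other, so the result grows multiplicatively. Stacking the rows is probably what is wanted when the goal is one file for the whole year.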

You should read the CSV files in the following way. Reference: https://gist.github.com/shivaram/d0cd4aa5c4381edd6f85

# Launch SparkR using
# ./bin/sparkR --packages com.databricks:spark-csv_2.10:1.0.3

# The SparkSQL context should already be created for you as sqlContext
sqlContext
# Java ref type org.apache.spark.sql.SQLContext id 1

# Load the local CSV files using `read.df`. Note that we use the CSV reader Spark package here.
Jan2014 <- read.df(sqlContext, "C:/Users/Anonymous/Desktop/Data 2014/Jan2014.csv", "com.databricks.spark.csv", header="true")

Feb2014 <- read.df(sqlContext, "C:/Users/Anonymous/Desktop/Data 2014/Feb2014.csv", "com.databricks.spark.csv", header="true")

# Merging/joining by year.
# Rename the year column of Feb2014 to year1 first; otherwise the column name is duplicated in the join result.
Feb2014 <- withColumnRenamed(Feb2014, "year", "year1")

# "left_outer" keeps all rows of Jan2014 plus the matching rows of Feb2014; change the join type to suit your requirement.
jan_feb_2014 <- join(Jan2014, Feb2014, joinExpr = Jan2014$year == Feb2014$year1, joinType = "left_outer")

# After joining, you can remove the helper column with: jan_feb_2014$year1 <- NULL

This is how you can join the files one at a time.

Source

2016-01-12 04:56:42

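Doesn't a join add the columns of one data frame to the other? Since he wants to merge the two CSV files, I think a join may not suit him. – sag

He wants to merge on the first column, "year", so I used a join. Maybe he wants all the months as columns. @SamuelAlexander –

Once the files have been read as DataFrames, you can use SparkR's unionAll to merge them into a single DataFrame. You can then write the result out as a single CSV file.

Sample code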

df1 <- read.df(sqlContext, "/home/user/tmp/test1.csv", source = "com.databricks.spark.csv")
df2 <- read.df(sqlContext, "/home/user/tmp/test2.csv", source = "com.databricks.spark.csv")
mergedDF <- unionAll(df1, df2)
write.df(mergedDF, "merged.csv", "com.databricks.spark.csv", "overwrite")

I have tested and used this, though not against data of your size. I hope it helps.

Source
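For twelve monthly files rather than two, the same idea can be folded over a list. The sketch below uses plain R data frames and `rbind` so that it runs without Spark. The temporary folder, the file names, and the `year`, `month`, and `value` columns are invented for illustration. It also reads each file exactly once, whereas the loop in the question appends the first file twice because both `if` blocks run on the first iteration.

```r
# Create three small tab-separated files in a temporary folder
# (hypothetical stand-ins for the monthly files).
data_dir <- file.path(tempdir(), "monthly_demo")
dir.create(data_dir, showWarnings = FALSE)
for (m in c("Jan", "Feb", "Mar")) {
  part <- data.frame(year = 2014, month = m, value = 1:2)
  write.table(part, file.path(data_dir, paste0(m, "2014.csv")),
              sep = "\t", row.names = FALSE, quote = FALSE)
}

# Read each file exactly once into a list of data frames.
file_list <- list.files(data_dir, pattern = "\\.csv$", full.names = TRUE)
frames <- lapply(file_list, read.table, header = TRUE, sep = "\t",
                 stringsAsFactors = FALSE)

# Fold the list into one data frame by stacking rows pairwise.
merged <- Reduce(rbind, frames)
nrow(merged)  # 6
```

With SparkR DataFrames read through `read.df`, the same fold should work with `unionAll` in place of `rbind`, that is `Reduce(unionAll, frames)`. That variant is an assumption and has not been run here.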

2016-01-12 05:12:07 sag
