Joining DataFrames in Spark with Java

First of all, thank you for reading my question.

My problem is as follows: in Spark with Java, I load the data from two CSV files into two DataFrames.

These DataFrames will contain the following information.

DataFrame airport

Id | Name | City 
----------------------- 
1 | Barajas | Madrid

DataFrame airport_city_state

City | state 
---------------- 
Madrid | España

I want to join these two DataFrames so that the result looks like this:

DataFrame result

Id | Name | City | state 
-------------------------- 
1 | Barajas | Madrid | España

where dfairport.city = dfairport_city_state.city

But I can't work out the syntax to perform the join correctly. Here is some of the code showing how I created the variables:

// Load the CSV files; you have to specify that they have a header and which delimiter they use
Dataset<Row> dfairport = Load.Csv(sqlContext, data_airport);
Dataset<Row> dfairport_city_state = Load.Csv(sqlContext, data_airport_city_state);

// Change the names of the columns in the CSV DataFrames to match the columns in the database
// Once the names match we can insert them
// Note: withColumnRenamed returns a new Dataset; the results are not assigned here,
// so dfairport and dfairport_city_state keep their original column names
dfairport
    .withColumnRenamed("leg_key", "id")
    .withColumnRenamed("leg_name", "name")
    .withColumnRenamed("leg_city", "city");

dfairport_city_state
    .withColumnRenamed("city", "ciudad")
    .withColumnRenamed("state", "estado");

Source

2017-03-26 Alejandro Reina

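First of all, thank you very much for your replies.

I tried both of the suggested solutions, but neither of them worked. I get the following error: The method dfairport_city_state(String) is undefined for the type ETL_Airport

I can't access a specific column of the DataFrame to use in the join.

Edit: I have now got the join working. I'm posting it here in case it helps someone else.

Thanks for everything, and regards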

// Join the tables on the city they share
Dataset<Row> joined = dfairport.join(dfairport_city_state, dfairport.col("leg_city").equalTo(dfairport_city_state.col("city")));


Source

2017-03-27 10:26:41

You can use the join method with a column name to join the two DataFrames, for example:

Dataset<Row> dfairport = Load.Csv(sqlContext, data_airport);
Dataset<Row> dfairport_city_state = Load.Csv(sqlContext, data_airport_city_state);

// Assumes both DataFrames have a column named "City"
Dataset<Row> joined = dfairport.join(dfairport_city_state, "City");

There is also an overloaded version that lets you specify the join type as the third argument, for example:

Dataset<Row> joined = dfairport.join(dfairport_city_state, dfairport.col("City").equalTo(dfairport_city_state.col("City")), "left_outer");

More on joins can be found here.
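
To make the difference between the default inner join and "left_outer" concrete, here is a minimal plain-Java sketch that does not use Spark at all. It only illustrates what each join type keeps. The class name JoinTypesSketch, the join helper, and the second airport row are invented for illustration, and it assumes each city appears at most once in the city/state data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration of join semantics; this is NOT Spark code.
class JoinTypesSketch {

    // Joins airport rows (id, name, city) with a city -> state lookup.
    // keepUnmatched = false: airports without a matching city are dropped (inner join).
    // keepUnmatched = true: they are kept with a null state (left outer join).
    static List<List<String>> join(List<List<String>> airports,
                                   Map<String, String> cityToState,
                                   boolean keepUnmatched) {
        List<List<String>> result = new ArrayList<>();
        for (List<String> airport : airports) {
            String city = airport.get(2);
            boolean matched = cityToState.containsKey(city);
            if (matched || keepUnmatched) {
                List<String> row = new ArrayList<>(airport);
                row.add(cityToState.get(city)); // null when there is no match
                result.add(row);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<String>> airports = Arrays.asList(
                Arrays.asList("1", "Barajas", "Madrid"),
                // Invented row (assumption) whose city has no entry in the lookup
                Arrays.asList("2", "Hypothetical", "Atlantis"));
        Map<String, String> cityToState = new HashMap<>();
        cityToState.put("Madrid", "España");

        // Inner join: prints [[1, Barajas, Madrid, España]]
        System.out.println(join(airports, cityToState, false));
        // Left outer join: prints [[1, Barajas, Madrid, España], [2, Hypothetical, Atlantis, null]]
        System.out.println(join(airports, cityToState, true));
    }
}
```

With the sample data from the question both join types give the same single row, because Madrid has a match. The two only differ when an airport's city is missing from the city/state data.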

Source

2017-03-26 20:07:13
