2017-03-07 53 views
-1

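I want to join two DataFrames, matching the portfolio code in df1 to df2.portId. In the resulting DataFrame I don't want the same key repeated. Spark fails to join the DataFrames when I use a left outer join.

Here is my code so far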

val df = spark.read.json("C:\\json\\portmast") 
val pgetsec = spark.read.json("C:\\json\\pgetsec") 


val portfolio_master = df.select("PortfolioCode","Legal Entity Name","Asofdate") 
val pgetsecs= pgetsec.select("TransId", "SecId","portId","GaapCurBkBal","ParBal","SetlDt","SetlPric","OrgBkBal","TradeDt","StatCurBkBal","NaicRtg","SecurityTypeCode","CamraSecType","FundType","CountryIso") 
val pg = portfolio_master.join(pgetsec,Seq("PortfolioCode","portId"),"left_outer") 

The error I get is
Exception in thread "main" org.apache.spark.sql.AnalysisException: using columns ['PortfolioCode,'portId] can not be resolved given input columns:

The final JSON should look like this

|-- Portfolio Code: string (nullable = true) 
|-- Legal Entity Name: string (nullable = true) 
|-- Asofdate: string (nullable = true) 

((SI, S&P 500 Index,9/30/2016),[0.0,Equity,Common Stock]) 
((SI, S&P 500 Index,9/30/2016),[0.0,Equity,Common Stock]) 
((SI, S&P 500 Index,9/30/2016),[0.0,Equity,Common Stock]) 
[SI1, S&P 500 Index,9/30/2016,CompactBuffer([0.0,Equity,Common  Stock], [0.0,Equity,Common Stock], [0.0,Equity,Common Stock])] 
root 
|-- Portfolio Code: string (nullable = true) 
|-- Legal Entity Name: string (nullable = true) 
|-- Asofdate: string (nullable = true) 
|-- Security: array (nullable = true) 
| |-- element: struct (containsNull = true) 
| | |-- BondPrice: double (nullable = true) 
| | |-- CoreSectorLevel1Code: string (nullable = true) 
| | |-- CoreSectorLevel2Code: string (nullable = true) 

+--------------+-------------------+---------+--------------------+ 
|Portfolio Code| Legal Entity Name| Asofdate|   Security| 
+--------------+-------------------+---------+--------------------+ 
|   SI | S&P 500 Index  |9/30/2016|[[0.0,Equity,Comm...| 
+--------------+-------------------+---------+--------------------+ 

Any help is appreciated.

+0

The columns you are trying to join on don't exist in the second DataFrame? – eliasah

+0

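Both are JSON files. In file 1 the key is called PortfolioCode, and in the second JSON file it is called portId. The data behind the keys is the same. I want to do something like this: select p.portfoliocode, ..., ps.secid, ps.Transid, ... from portfolio_master p join pgetsec ps on p.portfoliocode = ps.portid – user2315840

Answers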

1

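portId does not exist in portfolio_master, and PortfolioCode does not exist in pgetsec. A join on Seq("PortfolioCode","portId") requires every listed column to exist in both DataFrames. If you re-read the complete error message, you will see that it explains this, since it also lists the available columns.

What you want is portfolio_master("PortfolioCode") === pgetsec("portId") as your join condition.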
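As a rough illustration of that join condition, here is a minimal sketch. It assumes Spark 2.x running locally; the object name `JoinOnCondition`, the helper `joinOnCondition`, and the in-memory sample rows are hypothetical stand-ins for the two JSON inputs, and the key column is assumed to be named `PortfolioCode` as in the question's `select`.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object JoinOnCondition {
  // Left outer join on differently named key columns, then drop the
  // right-hand key so the key value appears only once per row.
  def joinOnCondition(portfolioMaster: DataFrame, pgetsec: DataFrame): DataFrame =
    portfolioMaster
      .join(pgetsec, portfolioMaster("PortfolioCode") === pgetsec("portId"), "left_outer")
      .drop(pgetsec("portId"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("join-on-condition").getOrCreate()
    import spark.implicits._

    // Hypothetical sample data; the real code would use spark.read.json(...)
    val portfolioMaster = Seq(("SI", "S&P 500 Index", "9/30/2016"))
      .toDF("PortfolioCode", "Legal Entity Name", "Asofdate")
    val pgetsec = Seq(("T1", "SEC1", "SI"), ("T2", "SEC2", "SI"))
      .toDF("TransId", "SecId", "portId")

    joinOnCondition(portfolioMaster, pgetsec).show(false)
    spark.stop()
  }
}
```

Dropping `pgetsec("portId")` by column reference rather than by name keeps the intent explicit if both sides ever share other column names.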

+0

I want to create the resulting JSON so that the key information is not repeated for every record, so it looks like this – user2315840

+0

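Then you can either drop one of the key columns after the join, or rename them before the join so that they match and join with a column list as you did before. – puhlen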
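The rename-before-join option could look roughly like the sketch below. The object name `JoinAfterRename`, the helper `joinAfterRename`, and the sample rows are hypothetical; only `withColumnRenamed` and the `join(right, usingColumns, joinType)` overload are standard Spark API.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object JoinAfterRename {
  // Rename portId so both sides share one key name, then join on that name.
  // A join on a column list keeps a single copy of the key in the result.
  def joinAfterRename(portfolioMaster: DataFrame, pgetsec: DataFrame): DataFrame = {
    val renamed = pgetsec.withColumnRenamed("portId", "PortfolioCode")
    portfolioMaster.join(renamed, Seq("PortfolioCode"), "left_outer")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("join-after-rename").getOrCreate()
    import spark.implicits._

    // Hypothetical sample data standing in for the two JSON files
    val portfolioMaster = Seq(("SI", "S&P 500 Index", "9/30/2016"))
      .toDF("PortfolioCode", "Legal Entity Name", "Asofdate")
    val pgetsec = Seq(("T1", "SEC1", "SI"), ("T2", "SEC2", "SI"))
      .toDF("TransId", "SecId", "portId")

    joinAfterRename(portfolioMaster, pgetsec).printSchema()
    spark.stop()
  }
}
```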

+0

Updated the question to show the final JSON schema and how the data should look – user2315840
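One possible way to get the nested shape shown in the question (one row per portfolio with a `Security` array) is to group the joined rows and collect the security fields into a list of structs. This is only a sketch and is not confirmed anywhere in the thread: the object name `NestSecurities`, the helper `nestSecurities`, and the flat sample input are hypothetical, and the key column is assumed to be named `PortfolioCode`.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{collect_list, struct}

object NestSecurities {
  // Collapse one row per security into one row per portfolio,
  // gathering the security fields into an array of structs.
  def nestSecurities(joined: DataFrame): DataFrame =
    joined
      .groupBy("PortfolioCode", "Legal Entity Name", "Asofdate")
      .agg(collect_list(struct("BondPrice", "CoreSectorLevel1Code", "CoreSectorLevel2Code")).as("Security"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("nest-securities").getOrCreate()
    import spark.implicits._

    // Hypothetical flat input, shaped like an already joined DataFrame
    val joined = Seq(
      ("SI", "S&P 500 Index", "9/30/2016", 0.0, "Equity", "Common Stock"),
      ("SI", "S&P 500 Index", "9/30/2016", 0.0, "Equity", "Common Stock"),
      ("SI", "S&P 500 Index", "9/30/2016", 0.0, "Equity", "Common Stock")
    ).toDF("PortfolioCode", "Legal Entity Name", "Asofdate",
      "BondPrice", "CoreSectorLevel1Code", "CoreSectorLevel2Code")

    val nested = nestSecurities(joined)
    nested.printSchema()
    nested.show(false)
    spark.stop()
  }
}
```

The nested result can then be written out with `nested.write.json(...)`, which emits each portfolio's key fields once alongside its `Security` array.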