
SPARK SQL: get month from week number and year

I have a DataFrame with "Week" & "Year" columns and need to calculate the month from them, as follows:

Input:

+----+----+
|Week|Year|
+----+----+
|  50|2012|
|  50|2012|
|  50|2012|
+----+----+

Expected output:

+----+----+-----+
|Week|Year|Month|
+----+----+-----+
|  50|2012|   12|
|  50|2012|   12|
|  50|2012|   12|
+----+----+-----+

Any help would be appreciated. Thanks.


What about weeks that span 2 months? Isn't a week a weak variable to derive a month from? –

Answer


Thanks to @zero323, who pointed me towards a sqlContext.sql query; I converted that query into the DataFrame code shown below:

import org.apache.spark.SparkConf; 
import org.apache.spark.api.java.JavaRDD; 
import org.apache.spark.api.java.JavaSparkContext; 
import org.apache.spark.sql.DataFrame; 
import org.apache.spark.sql.Row; 
import org.apache.spark.sql.RowFactory; 
import org.apache.spark.sql.SQLContext; 
import org.apache.spark.sql.types.DataTypes; 
import org.apache.spark.sql.types.StructField; 
import org.apache.spark.sql.types.StructType; 

import java.util.ArrayList; 
import java.util.Arrays; 
import java.util.List; 

import static org.apache.spark.sql.functions.*; 

public class MonthFromWeekSparkSQL { 

    public static void main(String[] args) { 

     SparkConf conf = new SparkConf().setAppName("MonthFromWeekSparkSQL").setMaster("local"); 
     JavaSparkContext sc = new JavaSparkContext(conf); 
     SQLContext sqlContext = new org.apache.spark.sql.SQLContext(sc); 

     List<Row> myList = Arrays.asList(RowFactory.create(50, 2012), RowFactory.create(50, 2012), RowFactory.create(50, 2012)); 
     JavaRDD<Row> myRDD = sc.parallelize(myList); 

     List<StructField> structFields = new ArrayList<StructField>(); 

     // Create StructFields 
     StructField structField1 = DataTypes.createStructField("week", DataTypes.IntegerType, true); 
     StructField structField2 = DataTypes.createStructField("year", DataTypes.IntegerType, true); 

     // Add StructFields into list 
     structFields.add(structField1); 
     structFields.add(structField2); 

     // Create StructType from StructFields. This will be used to create DataFrame 
     StructType schema = DataTypes.createStructType(structFields); 

     DataFrame df = sqlContext.createDataFrame(myRDD, schema); 
     DataFrame df2 = df.withColumn("yearAndWeek", concat(col("year"), lit(" "), col("week"))) 
       .withColumn("month", month(unix_timestamp(col("yearAndWeek"), "yyyy w").cast("timestamp"))) 
       .drop("yearAndWeek"); 

     df2.show(); 

    } 

} 

You are actually creating a new column with the year and week formatted as "yyyy w", then converting it with unix_timestamp into a timestamp from which the month can be extracted.
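For reference, the same logic can also be expressed as a single sqlContext.sql query, which is the direction the original hint pointed in. This is only a sketch under the same assumptions as the code above (Spark 1.5+); the temporary table name "weeks" and the variable name dfSql are my own and not from the original:

     // Register the DataFrame as a temporary table so it can be queried with SQL 
     df.registerTempTable("weeks"); 

     // Concatenate year and week as "yyyy w", parse it to epoch seconds, 
     // cast to timestamp and take the month of the resulting date 
     DataFrame dfSql = sqlContext.sql( 
       "SELECT week, year, " + 
       "month(CAST(unix_timestamp(concat(year, ' ', week), 'yyyy w') AS TIMESTAMP)) AS month " + 
       "FROM weeks"); 

     dfSql.show(); 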

PS: It seems that the cast behaviour is not correct in Spark 1.5; in that case it is safer to do .cast("double").cast("timestamp") instead.
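Applied to the DataFrame code above, that workaround is just a different cast in the withColumn call; a minimal sketch:

     // Same transformation as before, but casting through double first to work 
     // around the Spark 1.5 timestamp-cast issue mentioned above 
     DataFrame df2 = df.withColumn("yearAndWeek", concat(col("year"), lit(" "), col("week"))) 
       .withColumn("month", month(unix_timestamp(col("yearAndWeek"), "yyyy w") 
         .cast("double").cast("timestamp"))) 
       .drop("yearAndWeek"); 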


In my case it just adds the time without changing the month and year. Please take a look at the gist https://gist.github.com/nareshbab/7d945ccaaae07ca743dec0ea07bb50c0 – nareshbabral


You did not copy the code correctly, so please check your code! – eliasah


Thanks, it works now – nareshbabral