Add data to an existing Apache Spark DataFrame from a CSV file


I have a Spark DataFrame with two columns, name and age, as follows:

[Row(name=u'Alice', age=2), Row(name=u'Bob', age=5)]

The DataFrame is created using

sqlContext.createDataFrame()

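Next, I need to add a third column, 'UserId', from an external CSV file. The external file has several columns, but I only need to include the first one, which is the user ID.

The number of records in the two data sources is the same. I am using a standalone version of pyspark on Windows. The end result should be a new DataFrame with three columns: UserId, Name, Age.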

Any suggestions?

Source

2016-09-16 Taie

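I did this with pandas. It allows DataFrames to be combined in many different ways.

1) First, import only the extra column (after removing the header, although this can also be done after the import) and convert it into an RDD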

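from pyspark.sql.types import StringType
from pyspark import SQLContext
sqlContext = SQLContext(sc)  # sc is the already running SparkContext
# Per step 1, the file holds only the user ID column with no header, so each line becomes a one-element list
userid_rdd = sc.textFile("C:……/userid.csv").map(lambda line: line.split(","))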

2) Convert the 'userid' RDD into a Spark DataFrame

userid_df = userid_rdd.toDF(['userid']) 
userid_df.show()

3) Convert the 'userid' DataFrame into a pandas DataFrame

userid_toPandas = userid_df.toPandas() 
userid_toPandas

4) Convert the 'predictions' DataFrame (the existing DataFrame) into a pandas DataFrame

predictions_toPandas = predictions.toPandas() 
predictions_toPandas

5) Combine the two pandas DataFrames using 'concat'

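import pandas as pd
# axis=1 places the columns side by side, pairing rows by position.
# ignore_index is left out: with axis=1 it would replace the column names with 0, 1, 2.
result = pd.concat([userid_toPandas, predictions_toPandas], axis=1)
result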
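If the data is small enough to go through pandas anyway, the Spark round trip for the CSV can probably be skipped. The following is only a minimal sketch, not what I actually ran: the helper name `add_first_csv_column`, the sample CSV text, and the assumption that the file has a header row are all made up for illustration. It reads just the first column of the CSV and puts it in front of the existing columns, pairing rows by position.

```python
import io

import pandas as pd


def add_first_csv_column(csv_source, existing, new_name="UserId"):
    """Prepend the first column of a CSV to `existing`, pairing rows by position.

    Hypothetical helper; assumes the CSV has a header row.
    """
    # usecols=[0] tells pandas to keep only the first column of the file.
    first_col = pd.read_csv(csv_source, usecols=[0]).iloc[:, 0]
    if len(first_col) != len(existing):
        raise ValueError("row counts differ; rows cannot be paired by position")
    result = existing.reset_index(drop=True).copy()
    # to_numpy() drops the index so the values are assigned purely by position.
    result.insert(0, new_name, first_col.to_numpy())
    return result


# Made-up sample data standing in for the external file and the existing frame.
csv_text = "userid,country,signup\nu1,US,2016-01-01\nu2,DE,2016-02-01\n"
people = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [2, 5]})

result = add_first_csv_column(io.StringIO(csv_text), people)
print(result)
```

Passing a file path instead of the `io.StringIO` object should work the same way. The result could then be handed back to Spark with sqlContext.createDataFrame(result) if a Spark DataFrame is needed.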

Source

2016-09-16 18:01:03 Taie

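You can do this by joining the two DataFrames, but to do that you need an ID or some other key in both tables. If the rows are in the same position in both sources, I would suggest just copying the column over in an Excel file; otherwise you don't have enough information to merge them.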

Source

2016-09-16 15:19:27 Dima

You can create a new DataFrame from the CSV and then join the two DataFrames into a new one.

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)

# Import the csv file to the SparkSQL table.
# header=True is an assumption: the query below needs a column named userid.
df = sqlContext.read.csv("abc.csv", header=True)
df.createOrReplaceTempView("table_a")

# Create a new dataframe with only the columns required. In your case only user id
df_1 = sqlContext.sql("select userid from table_a")

# Now do a join with the existing dataframe which has the original data. ([Row(name=u'Alice', age=2), Row(name=u'Bob', age=5)])
# Lets call the original alice-bob dataframe as df_ori.
# "key" is a placeholder for a column present in both dataframes (any common column if there is one, or a row index).
df_result = df_ori.join(df_1, on="key", how="inner")

Source

2017-08-19 07:21:18 Viv
