2017-07-11 99 views
0

Replace string values in a DataFrame in PySpark

I have a DataFrame with some attributes, and it looks like this:

+-------+-------+ 
| Atr1 | Atr2 | 
+-------+-------+ 
| 3,06 | 4,08 | 
| 3,03 | 4,08 | 
| 3,06 | 4,08 | 
| 3,06 | 4,08 | 
| 3,06 | 4,08 | 
| ... | ... | 
+-------+-------+ 

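As you can see, the values of Atr1 and Atr2 in the DataFrame are numbers containing a ',' character. This is because I loaded the data from a CSV file in which the decimal part of those DoubleType numbers is separated by ','.

When I load the data into the DataFrame, the values are read as strings, so I cast these attributes from string to DoubleType like this: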

df = df.withColumn("Atr1", df["Atr1"].cast(DoubleType())) 
df = df.withColumn("Atr2", df["Atr2"].cast(DoubleType())) 

However, when I do this, the values are converted to null:

+-------+-------+ 
| Atr1 | Atr2 | 
+-------+-------+ 
| null | null | 
| null | null | 
| null | null | 
| null | null | 
| null | null | 
| ... | ... | 
+-------+-------+ 

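I guess this is because DoubleType decimals must be separated by '.' rather than ','. However, I cannot edit the CSV file, so I want to replace the ',' characters in the DataFrame with '.' and then apply the cast to DoubleType.

How can I do this?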

Answers

4

You can solve this simply by using a user-defined function.

from pyspark.sql import Row
from pyspark.sql.functions import col, udf
from pyspark.sql.types import DoubleType, StringType

data = [Row(Atr1="3,06", Atr2="4,08"),
        Row(Atr1="3,06", Atr2="4,08"),
        Row(Atr1="3,06", Atr2="4,08")]

df = sqlContext.createDataFrame(data)

# Create a user-defined function that replaces ',' with '.'
comma_to_dot = udf(lambda x: x.replace(",", "."), StringType())

# Parentheses let the chained calls span several lines
out = (df
       .withColumn("Atr1", comma_to_dot(col("Atr1")).cast(DoubleType()))
       .withColumn("Atr2", comma_to_dot(col("Atr2")).cast(DoubleType())))

############################################################## 
out.show() 

+----+----+ 
|Atr1|Atr2| 
+----+----+ 
|3.06|4.08| 
|3.06|4.08| 
|3.06|4.08| 
+----+----+ 

############################################################## 

out.printSchema() 

root 
|-- Atr1: double (nullable = true) 
|-- Atr2: double (nullable = true) 

Edit: a more compact solution, as suggested in the comments below.

from pyspark.sql.functions import col, udf
from pyspark.sql.types import DoubleType

# Convert to float inside the UDF so no separate .cast() is needed
comma_to_double = udf(lambda x: float(x.replace(",", ".")), DoubleType())

out = (df
       .withColumn("Atr1", comma_to_double(col("Atr1")))
       .withColumn("Atr2", comma_to_double(col("Atr2"))))
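
One caveat with the lambdas above: a Python UDF receives None for a null cell, so calling x.replace on it raises an AttributeError, and float raises a ValueError on text that is not a number. Below is a minimal sketch of a more defensive converter. The name to_float_or_none is hypothetical, and returning None for unparseable input is an assumption about the desired behaviour, not something the question specifies.

```python
def to_float_or_none(text):
    """Parse a decimal-comma string such as "3,06"; return None if it cannot be parsed."""
    if text is None:
        # Null cells arrive in a Python UDF as None
        return None
    try:
        return float(text.strip().replace(",", "."))
    except ValueError:
        # Empty or non-numeric text becomes null instead of failing the job
        return None


# Wrapping it for Spark (assumes pyspark is installed):
# from pyspark.sql.functions import udf
# from pyspark.sql.types import DoubleType
# comma_to_double = udf(to_float_or_none, DoubleType())
```

The plain function can be checked without a Spark session, which makes the edge cases easy to test before wrapping it in a UDF.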
+0

Great! Thanks for the clear answer! – jartymcfly

+1

I was thinking the same thing. Could you skip the whole '.cast' part by doing 'lambda x: float(x.replace(',', '.')), DoubleType())'? – Adam

+0

Good suggestion! More compact. – Luis

0

You can also do it with SQL.

val df = sc.parallelize(Array(
     ("3,06", "4,08"), 
     ("3,06", "4,08"), 
     ("3,06", "4,08"), 
     ("3,06", "4,08"), 
     ("3,06", "4,08"), 
     ("3,06", "4,08"), 
     ("3,06", "4,08"), 
     ("3,06", "4,08") 
    )).toDF("a", "b") 

df.registerTempTable("test") 

val doubleDF = sqlContext.sql("select cast(trim(regexp_replace(a , ',' , '.')) as double) as a from test ") 

doubleDF.show 
+----+ 
| a| 
+----+ 
|3.06| 
|3.06| 
|3.06| 
|3.06| 
|3.06| 
|3.06| 
|3.06| 
|3.06| 
+----+ 

doubleDF.printSchema 
root 
|-- a: double (nullable = true) 
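
The snippet above is Scala, but the same SQL expression can be used from PySpark. The sketch below only builds the expression strings, so it runs without Spark. The helper name decimal_comma_expr is hypothetical, and the column names Atr1 and Atr2 are taken from the question.

```python
def decimal_comma_expr(column_name):
    """Build a Spark SQL expression that turns a decimal-comma string column into a double."""
    quoted = "`{}`".format(column_name)  # backticks quote the identifier in Spark SQL
    return "CAST(trim(regexp_replace({0}, ',', '.')) AS DOUBLE) AS {0}".format(quoted)


expressions = [decimal_comma_expr(name) for name in ["Atr1", "Atr2"]]
print(expressions[0])

# With a DataFrame `df` holding the question's string columns (assumes pyspark):
# double_df = df.selectExpr(*expressions)
```

Note that selectExpr returns only the columns listed, so any other columns that should be kept would have to be added to the list as well.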
1

Let's assume you have:

sdf.show() 
+-------+-------+ 
| Atr1| Atr2| 
+-------+-------+ 
| 3,06 | 4,08 | 
| 3,03 | 4,08 | 
| 3,06 | 4,08 | 
| 3,06 | 4,08 | 
| 3,06 | 4,08 | 
+-------+-------+ 

Then the following code will produce the desired result:

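from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

# Replace the decimal comma and convert to float inside the UDF
strToDouble = udf(lambda x: float(x.replace(",", ".")), DoubleType())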

sdf = sdf.withColumn("Atr1", strToDouble(sdf['Atr1'])) 
sdf = sdf.withColumn("Atr2", strToDouble(sdf['Atr2'])) 

sdf.show() 
+----+----+ 
|Atr1|Atr2| 
+----+----+ 
|3.06|4.08| 
|3.03|4.08| 
|3.06|4.08| 
|3.06|4.08| 
|3.06|4.08| 
+----+----+ 
0

Is it possible to pass the column names as parameters to the col() function in your example code? Something like this:

from pyspark.sql.functions import col, udf
from pyspark.sql.types import DoubleType, StringType

# Create a user-defined function that replaces ',' with '.'
comma_to_dot = udf(lambda x: x.replace(",", "."), StringType())

col_name1 = "Atr1"
col_name2 = "Atr2"

out = (df
       .withColumn(col_name1, comma_to_dot(col(col_name1)).cast(DoubleType()))
       .withColumn(col_name2, comma_to_dot(col(col_name2)).cast(DoubleType())))
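
Since col() takes the column name as an ordinary string, a variable should work there just as well as a literal. Taking the idea one step further, the names can be kept in a list and looped over. This is a rough sketch: the helper name convert_columns is hypothetical, and it relies only on withColumn and df[name], both of which Spark DataFrames provide.

```python
def convert_columns(df, column_names, convert):
    """Return df with convert() applied to each named column, keeping the column names."""
    for name in column_names:
        df = df.withColumn(name, convert(df[name]))
    return df


# Usage with Spark (assumes pyspark and a DataFrame `df` with string columns):
# from pyspark.sql.functions import regexp_replace
# out = convert_columns(df, ["Atr1", "Atr2"],
#                       lambda c: regexp_replace(c, ",", ".").cast("double"))
```

Passing the conversion in as a function keeps the loop independent of whether a UDF or a built-in such as regexp_replace does the actual work.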