Replacing a string in DataFrame values in PySpark

I have a DataFrame with some attributes, and it looks like this:

+-------+-------+ 
| Atr1 | Atr2 | 
+-------+-------+ 
| 3,06 | 4,08 | 
| 3,03 | 4,08 | 
| 3,06 | 4,08 | 
| 3,06 | 4,08 | 
| 3,06 | 4,08 | 
| ... | ... | 
+-------+-------+

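As you can see, the values of Atr1 and Atr2 in the DataFrame are numbers that contain a ',' character. This is because I loaded the data from a CSV file in which the decimals of those DoubleType numbers are written with ','.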

When I load the data into the DataFrame, the values are read as strings, so I cast these attributes from string to DoubleType like this:

df = df.withColumn("Atr1", df["Atr1"].cast(DoubleType())) 
df = df.withColumn("Atr2", df["Atr2"].cast(DoubleType()))

However, when I do this, the values are converted to null:

+-------+-------+ 
| Atr1 | Atr2 | 
+-------+-------+ 
| null | null | 
| null | null | 
| null | null | 
| null | null | 
| null | null | 
| ... | ... | 
+-------+-------+

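I think this is because DoubleType decimals must be separated by '.' rather than ','. However, I cannot edit the CSV file, so I want to replace the ',' characters in the DataFrame with '.' and then apply the cast to DoubleType.

How can I do this?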


2017-07-11 jartymcfly

You can solve this simply by using a user-defined function.

from pyspark.sql import Row
from pyspark.sql.functions import col, udf
from pyspark.sql.types import DoubleType, StringType

data = [Row(Atr1="3,06", Atr2="4,08"),
        Row(Atr1="3,06", Atr2="4,08"),
        Row(Atr1="3,06", Atr2="4,08")]

# sqlContext is assumed to exist already (for example, in the PySpark shell)
df = sqlContext.createDataFrame(data)

# Create a user-defined function that replaces ',' with '.'
comma_to_dot = udf(lambda x: x.replace(",", "."), StringType())

# Parentheses let the chained calls span several lines
out = (df
       .withColumn("Atr1", comma_to_dot(col("Atr1")).cast(DoubleType()))
       .withColumn("Atr2", comma_to_dot(col("Atr2")).cast(DoubleType())))

############################################################## 
out.show() 

+----+----+ 
|Atr1|Atr2| 
+----+----+ 
|3.06|4.08| 
|3.06|4.08| 
|3.06|4.08| 
+----+----+ 

############################################################## 

out.printSchema() 

root 
|-- Atr1: double (nullable = true) 
|-- Atr2: double (nullable = true)

Edit: a more compact solution, as suggested in the comments below.

from pyspark.sql.functions import col, udf
from pyspark.sql.types import DoubleType

# Convert to float inside the UDF, so no separate .cast() is needed
comma_to_double = udf(lambda x: float(x.replace(",", ".")), DoubleType())

out = (df
       .withColumn("Atr1", comma_to_double(col("Atr1")))
       .withColumn("Atr2", comma_to_double(col("Atr2"))))
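
One caveat with a UDF like this: the lambda calls x.replace on every value, so a null cell (None in Python) or a string that is not a number would raise an error inside the UDF. Below is a minimal sketch of a more defensive converter. The name comma_decimal_to_float is hypothetical, and returning None for input that cannot be parsed is an assumption about the desired behavior, not something stated in the answer.

```python
def comma_decimal_to_float(value):
    """Parse '3,06'-style text into a float; return None if it cannot be parsed."""
    if value is None:
        # Spark passes null cells to a Python UDF as None
        return None
    try:
        return float(value.strip().replace(",", "."))
    except ValueError:
        # Not a number even after swapping the separator
        return None


print(comma_decimal_to_float("3,06"))
print(comma_decimal_to_float(None))
print(comma_decimal_to_float("n/a"))
```

In Spark, this plain function could be wrapped as udf(comma_decimal_to_float, DoubleType()) and applied with withColumn in the same way as the lambda-based UDF.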


2017-07-11 10:43:25 Luis

Great! Thanks for the clear answer! – jartymcfly

Same thing I was thinking. Could you skip the whole '.cast' part by using 'lambda x: float(x.replace(",", ".")), DoubleType())'? – Adam

Good suggestion! More compact. – Luis

You can also do it with SQL (the example below is in Scala).

val df = sc.parallelize(Array(
     ("3,06", "4,08"), 
     ("3,06", "4,08"), 
     ("3,06", "4,08"), 
     ("3,06", "4,08"), 
     ("3,06", "4,08"), 
     ("3,06", "4,08"), 
     ("3,06", "4,08"), 
     ("3,06", "4,08") 
    )).toDF("a", "b") 

df.registerTempTable("test") 

val doubleDF = sqlContext.sql("select cast(trim(regexp_replace(a , ',' , '.')) as double) as a from test ") 

doubleDF.show 
+----+ 
| a| 
+----+ 
|3.06| 
|3.06| 
|3.06| 
|3.06| 
|3.06| 
|3.06| 
|3.06| 
|3.06| 
+----+ 

doubleDF.printSchema 
root 
|-- a: double (nullable = true)
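
The same SQL expression can also be used from PySpark, because DataFrame.selectExpr accepts SQL expression strings. The sketch below only builds those strings in plain Python, so it runs without a Spark installation. The helper name comma_to_double_expr is hypothetical, and the column names Atr1 and Atr2 are taken from the question.

```python
def comma_to_double_expr(column_name):
    """Build a Spark SQL expression that turns '3,06'-style text into a double."""
    quoted = "`{}`".format(column_name)  # backticks protect unusual column names
    return "cast(trim(regexp_replace({0}, ',', '.')) as double) as {0}".format(quoted)


exprs = [comma_to_double_expr(name) for name in ["Atr1", "Atr2"]]
for expr in exprs:
    print(expr)

# With a DataFrame `df` whose columns hold strings, the expressions would be
# applied as: out = df.selectExpr(*exprs)
```

Note that selectExpr returns only the columns that are listed, so any column that should be kept unchanged would need its own entry in the list.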


2017-07-11 10:55:40 philantrovert

Let's assume you have:

sdf.show() 
+-------+-------+ 
| Atr1| Atr2| 
+-------+-------+ 
| 3,06 | 4,08 | 
| 3,03 | 4,08 | 
| 3,06 | 4,08 | 
| 3,06 | 4,08 | 
| 3,06 | 4,08 | 
+-------+-------+

Then the following code will produce the desired result:

from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

strToDouble = udf(lambda x: float(x.replace(",",".")), DoubleType())

sdf = sdf.withColumn("Atr1", strToDouble(sdf['Atr1'])) 
sdf = sdf.withColumn("Atr2", strToDouble(sdf['Atr2'])) 

sdf.show() 
+----+----+ 
|Atr1|Atr2| 
+----+----+ 
|3.06|4.08| 
|3.03|4.08| 
|3.06|4.08| 
|3.06|4.08| 
|3.06|4.08| 
+----+----+


2017-07-11 11:02:00

Is it possible to pass column names as parameters to the col() function in your example code? Something like this:

from pyspark.sql.functions import col, udf
from pyspark.sql.types import DoubleType, StringType

# Create a user-defined function that replaces ',' with '.'
comma_to_dot = udf(lambda x: x.replace(",", "."), StringType())

col_name1 = "Atr1"
col_name2 = "Atr2"

out = (df
       .withColumn(col_name1, comma_to_dot(col(col_name1)).cast(DoubleType()))
       .withColumn(col_name2, comma_to_dot(col(col_name2)).cast(DoubleType())))


2017-10-25 22:13:32
