2017-03-10

Casting string to int null issue

I have a Spark DataFrame, results, with two string columns that I would like to cast to numeric:

>>> results.show() 
+--------------------+-----------------+------------------------+
|       Hospital Name|HCAHPS Base Score|HCAHPS Consistency Score|
+--------------------+-----------------+------------------------+
|"ADIRONDACK MEDIC...|             "43"|                    "20"|
|"BAYLOR MEDICAL C...|             "32"|                    "20"|
|"GOOD SHEPHERD ME...|             "25"|                    "20"|
|"GOOD SHEPHERD ME...|             "25"|                    "20"|
|"MASONIC HOME AND...|  "Not Available"|         "Not Available"|
|"ST HELENA HOSPITAL"|             "41"|                    "20"|
|   "TOURO INFIRMARY"|             "15"|                    "18"|
|"WAHIAWA GENERAL ...|             "17"|                    "10"|
|"ANNA JAQUES HOSP...|             "27"|                    "18"|
|    "CMC-BLUE RIDGE"|             "31"|                    "18"|
|"EVANSTON REGIONA...|             "15"|                    "15"|
|"OKLAHOMA SPINE H...|             "79"|                    "20"|
|"PICKENS COUNTY M...|  "Not Available"|         "Not Available"|
|"PORTNEUF MEDICAL...|             "11"|                    "17"|
|"PRESENCE SAINT J...|             "20"|                    "17"|
|"RIVERSIDE MEDICA...|             "39"|                    "20"|
|"RIVERSIDE MEDICA...|             "39"|                    "20"|
|"RIVERSIDE MEDICA...|             "39"|                    "20"|
|"SOUTH GEORGIA ME...|    "3 out of 10"|                    "24"|
|"TAMPA GENERAL HO...|             "23"|                    "16"|
+--------------------+-----------------+------------------------+

Trying the cast like this gives me a table of nulls:

>>> from pyspark.sql.types import IntegerType
>>> results2 = results.select(results["Hospital Name"], results["HCAHPS Base Score"].cast(IntegerType()).alias("HCAHPS Base Score"), results["HCAHPS Consistency Score"].cast(IntegerType()).alias("HCAHPS Consistency Score"))
>>> results2.show() 
+--------------------+-----------------+------------------------+
|       Hospital Name|HCAHPS Base Score|HCAHPS Consistency Score|
+--------------------+-----------------+------------------------+
|"ADIRONDACK MEDIC...|             null|                    null|
|"BAYLOR MEDICAL C...|             null|                    null|
|"GOOD SHEPHERD ME...|             null|                    null|
|"GOOD SHEPHERD ME...|             null|                    null|
|"MASONIC HOME AND...|             null|                    null|
|"ST HELENA HOSPITAL"|             null|                    null|
|   "TOURO INFIRMARY"|             null|                    null|
|"WAHIAWA GENERAL ...|             null|                    null|
|"ANNA JAQUES HOSP...|             null|                    null|
|    "CMC-BLUE RIDGE"|             null|                    null|
|"EVANSTON REGIONA...|             null|                    null|
|"OKLAHOMA SPINE H...|             null|                    null|
|"PICKENS COUNTY M...|             null|                    null|
|"PORTNEUF MEDICAL...|             null|                    null|
|"PRESENCE SAINT J...|             null|                    null|
|"RIVERSIDE MEDICA...|             null|                    null|
|"RIVERSIDE MEDICA...|             null|                    null|
|"RIVERSIDE MEDICA...|             null|                    null|
|"SOUTH GEORGIA ME...|             null|                    null|
|"TAMPA GENERAL HO...|             null|                    null|
+--------------------+-----------------+------------------------+

only showing top 20 rows 

Is it not possible to cast string columns to integers in PySpark?

Answer


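First you need to strip the double quotes; after that you should be able to cast to IntegerType. The quote characters are part of the string values, and casting a string that is not a valid integer yields null. You can strip them with the UDF below.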

>>> def stripDQ(string):
...     # Remove every double-quote character from the value
...     return string.replace('"', "")
...
>>> from pyspark.sql.functions import udf 
>>> from pyspark.sql.types import StringType, IntegerType 
>>> udf_stripDQ = udf(stripDQ, StringType()) 
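One caveat worth noting: Spark hands a null cell to a Python UDF as None, and None has no replace method, so stripDQ would raise on a column that contains real nulls. A minimal null-safe sketch, checked here in plain Python (the name strip_dq_safe and the sample list are hypothetical, not from the original answer):

```python
from typing import Optional


def strip_dq_safe(value: Optional[str]) -> Optional[str]:
    """Remove double quotes; pass None through unchanged."""
    if value is None:
        # A null cell arrives in a Python UDF as None,
        # which has no .replace method.
        return None
    return value.replace('"', "")


# Plain-Python check on values shaped like those in the table above.
samples = ['"43"', '"Not Available"', None]
cleaned = [strip_dq_safe(s) for s in samples]
print(cleaned)  # ['43', 'Not Available', None]
```

It could be wrapped with udf(strip_dq_safe, StringType()) in the same way as stripDQ.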

We will use udf_stripDQ below.

Your actual DataFrame:

>>> results.show() 
+------------------+-----------------+------------------------+
|     Hospital Name|HCAHPS Base Score|HCAHPS Consistency Score|
+------------------+-----------------+------------------------+
|"ADIRONDACK MEDIC"|             "43"|                    "20"|
|"BAYLOR MEDICAL C"|             "32"|                    "20"|
|"GOOD SHEPHERD ME"|             "25"|                    "20"|
|"GOOD SHEPHERD ME"|             "25"|                    "20"|
|"MASONIC HOME AND"|  "Not Available"|         "Not Available"|
+------------------+-----------------+------------------------+

Now we use our UDF to strip the double quotes from both columns.

>>> results1 = results.withColumn("HCAHPS Base Score", udf_stripDQ(results["HCAHPS Base Score"])).withColumn("HCAHPS Consistency Score", udf_stripDQ(results["HCAHPS Consistency Score"])) 
>>> results1.show() 
+------------------+-----------------+------------------------+
|     Hospital Name|HCAHPS Base Score|HCAHPS Consistency Score|
+------------------+-----------------+------------------------+
|"ADIRONDACK MEDIC"|               43|                      20|
|"BAYLOR MEDICAL C"|               32|                      20|
|"GOOD SHEPHERD ME"|               25|                      20|
|"GOOD SHEPHERD ME"|               25|                      20|
|"MASONIC HOME AND"|    Not Available|           Not Available|
+------------------+-----------------+------------------------+

Now cast to integer. Values that are not valid integers, such as Not Available, become null:

>>> results2 = results1.select(results1["Hospital Name"], results1["HCAHPS Base Score"].cast(IntegerType()).alias("HCAHPS Base Score"), results1["HCAHPS Consistency Score"].cast(IntegerType()).alias("HCAHPS Consistency Score"))
>>> results2.show()
+------------------+-----------------+------------------------+
|     Hospital Name|HCAHPS Base Score|HCAHPS Consistency Score|
+------------------+-----------------+------------------------+
|"ADIRONDACK MEDIC"|               43|                      20|
|"BAYLOR MEDICAL C"|               32|                      20|
|"GOOD SHEPHERD ME"|               25|                      20|
|"GOOD SHEPHERD ME"|               25|                      20|
|"MASONIC HOME AND"|             null|                    null|
+------------------+-----------------+------------------------+