2016-10-24

Spark: Is there a way to sort (sort or orderBy) on a column case-insensitively?

I need to sort rows in a case-insensitive way.

My data looks like this:

+---+---------------+--------------------+--------------------+------+--------------+
| id|      full_name|           job_title|               email|gender|    ip_address|
+---+---------------+--------------------+--------------------+------+--------------+
| 73|     Tina Mccoy|Desktop Support T...|   [email protected]|Female| 23.196.170.54|
| 74|      Lois Hart|        Food Chemist|   [email protected]|Female| 145.52.30.236|
| 75|    Thomas Hall|    Senior Developer|   [email protected]|  Male|76.255.197.231|
| 76|  Ernest Romero|             Teacher|   [email protected]|  Male|  99.21.57.239|
| 77|  Irene Bradley| Assistant Professor|   [email protected]|Female| 16.51.179.230|
| 78|Jacqueline Cruz|account Represent...|   [email protected]|Female| 167.49.98.213|
| 79|    Sara Martin|        Geologist IV|   [email protected]|Female| 10.145.49.204|
| 80| Johnny Bradley| Executive Secretary|   [email protected]|  Male| 138.251.4.102|
| 81|      Fred Dean|Nuclear Power Eng...|   [email protected]|  Male| 173.10.122.12|
| 82|   Ralph Greene|       Senior Editor|   [email protected]|  Male| 57.230.33.105|
+---+---------------+--------------------+--------------------+------+--------------+

When I sort on job_title with df.orderBy('job_title'), this is what I get:

+---+---------------+--------------------+--------------------+------+--------------+
| id|      full_name|           job_title|               email|gender|    ip_address|
+---+---------------+--------------------+--------------------+------+--------------+
| 77|  Irene Bradley| Assistant Professor|   [email protected]|Female| 16.51.179.230|
| 73|     Tina Mccoy|Desktop Support T...|   [email protected]|Female| 23.196.170.54|
| 80| Johnny Bradley| Executive Secretary|   [email protected]|  Male| 138.251.4.102|
| 74|      Lois Hart|        Food Chemist|   [email protected]|Female| 145.52.30.236|
| 79|    Sara Martin|        Geologist IV|   [email protected]|Female| 10.145.49.204|
| 81|      Fred Dean|Nuclear Power Eng...|   [email protected]|  Male| 173.10.122.12|
| 75|    Thomas Hall|    Senior Developer|   [email protected]|  Male|76.255.197.231|
| 82|   Ralph Greene|       Senior Editor|   [email protected]|  Male| 57.230.33.105|
| 76|  Ernest Romero|             Teacher|   [email protected]|  Male|  99.21.57.239|
| 78|Jacqueline Cruz|account Represent...|   [email protected]|Female| 167.49.98.213|
+---+---------------+--------------------+--------------------+------+--------------+

But what I need is:

+---+---------------+--------------------+--------------------+------+--------------+
| id|      full_name|           job_title|               email|gender|    ip_address|
+---+---------------+--------------------+--------------------+------+--------------+
| 78|Jacqueline Cruz|account Represent...|   [email protected]|Female| 167.49.98.213|
| 77|  Irene Bradley| Assistant Professor|   [email protected]|Female| 16.51.179.230|
| 73|     Tina Mccoy|Desktop Support T...|   [email protected]|Female| 23.196.170.54|
| 80| Johnny Bradley| Executive Secretary|   [email protected]|  Male| 138.251.4.102|
| 74|      Lois Hart|        Food Chemist|   [email protected]|Female| 145.52.30.236|
| 79|    Sara Martin|        Geologist IV|   [email protected]|Female| 10.145.49.204|
| 81|      Fred Dean|Nuclear Power Eng...|   [email protected]|  Male| 173.10.122.12|
| 75|    Thomas Hall|    Senior Developer|   [email protected]|  Male|76.255.197.231|
| 82|   Ralph Greene|       Senior Editor|   [email protected]|  Male| 57.230.33.105|
| 76|  Ernest Romero|             Teacher|   [email protected]|  Male|  99.21.57.239|
+---+---------------+--------------------+--------------------+------+--------------+

Answers

1

You can pass a computed expression as the argument to orderBy, so you can import the lower function:

from pyspark.sql.functions import col, lower 

and wrap the column with it:

df.orderBy(lower(col("job_title"))) 
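
For intuition on why the plain orderBy('job_title') puts "account Represent..." last: the default string comparison is case-sensitive, and uppercase letters sort before lowercase ones. The snippet below is plain Python rather than Spark, offered only as a sketch that mirrors the same ordering rule on a few job titles from the question. The variable names are illustrative.

```python
# A few job titles taken from the question's sample data.
job_titles = ["Teacher", "account Represent...", "Assistant Professor", "Senior Editor"]

# Default ordering compares code points: "A"-"Z" (65-90) sort before "a"-"z" (97-122),
# so the lowercase "account ..." ends up after every capitalized title.
default_order = sorted(job_titles)

# Lowercasing the sort key removes the case difference, which is the same idea
# as sorting on lower(col("job_title")) in Spark.
case_insensitive_order = sorted(job_titles, key=str.lower)

print(default_order)
print(case_insensitive_order)
```

Note that only the sort key is lowercased; the values shown in the result keep their original case, just as they do with orderBy(lower(col("job_title"))).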
1

A simple solution is to create a job_title_lower_case column from that column and sort on it. In the final result, just drop the new column.
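
A minimal sketch of that helper-column approach, assuming df is the Spark DataFrame from the question. The function name sort_case_insensitive and its parameters are hypothetical, introduced here only for illustration. It uses selectExpr with the Spark SQL lower function so that no extra imports are needed; withColumn with lower(col(...)) would be an equivalent way to add the column.

```python
def sort_case_insensitive(df, column, helper="job_title_lower_case"):
    """Sort a Spark DataFrame by `column`, ignoring case, via a temporary column.

    `helper` is a hypothetical name for the temporary column; it should not
    clash with an existing column of `df`.
    """
    # Add a lowercased copy of the column using a Spark SQL expression.
    with_helper = df.selectExpr("*", f"lower(`{column}`) AS `{helper}`")
    # Sort on the helper column, then drop it so the result has the original columns.
    return with_helper.orderBy(helper).drop(helper)
```

It could then be called as sort_case_insensitive(df, "job_title").show(). Compared with passing lower(col("job_title")) directly to orderBy, this adds and removes a column for the same ordering, so it is mostly useful when the lowercased value is needed for something else as well.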
