Apply StringIndexer to a PySpark DataFrame, but in alphabetical order

How can I apply a StringIndexer in PySpark so that the categories are indexed in alphabetical order?

After applying StringIndexer I have a dictionary of my index values, but I want them ordered differently.

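from pyspark.ml.feature import StringIndexer

# df is the input DataFrame containing a "gender" column
index_df = StringIndexer(inputCol="gender", outputCol="genderIndex").fit(df).transform(df)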

meta = [f.metadata for f in index_df.schema.fields if f.name == "genderIndex"] 
meta 
[{u'ml_attr': {u'name': u'genderIndex', 
    u'type': u'nominal', 
    u'vals': [u'Male', u'Female']}}] 

a = dict(enumerate(meta[0]["ml_attr"]["vals"]))
a
{0: u'Male', 1: u'Female'}

However, I would like Female to be 0, for example. More generally, if the values are A, B, C, I want A = 0, B = 1, C = 2, and so on.

Source

2017-08-25 Learner

StringIndexer assigns indices to the column's labels based on label frequency. For your case, I think we may have to write a custom transformer to do this. – Suresh

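I don't know your use case, but if you only want to save the indexed column into a dictionary and don't plan to use it in an ML pipeline, order the column and apply a dense rank. That may help you. – Suresh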
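To illustrate the difference the comments describe, here is a minimal plain-Python sketch (no Spark involved) that contrasts frequency-based indexing with the alphabetical ordering asked for. The helper names `frequency_index` and `alphabetical_index` are hypothetical, and breaking frequency ties alphabetically is an assumption made only for this sketch.

```python
from collections import Counter


def frequency_index(values):
    """Most frequent label gets index 0.

    Ties are broken alphabetically here; that tie-break is an
    assumption of this sketch.
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {label: i for i, label in enumerate(ordered)}


def alphabetical_index(values):
    """Labels are indexed in sorted order, like a dense rank minus 1."""
    ordered = sorted(set(values))
    return {label: i for i, label in enumerate(ordered)}


labels = ["b", "b", "c", "c", "c", "a"]
freq_map = frequency_index(labels)      # 'c' is most frequent, so it gets 0
alpha_map = alphabetical_index(labels)  # 'a' sorts first, so it gets 0
print(freq_map)
print(alpha_map)
```

With this sample, the two mappings are exact opposites, which is why the frequency-based default does not give A = 0, B = 1, C = 2 in general.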

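In Spark 2.3.0, StringIndexer will get a stringOrderType parameter (related jira issue), but in versions < 2.3.0 you will need to create a custom transformer. For example, you can get all the distinct values, add an index, and join the result with the initial DF, like this: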

from pyspark.sql.functions import row_number
from pyspark.sql.window import Window

df = spark.createDataFrame([(10, 'b'), (20, 'b'), (30, 'c'),
                            (40, 'c'), (50, 'c'), (60, 'a')], ['col1', 'col2'])
# Number the distinct values in alphabetical order, starting from 0
col2_index = df.select('col2').distinct() \
    .withColumn('col2Index', row_number().over(Window.orderBy('col2')) - 1)
col2_index.show()

+----+---------+
|col2|col2Index|
+----+---------+
|   a|        0|
|   b|        1|
|   c|        2|
+----+---------+

df.join(col2_index, 'col2').show() 

+----+----+---------+
|col2|col1|col2Index|
+----+----+---------+
|   c|  30|        2|
|   c|  40|        2|
|   c|  50|        2|
|   b|  10|        1|
|   b|  20|        1|
|   a|  60|        0|
+----+----+---------+

Or, if you don't need a transformer or a dictionary to be created, you can just use dense_rank, as @Suresh noted in the comments:

from pyspark.sql.functions import dense_rank

df.withColumn('col2Index', dense_rank().over(Window.orderBy('col2')) - 1).show()

Source

2017-09-02 13:38:31 Mariusz
