How to perform label encoding of categorical values in Apache Spark

I have a dataset that contains string columns. How can I encode the string-based columns, the way we do with LabelEncoder in scikit-learn?

2015-06-01 Abhishek Choudhary

Do you want LabelEncoder or OneHotEncoder? – elyase

I would prefer LabelEncoder, but I would not turn down OneHotEncoder if it is available in PySpark –

We are developing sparkit-learn, which aims to provide scikit-learn functionality and API on PySpark. You can use SparkLabelEncoder as follows:

$ pip install sparkit-learn

>>> from splearn.preprocessing import SparkLabelEncoder 
>>> from splearn import BlockRDD 
>>> 
>>> data = ["paris", "paris", "tokyo", "amsterdam"] 
>>> y = BlockRDD(sc.parallelize(data)) 
>>> 
>>> le = SparkLabelEncoder() 
>>> le.fit(y) 
>>> le.classes_ 
array(['amsterdam', 'paris', 'tokyo'], 
     dtype='|S9') 
>>> 
>>> test = ["tokyo", "tokyo", "paris"] 
>>> y_test = BlockRDD(sc.parallelize(test)) 
>>> 
>>> le.transform(y_test).toarray() 
array([2, 2, 1]) 
>>> 
>>> test = [2, 2, 1] 
>>> y_test = BlockRDD(sc.parallelize(test)) 
>>> 
>>> le.inverse_transform(y_test).toarray() 
array(['tokyo', 'tokyo', 'paris'], 
     dtype='|S9')
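
For readers who want to see what the encoder is doing, here is a minimal plain-Python sketch of the same idea (no Spark, sparkit-learn, or scikit-learn involved). The helper names `fit_classes`, `transform`, and `inverse_transform` are hypothetical and exist only for this illustration; the sketch assumes, as the `classes_` output above suggests, that labels are numbered in sorted order.

```python
# Plain-Python illustration of label encoding.
# Helper names are hypothetical; this is not the sparkit-learn implementation.

def fit_classes(values):
    """Return the sorted list of distinct labels (analogous to classes_)."""
    return sorted(set(values))

def transform(classes, values):
    """Map each label to its position in the sorted class list."""
    index = {label: i for i, label in enumerate(classes)}
    return [index[v] for v in values]

def inverse_transform(classes, codes):
    """Map integer codes back to their original labels."""
    return [classes[c] for c in codes]

classes = fit_classes(["paris", "paris", "tokyo", "amsterdam"])
encoded = transform(classes, ["tokyo", "tokyo", "paris"])
decoded = inverse_transform(classes, encoded)
print(classes, encoded, decoded)
```

A label that was not seen during fitting raises a `KeyError` in this sketch, so the fit data should cover every value you later transform.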


2015-06-24 13:24:25 kszucs

StringIndexer is what you need: https://spark.apache.org/docs/1.5.1/ml-features.html#stringindexer

from pyspark.ml.feature import StringIndexer 

df = sqlContext.createDataFrame(
      [(0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c")], 
      ["id", "category"]) 
indexer = StringIndexer(inputCol="category", outputCol="categoryIndex") 
indexed = indexer.fit(df).transform(df) 
indexed.show()
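
Note that, per the linked documentation, StringIndexer orders indices by label frequency, so the most frequent label gets index 0 (unlike scikit-learn's LabelEncoder, which sorts the labels). Below is a rough plain-Python sketch of that ordering, without Spark. The function name `string_index` is hypothetical, and breaking ties alphabetically is an assumption of this sketch rather than something the linked page states.

```python
from collections import Counter

def string_index(values):
    """Assign indices by descending label frequency (hypothetical helper)."""
    counts = Counter(values)
    # Most frequent label first; the alphabetical tie-break is an assumption.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

categories = ["a", "b", "c", "a", "a", "c"]
mapping = string_index(categories)
category_index = [mapping[v] for v in categories]
print(mapping)
print(category_index)
```

With the same six rows as the DataFrame above, "a" occurs three times, "c" twice, and "b" once, so they map to 0.0, 1.0, and 2.0 respectively.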


2016-07-24 08:57:23
