2015-06-01 83 views

Answers

4

We are developing sparkit-learn, which aims to provide scikit-learn functionality and API on PySpark. You can use SparkLabelEncoder as follows:

$ pip install sparkit-learn 
>>> from splearn.preprocessing import SparkLabelEncoder 
>>> from splearn import BlockRDD 
>>> 
>>> data = ["paris", "paris", "tokyo", "amsterdam"] 
>>> y = BlockRDD(sc.parallelize(data)) 
>>> 
>>> le = SparkLabelEncoder() 
>>> le.fit(y) 
>>> le.classes_ 
array(['amsterdam', 'paris', 'tokyo'], 
     dtype='|S9') 
>>> 
>>> test = ["tokyo", "tokyo", "paris"] 
>>> y_test = BlockRDD(sc.parallelize(test)) 
>>> 
>>> le.transform(y_test).toarray() 
array([2, 2, 1]) 
>>> 
>>> test = [2, 2, 1] 
>>> y_test = BlockRDD(sc.parallelize(test)) 
>>> 
>>> le.inverse_transform(y_test).toarray() 
array(['tokyo', 'tokyo', 'paris'], 
     dtype='|S9') 
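To make the session above easier to follow, here is a minimal pure-Python sketch of what a label encoder does conceptually: collect the distinct labels in sorted order, then map each label to its position. This is not sparkit-learn's implementation, and the helper names (`fit_classes`, `transform`, `inverse_transform`) are hypothetical, chosen only to mirror the method names used above.

```python
# Conceptual sketch of label encoding; NOT the sparkit-learn implementation.
# All function names here are hypothetical helpers for illustration.

def fit_classes(labels):
    """Return the distinct labels in sorted order (analogous to `classes_`)."""
    return sorted(set(labels))


def transform(classes, labels):
    """Map each label to its position in `classes`."""
    index = {label: i for i, label in enumerate(classes)}
    return [index[label] for label in labels]


def inverse_transform(classes, codes):
    """Map integer codes back to the original labels."""
    return [classes[code] for code in codes]


classes = fit_classes(["paris", "paris", "tokyo", "amsterdam"])
encoded = transform(classes, ["tokyo", "tokyo", "paris"])
decoded = inverse_transform(classes, encoded)

print(classes)
print(encoded)
print(decoded)
```

The encoded values agree with the `[2, 2, 1]` shown in the session because the classes are sorted alphabetically before indexing.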
1

StringIndexer is what you need: https://spark.apache.org/docs/1.5.1/ml-features.html#stringindexer

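from pyspark.ml.feature import StringIndexer

# sqlContext is the SQLContext that the PySpark shell provides
df = sqlContext.createDataFrame(
    [(0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c")],
    ["id", "category"])

# Map each distinct string in "category" to a numeric index in "categoryIndex";
# indices are ordered by label frequency, so the most frequent label gets 0.0
indexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
indexed = indexer.fit(df).transform(df)
indexed.show()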
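Note that, per the linked documentation, StringIndexer orders indices by label frequency rather than alphabetically, which differs from the label encoder in the first answer. The pure-Python sketch below only imitates that ordering so you can see which index each category would get; it is not Spark code, `frequency_index` is a hypothetical helper, and the alphabetical tie-break is an assumption made here for determinism (the sample data has no ties).

```python
from collections import Counter


def frequency_index(values):
    """Assign indices by descending frequency (hypothetical helper, not a Spark API)."""
    counts = Counter(values)
    # Most frequent label first; ties broken alphabetically (an assumption).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


categories = ["a", "b", "c", "a", "a", "c"]
mapping = frequency_index(categories)
indexed_values = [mapping[c] for c in categories]

print(mapping)
print(indexed_values)
```

Here "a" occurs three times, "c" twice, and "b" once, so they map to 0.0, 1.0, and 2.0 respectively.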