2016-07-09 45 views

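I am using the Cloudera 5.2 VM and pandas 0.18.0, and I want to apply k-means to my DataFrame, but it has many columns. How can k-means be used with categorical attributes in pandas?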

My DataFrame is:

adClicksPerTime.head(n=5) 
Out[50]: 
             timestamp   adCategory  userId  totalAdClicks
0  2016-05-26 15:00:00   automotive     355              1
1  2016-05-26 15:00:00     clothing    1027              1
2  2016-05-26 15:00:00    computers    1821              1
3  2016-05-26 15:00:00    computers    2139              1
4  2016-05-26 15:00:00  electronics     253              1

for col in adClicksPerTime: 
    print(col) 
    print(type(adClicksPerTime[col][1])) 


timestamp 
<class 'pandas.tslib.Timestamp'> 
adCategory 
<class 'str'> 
userId 
<class 'numpy.int64'> 
totalAdClicks 
<class 'numpy.int64'> 

When I run k-means, I get:

ValueError: could not convert string to float: 'automotive' 

I tried converting my string column to a categorical type so that I could assign numeric codes afterwards:

adClicksPerTime.adCategory = pd.Categorical.from_array(adClicksPerTime.adCategory)  

adClicksPerTime.head(n=5) 
Out[54]: 
             timestamp   adCategory  userId  totalAdClicks
0  2016-05-26 15:00:00   automotive     355              1
1  2016-05-26 15:00:00     clothing    1027              1
2  2016-05-26 15:00:00    computers    1821              1
3  2016-05-26 15:00:00    computers    2139              1
4  2016-05-26 15:00:00  electronics     253              1

for col in adClicksPerTime: 
    print(col) 
    print(type(adClicksPerTime[col][1])) 


timestamp 
<class 'pandas.tslib.Timestamp'> 
adCategory 
<class 'str'> 
userId 
<class 'numpy.int64'> 
totalAdClicks 
<class 'numpy.int64'> 
The error persists.

How can I apply k-means to a str column?

k-means is only meant for **continuous** variables. Don't use it on this kind of data!

Answer

pd.get_dummies will convert the categories into dummy (indicator) variables.

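# One-hot encode the categorical column (the column name must be a quoted string)
dummies = pd.get_dummies(adClicksPerTime['adCategory'])
# Drop one level so it acts as the reference category
del dummies['automotive']
print(dummies.columns)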

Then combine this DataFrame with the adClicksPerTime DataFrame, and finally apply k-means.
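
As a rough sketch of those last two steps: the answer does not say which k-means implementation to use, so scikit-learn's KMeans is assumed here, and the toy frame only mirrors the first five rows from the question. Leaving timestamp and userId out of the features, and using two clusters, are arbitrary choices made for illustration.

```python
import pandas as pd
from sklearn.cluster import KMeans  # assumption: scikit-learn is the k-means library

# Toy data mirroring the first rows of the question's DataFrame
adClicksPerTime = pd.DataFrame({
    "adCategory": ["automotive", "clothing", "computers", "computers", "electronics"],
    "userId": [355, 1027, 1821, 2139, 253],
    "totalAdClicks": [1, 1, 1, 1, 1],
})

# One-hot encode the category and drop one level as the reference
dummies = pd.get_dummies(adClicksPerTime["adCategory"]).astype(float)
dummies = dummies.drop("automotive", axis=1)

# Combine on the index, keeping only numeric feature columns
# (userId is an identifier, so it is left out of this sketch)
features = pd.concat([adClicksPerTime[["totalAdClicks"]], dummies], axis=1)

# n_clusters=2 is an arbitrary choice for this example
model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(features)
print(labels)
```

Every column passed to the model is numeric here, so the "could not convert string to float" error no longer applies. Whether clusters built on dummy variables are meaningful is a separate question, since k-means assumes continuous features.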

adClicksPerTime.info() will show you the dtypes.
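
This may also explain why the loop in the question still printed str after the conversion: it inspects the type of a single element, not the dtype of the column. A small sketch on a stand-alone Series (the name s and the use of astype("category") are choices made for illustration, not taken from the question):

```python
import pandas as pd

s = pd.Series(["automotive", "clothing", "computers", "computers", "electronics"])
cat = s.astype("category")

print(cat.dtype)       # the column dtype is now category
print(type(cat[1]))    # an individual element is still a str

# Integer code per row, numbered by the sorted category order
codes = cat.cat.codes
print(list(codes))
```

The integer codes are numeric, but they impose an arbitrary ordering on the categories, which is one reason dummy variables are usually a safer input for a distance-based method.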