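Using k-means, I get an error: array with 0 feature(s)

I am trying to cluster my csv data using matplotlib and k-means. When I run k-means, I get an error: "Found array with 0 feature(s)".

My csv data is about energy consumption. https://github.com/camenergydatalab/EnergyDataSimulationChallenge/blob/master/challenge2/data/total_watt.csv

I want to divide the values for each day into 3 groups: low, medium, and high energy consumption.

Here is my code.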

import numpy as np 
import matplotlib.pyplot as plt 
from matplotlib import style 
style.use('ggplot') 
import pandas as pd 
from sklearn.cluster import KMeans 



MY_FILE='total_watt.csv' 
date = [] 
consumption = [] 


df = pd.read_csv(MY_FILE, parse_dates=[0], index_col=[0]) 
df = df.resample('1D', how='sum') 


for row in df: 
    if len(row) ==2 : 
     date.append(row[0]) 
     consumption.append(row[1]) 


import datetime 
for x in range(len(date)): 
    date[x]=datetime.datetime.strptime(date[x], '%Y-%m-%d %H:%M:%S') 

X = np.array([date, consumption]) 
kmeans = KMeans(n_clusters=3) 
kmeans.fit(X) 

centroids = kmeans.cluster_centers_ 
labels = kmeans.labels_ 

print(centroids) 
print(labels) 

colors = ["b.","g.","r."] 

for i in range(len(X)): 
    print("coordinate:",X[i], "label:", labels[i]) 
    plt.plot(X[i][0], X[i][1], colors[labels[i]], markersize = 10) 

plt.scatter(centroids[:, 0],centroids[:, 1], marker = "x", s=150, linewidths = 5, zorder = 10) 

plt.show()

However, when I run this code, I get the following error:

(DataVizProj)Soma-Suzuki:Soma Suzuki$ python 4.clusters.py 
Traceback (most recent call last): 
    File "4.clusters.py", line 31, in <module> 
    kmeans.fit(X) 
    File "/Users/Suzuki/Envs/DataVizProj/lib/python2.7/site-packages/sklearn/cluster/k_means_.py", line 785, in fit 
    X = self._check_fit_data(X) 
    File "/Users/Suzuki/Envs/DataVizProj/lib/python2.7/site-packages/sklearn/cluster/k_means_.py", line 755, in _check_fit_data 
    X = check_array(X, accept_sparse='csr', dtype=np.float64) 
    File "/Users/Suzuki/Envs/DataVizProj/lib/python2.7/site-packages/sklearn/utils/validation.py", line 367, in check_array 
    % (n_features, shape_repr, ensure_min_features)) 
ValueError: Found array with 0 feature(s) (shape=(2, 0)) while a minimum of 1 is required.

How can I cluster my csv data correctly?

EDIT -----------------------------------------------------

Here is my new code. Thank you!

import numpy as np 
import matplotlib.pyplot as plt 
from matplotlib import style 
style.use('ggplot') 
import pandas as pd 
from sklearn.cluster import KMeans 



MY_FILE='total_watt.csv' 
date = [] 
consumption = [] 


df = pd.read_csv(MY_FILE, parse_dates=[0], index_col=[0]) 
df = df.resample('1D', how='sum') 
df = df.dropna() 

date = df.index.tolist() 
consumption = df[df.columns[0]].values 



X = np.array([date, consumption]) 
kmeans = KMeans(n_clusters=3) 
kmeans.fit(X) 

centroids = kmeans.cluster_centers_ 
labels = kmeans.labels_ 

print(centroids) 
print(labels) 

colors = ["b.","g.","r."] 

for i in range(len(X)): 
    print("coordinate:",X[i], "label:", labels[i]) 
    plt.plot(X[i][0], X[i][1], colors[labels[i]], markersize = 10) 

plt.scatter(centroids[:, 0],centroids[:, 1], marker = "x", s=150, linewidths = 5, zorder = 10) 

plt.show()

And the new error...

(DataVizProj)Soma-Suzuki:Soma Suzuki$ python 4.clusters.py 
Traceback (most recent call last): 
    File "4.clusters.py", line 26, in <module> 
    kmeans.fit(X) 
    File "/Users/Suzuki/Envs/DataVizProj/lib/python2.7/site-packages/sklearn/cluster/k_means_.py", line 785, in fit 
    X = self._check_fit_data(X) 
    File "/Users/Suzuki/Envs/DataVizProj/lib/python2.7/site-packages/sklearn/cluster/k_means_.py", line 755, in _check_fit_data 
    X = check_array(X, accept_sparse='csr', dtype=np.float64) 
    File "/Users/Suzuki/Envs/DataVizProj/lib/python2.7/site-packages/sklearn/utils/validation.py", line 344, in check_array 
    array = np.array(array, dtype=dtype, order=order, copy=copy) 
TypeError: float() argument must be a string or a number

EDITED2 ----------------------------- ------------

Thank you, Jianxun!!

I finally succeeded in clustering my csv data! Thank you so much!

import numpy as np 
import matplotlib.pyplot as plt 
from matplotlib import style 
style.use('ggplot') 
import pandas as pd 
from sklearn.cluster import KMeans 



MY_FILE='total_watt.csv' 
date = [] 
consumption = [] 


df = pd.read_csv(MY_FILE, parse_dates=[0], index_col=[0]) 
df = df.resample('1D', how='sum') 
df = df.dropna() 

date = df.index.tolist() 
date = [x.strftime('%Y-%m-%d') for x in date] 
from sklearn.preprocessing import LabelEncoder 

encoder = LabelEncoder() 
date_numeric = encoder.fit_transform(date) 
consumption = df[df.columns[0]].values 

X = np.array([date_numeric, consumption]).T 




kmeans = KMeans(n_clusters=3) 
kmeans.fit(X) 

centroids = kmeans.cluster_centers_ 
labels = kmeans.labels_ 

print(centroids) 
print(labels) 

colors = ["b.","r.","g."] 

for i in range(len(X)): 
    print("coordinate:",X[i], "label:", labels[i]) 
    plt.plot(X[i][0], X[i][1], colors[labels[i]], markersize = 10) 

plt.scatter(centroids[:, 0],centroids[:, 1], marker = "x", s=150, linewidths = 5, zorder = 10) 

plt.show()

But as you can see in the resulting plot, the x-axis does not reflect the time, even though we set it correctly....

Source

2015-07-10 Suzuki Soma

If you want to visualize the distribution of consumption, you should consider using a histogram.

The first problem:

for row in df: 
    if len(row) ==2 : 
     date.append(row[0]) 
     consumption.append(row[1])

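This leaves you with unexpectedly empty lists date and consumption, because for row in df actually loops over the columns rather than the rows. That is why the error message says the array has 0 features.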
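To make the point above concrete, here is a minimal sketch of what iterating a DataFrame yields. The two-row frame and the column name consumption are made up for illustration and are not taken from total_watt.csv.

```python
import pandas as pd

# Hypothetical two-day frame standing in for the resampled data (assumption).
df = pd.DataFrame(
    {"consumption": [100.0, 250.0]},
    index=pd.to_datetime(["2011-04-18", "2011-04-19"]),
)

# Iterating a DataFrame directly yields its column labels, not its rows.
labels_seen = [item for item in df]
print(labels_seen)  # ['consumption']

# Each item is a label string, so `len(row) == 2` is never true here
# and nothing gets appended to the lists.

# To walk over rows instead, iterate explicitly, for example with itertuples().
rows = [(row.Index, row.consumption) for row in df.itertuples()]
```

With itertuples(), each element carries the index value (the date) and the column value, which is what the original loop was trying to collect.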

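Also, I noticed that consumption contains 2 NaN values, so you need df = df.dropna() (or you can impute those missing values), because sklearn does not tolerate NaN.

To get the data out of your DataFrame, you can write something like this: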

date = df.index.tolist() 
consumption = df[df.columns[0]].values
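The resample, dropna, and extraction steps can be sketched end to end on a tiny example. The readings and the column name watt below are invented, and the sketch assumes a recent pandas where the how= argument is no longer accepted, so the aggregation is chained as a method instead.

```python
import pandas as pd

# Hypothetical readings with one calendar day missing entirely (assumption:
# a similar gap in the real file is what produces the NaN rows).
idx = pd.to_datetime(
    ["2011-04-18 13:00", "2011-04-18 13:30", "2011-04-20 09:00"]
)
df = pd.DataFrame({"watt": [10.0, 20.0, 5.0]}, index=idx)

# Chained form of resample(..., how='sum'). min_count=1 keeps days without
# any reading as NaN instead of summing them to 0.
daily = df.resample("1D").sum(min_count=1)

daily = daily.dropna()                         # drops the empty 2011-04-19 bucket
date = daily.index.tolist()                    # list of pandas Timestamp objects
consumption = daily[daily.columns[0]].values   # NumPy float array
```

Whether to drop or impute the empty days is a modelling choice; dropping is simply the option shown in the answer.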

Next, you have already parsed the dates in pd.read_csv, so the following part of your code will not work.

import datetime 
for x in range(len(date)): 
    date[x]=datetime.datetime.strptime(date[x], '%Y-%m-%d %H:%M:%S')

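Finally, simply feeding the raw date together with consumption into KMeans will not produce very useful results. You should consider converting date into numeric data, for example weekday dummies.

To use LabelEncoder: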

date = df.index.tolist() 

date = [x.strftime('%Y-%m-%d') for x in date] 

from sklearn.preprocessing import LabelEncoder 

encoder = LabelEncoder() 
date_numeric = encoder.fit_transform(date) 

# feed date_numeric together with consumption into your KMeans 
# you must use .T to transpose X, because sklearn treats each column as a feature 
X = np.array([date_numeric, consumption]).T
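Why the transpose matters can be checked by looking at array shapes. This is a small sketch with stand-in values rather than the real data. It also reproduces the shape=(2, 0) reported in the first traceback.

```python
import numpy as np

date_numeric = np.array([0, 1, 2, 3])            # stand-in for encoded dates
consumption = np.array([7.2, 9.8, 3.1, 12.4])    # stand-in daily totals

stacked = np.array([date_numeric, consumption])
print(stacked.shape)   # (2, 4): read by sklearn as 2 samples with 4 features

X = stacked.T
print(X.shape)         # (4, 2): 4 samples (days) with 2 features each

# Two empty lists give the shape from the first error message.
empty = np.array([[], []])
print(empty.shape)     # (2, 0)
```

sklearn estimators expect input of shape (n_samples, n_features), so each day has to be a row.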

For your plotting problem:

fig, ax = plt.subplots(figsize=(10,8)) 

colors = ["b.","r.","g."] 

for i in range(len(X)): 
    print("coordinate:", encoder.inverse_transform([int(X[i, 0])])[0], X[i, 1], "label:", labels[i]) 
    ax.plot(X[i][0], X[i][1], colors[labels[i]], markersize = 10) 

ax.scatter(centroids[:, 0],centroids[:, 1], marker = "x", s=150, linewidths = 5, zorder = 10) 
a = np.arange(0, len(X), 5) 
ax.set_xticks(a) 
ax.set_xticklabels(encoder.inverse_transform(a.astype(int)))
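Since the question asks for three groups (low, medium, high), one alternative worth considering is to cluster on consumption alone, so that the date does not influence the distances. The following is only a sketch under that assumption, with made-up daily totals; it is not the approach used in the code above.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up daily totals (assumption: not taken from total_watt.csv).
consumption = np.array([1.0, 1.2, 0.9, 5.0, 5.3, 4.8, 10.0, 10.4, 9.7])

# KMeans needs a 2-D array, so the single feature becomes one column.
X = consumption.reshape(-1, 1)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Cluster ids are arbitrary, so rank the centroids to name the groups.
order = np.argsort(kmeans.cluster_centers_.ravel())
names = {int(cluster_id): name
         for cluster_id, name in zip(order, ["low", "medium", "high"])}
groups = [names[int(label)] for label in kmeans.labels_]
print(groups)
```

The dates can still be used on the x-axis when plotting; they are simply left out of the feature matrix.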

Source

2015-07-10 09:00:01

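Thank you very much! I just fixed my code! But I still get another error... I edited my question. It would be great if you could check it! Also, thank you very much for the advice: "You should consider converting date into numeric data, for example weekday dummies." But I have to divide the values for each day into 3 groups: low, medium, and high energy consumption......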

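@SuzukiSoma The new error occurs because 'date' holds 'datetime' objects, and 'sklearn' only accepts numeric or string data (you can see this in the last line of the error message). If you want to convert 'date' to string objects, use this code: date = [x.strftime('%Y-%m-%d') for x in date]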

It says.... ValueError: invalid literal for float(): 2011-04-18. But I think '%Y-%m-%d' is correct.
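The ValueError in the last comment can be reproduced without sklearn. This sketch uses only the standard library. The ordinal conversion at the end is one possible numeric encoding, shown as an assumption rather than as what the answer recommends.

```python
import datetime

day = datetime.datetime(2011, 4, 18)
text = day.strftime('%Y-%m-%d')   # '2011-04-18'

# The format string is fine, but the result is still text, and float()
# cannot parse a date string. That is what the ValueError reports.
try:
    float(text)
    converted = True
except ValueError:
    converted = False

# One numeric alternative: the proleptic Gregorian ordinal (a day count).
as_number = float(day.toordinal())
```

Any encoding that maps each date to a number would get past the float conversion; LabelEncoder, as used in the answer, is another.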
