2015-07-10

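I am trying to cluster my CSV data using matplotlib and k-means. With k-means I get an error: found array with 0 features.

My CSV data is about energy consumption. https://github.com/camenergydatalab/EnergyDataSimulationChallenge/blob/master/challenge2/data/total_watt.csv

I want to divide each day's values into 3 groups: low, medium, and high energy consumption.

Here is my code.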

import numpy as np 
import matplotlib.pyplot as plt 
from matplotlib import style 
style.use('ggplot') 
import pandas as pd 
from sklearn.cluster import KMeans 



MY_FILE='total_watt.csv' 
date = [] 
consumption = [] 


df = pd.read_csv(MY_FILE, parse_dates=[0], index_col=[0]) 
df = df.resample('1D', how='sum') 


for row in df: 
    if len(row) ==2 : 
     date.append(row[0]) 
     consumption.append(row[1]) 


import datetime 
for x in range(len(date)): 
    date[x]=datetime.datetime.strptime(date[x], '%Y-%m-%d %H:%M:%S') 

X = np.array([date, consumption]) 
kmeans = KMeans(n_clusters=3) 
kmeans.fit(X) 

centroids = kmeans.cluster_centers_ 
labels = kmeans.labels_ 

print(centroids) 
print(labels) 

colors = ["b.","g.","r."] 

for i in range(len(X)): 
    print("coordinate:",X[i], "label:", labels[i]) 
    plt.plot(X[i][0], X[i][1], colors[labels[i]], markersize = 10) 

plt.scatter(centroids[:, 0],centroids[:, 1], marker = "x", s=150, linewidths = 5, zorder = 10) 

plt.show() 

However, when I run this code, I get the following error:

(DataVizProj)Soma-Suzuki:Soma Suzuki$ python 4.clusters.py 
Traceback (most recent call last): 
    File "4.clusters.py", line 31, in <module> 
    kmeans.fit(X) 
    File "/Users/Suzuki/Envs/DataVizProj/lib/python2.7/site-packages/sklearn/cluster/k_means_.py", line 785, in fit 
    X = self._check_fit_data(X) 
    File "/Users/Suzuki/Envs/DataVizProj/lib/python2.7/site-packages/sklearn/cluster/k_means_.py", line 755, in _check_fit_data 
    X = check_array(X, accept_sparse='csr', dtype=np.float64) 
    File "/Users/Suzuki/Envs/DataVizProj/lib/python2.7/site-packages/sklearn/utils/validation.py", line 367, in check_array 
    % (n_features, shape_repr, ensure_min_features)) 
ValueError: Found array with 0 feature(s) (shape=(2, 0)) while a minimum of 1 is required. 

How can I cluster my CSV data correctly?

EDIT -----------------------------------------------------

Here is my new code. Thank you!

import numpy as np 
import matplotlib.pyplot as plt 
from matplotlib import style 
style.use('ggplot') 
import pandas as pd 
from sklearn.cluster import KMeans 



MY_FILE='total_watt.csv' 
date = [] 
consumption = [] 


df = pd.read_csv(MY_FILE, parse_dates=[0], index_col=[0]) 
df = df.resample('1D', how='sum') 
df = df.dropna() 

date = df.index.tolist() 
consumption = df[df.columns[0]].values 



X = np.array([date, consumption]) 
kmeans = KMeans(n_clusters=3) 
kmeans.fit(X) 

centroids = kmeans.cluster_centers_ 
labels = kmeans.labels_ 

print(centroids) 
print(labels) 

colors = ["b.","g.","r."] 

for i in range(len(X)): 
    print("coordinate:",X[i], "label:", labels[i]) 
    plt.plot(X[i][0], X[i][1], colors[labels[i]], markersize = 10) 

plt.scatter(centroids[:, 0],centroids[:, 1], marker = "x", s=150, linewidths = 5, zorder = 10) 

plt.show() 

And the new error...

(DataVizProj)Soma-Suzuki:Soma Suzuki$ python 4.clusters.py 
Traceback (most recent call last): 
    File "4.clusters.py", line 26, in <module> 
    kmeans.fit(X) 
    File "/Users/Suzuki/Envs/DataVizProj/lib/python2.7/site-packages/sklearn/cluster/k_means_.py", line 785, in fit 
    X = self._check_fit_data(X) 
    File "/Users/Suzuki/Envs/DataVizProj/lib/python2.7/site-packages/sklearn/cluster/k_means_.py", line 755, in _check_fit_data 
    X = check_array(X, accept_sparse='csr', dtype=np.float64) 
    File "/Users/Suzuki/Envs/DataVizProj/lib/python2.7/site-packages/sklearn/utils/validation.py", line 344, in check_array 
    array = np.array(array, dtype=dtype, order=order, copy=copy) 
TypeError: float() argument must be a string or a number 

EDITED2 ----------------------------- ------------

Thank you, Jianxun!!

I finally succeeded in clustering my CSV data! Thank you very much!

import numpy as np 
import matplotlib.pyplot as plt 
from matplotlib import style 
style.use('ggplot') 
import pandas as pd 
from sklearn.cluster import KMeans 



MY_FILE='total_watt.csv' 
date = [] 
consumption = [] 


df = pd.read_csv(MY_FILE, parse_dates=[0], index_col=[0]) 
df = df.resample('1D', how='sum') 
df = df.dropna() 

date = df.index.tolist() 
date = [x.strftime('%Y-%m-%d') for x in date] 
from sklearn.preprocessing import LabelEncoder 

encoder = LabelEncoder() 
date_numeric = encoder.fit_transform(date) 
consumption = df[df.columns[0]].values 

X = np.array([date_numeric, consumption]).T 




kmeans = KMeans(n_clusters=3) 
kmeans.fit(X) 

centroids = kmeans.cluster_centers_ 
labels = kmeans.labels_ 

print(centroids) 
print(labels) 

colors = ["b.","r.","g."] 

for i in range(len(X)): 
    print("coordinate:",X[i], "label:", labels[i]) 
    plt.plot(X[i][0], X[i][1], colors[labels[i]], markersize = 10) 

plt.scatter(centroids[:, 0],centroids[:, 1], marker = "x", s=150, linewidths = 5, zorder = 10) 

plt.show() 

[Image: resulting scatter plot of the clusters]

But as you can see, the x-axis does not reflect the time, even though we set it correctly....


If you want to visualize the distribution of consumption, you should consider using a histogram.

1 Answer

First problem:

for row in df: 
    if len(row) ==2 : 
     date.append(row[0]) 
     consumption.append(row[1]) 

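This unexpectedly leaves you with empty lists date and consumption, because for row in df actually loops over the column labels rather than over the rows. That is why the error message says the array has 0 features.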
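A minimal sketch of the iteration behaviour described above. The column name consumption and the three sample values are made up for illustration; they are not taken from total_watt.csv.

```python
import pandas as pd

# Hypothetical stand-in for the resampled data: one value column, date index.
df = pd.DataFrame(
    {"consumption": [10.0, 20.0, 30.0]},
    index=pd.date_range("2011-04-18", periods=3, freq="D"),
)

# Iterating a DataFrame directly yields its column labels, not its rows.
labels_seen = [item for item in df]
print(labels_seen)

# To walk the rows, iterate over (index, value) pairs explicitly.
date = []
consumption = []
for timestamp, value in df["consumption"].items():
    date.append(timestamp)
    consumption.append(value)
print(len(date), len(consumption))
```

In the original loop each item is a column label (a string), so the len(row) == 2 test never passes and both lists stay empty.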

Also, I see there are 2 NaN values in consumption, so you need df = df.dropna() (or you need to impute these missing values), because sklearn does not tolerate NaN.
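The two options mentioned above (dropping or imputing) might look like this. The data is invented, and mean imputation is only one possible choice, assumed here for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical daily totals with two missing days.
df = pd.DataFrame(
    {"consumption": [10.0, np.nan, 30.0, np.nan, 50.0]},
    index=pd.date_range("2011-04-18", periods=5, freq="D"),
)

# Option 1: drop the days that have no reading.
dropped = df.dropna()

# Option 2: impute the gaps, here with the column mean.
imputed = df.fillna(df["consumption"].mean())

print(len(dropped))
print(imputed["consumption"].tolist())
```

Dropping keeps only real measurements, while imputing keeps every day in the result at the cost of inventing values.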

To get the data out of your DataFrame, you can write something like this:

date = df.index.tolist() 
consumption = df[df.columns[0]].values 

Next, you have already parsed the dates in pd.read_csv, so the following part of your code will not work.

import datetime 
for x in range(len(date)): 
    date[x]=datetime.datetime.strptime(date[x], '%Y-%m-%d %H:%M:%S') 
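A small sketch of why the strptime loop fails once the dates are already parsed. The two CSV rows are hypothetical and only mimic a timestamp,value layout.

```python
import datetime
import io

import pandas as pd

# Hypothetical two-row sample in a "timestamp,value" layout.
csv_text = "2011-04-18 13:22:00,925.8\n2011-04-18 13:52:00,483.3\n"
df = pd.read_csv(io.StringIO(csv_text), header=None, parse_dates=[0], index_col=0)

# The index entries are already pandas Timestamps (a datetime subclass).
first = df.index[0]
print(type(first))

# strptime expects a string, so passing a Timestamp raises TypeError.
try:
    datetime.datetime.strptime(first, "%Y-%m-%d %H:%M:%S")
    strptime_failed = False
except TypeError:
    strptime_failed = True
print(strptime_failed)
```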

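Finally, simply feeding the raw date and consumption into KMeans won't produce very useful results. You should consider converting date into numeric data, for example weekday dummies. To use LabelEncoder: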

date = df.index.tolist() 

date = [x.strftime('%Y-%m-%d') for x in date] 

from sklearn.preprocessing import LabelEncoder 

encoder = LabelEncoder() 
date_numeric = encoder.fit_transform(date) 

# feed date_numeric with consumption into your KMeans 
# must use .T to transpose your X, sklearn think each column is a feature 
X = np.array([date_numeric, consumption]).T 
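As an alternative to LabelEncoder, the weekday dummies suggested earlier could be sketched as follows. The data and the consumption column name are assumptions, and np.column_stack is used here instead of the transpose only to make the (samples, features) layout explicit.

```python
import numpy as np
import pandas as pd

# Hypothetical two weeks of daily totals.
df = pd.DataFrame(
    {"consumption": np.arange(14, dtype=float)},
    index=pd.date_range("2011-04-18", periods=14, freq="D"),
)

# One 0/1 column per weekday (Monday=0 ... Sunday=6).
weekday_dummies = pd.get_dummies(
    pd.Series(df.index.dayofweek), prefix="weekday"
).astype(float)

# Rows are samples (days), columns are features, which is the layout sklearn expects.
X = np.column_stack([weekday_dummies.to_numpy(), df["consumption"].to_numpy()])
print(X.shape)
```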

For your plotting question:

fig, ax = plt.subplots(figsize=(10,8)) 

colors = ["b.","r.","g."] 

for i in range(len(X)): 
    # inverse_transform expects an array-like, so wrap the single code in a list
    day = encoder.inverse_transform([int(X[i, 0])])[0] 
    print("coordinate:", day, X[i, 1], "label:", labels[i]) 
    ax.plot(X[i][0], X[i][1], colors[labels[i]], markersize = 10) 

ax.scatter(centroids[:, 0],centroids[:, 1], marker = "x", s=150, linewidths = 5, zorder = 10) 
# label every 5th tick with the date that the encoded value stands for
a = np.arange(0, len(X), 5) 
ax.set_xticks(a) 
ax.set_xticklabels(encoder.inverse_transform(a.astype(int))) 
plt.show() 

[Image: resulting plot with date labels on the x-axis]
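Since the question asks for low, medium, and high consumption days, one possible approach (not the method used above) is to cluster on the daily consumption alone and name the clusters by the rank of their centroids. This is a rough sketch on synthetic data: the readings, the consumption column name, and the n_init and random_state settings are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical half-hourly readings over nine days, standing in for total_watt.csv.
index = pd.date_range("2011-04-18", periods=9 * 48, freq="30min")
per_reading = np.repeat(
    [100.0, 110.0, 120.0, 500.0, 510.0, 520.0, 900.0, 910.0, 920.0], 48
)
raw = pd.DataFrame({"consumption": per_reading}, index=index)

# Newer pandas spells resample('1D', how='sum') as resample('1D').sum().
daily = raw.resample("1D").sum().dropna()

# One feature (daily consumption), one row per day.
X = daily[["consumption"]].to_numpy()
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Cluster ids are arbitrary, so rank the centroids to name the groups.
order = np.argsort(kmeans.cluster_centers_.ravel())
names = {int(cluster_id): name for cluster_id, name in zip(order, ["low", "medium", "high"])}
daily["level"] = [names[int(label)] for label in kmeans.labels_]
print(daily)
```

Leaving the date out of the feature matrix keeps the grouping driven by consumption only. The date index is still available afterwards for plotting the labelled days on a time axis.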


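Thank you very much! I just fixed my code, but I still get another error... I edited my question. It would be great if you could check it! Also, thank you very much for the advice: "You should consider converting date into numeric data, for example weekday dummies." But I have to divide each day's values into 3 groups: low, medium, and high energy consumption......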


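@SuzukiSoma The new error is because 'date' holds 'datetime' objects, and 'sklearn' only accepts numeric or string data. (You can see this in the last line of the error message.) If you want to convert 'date' to string objects, use this code: 'date = [x.strftime('%Y-%m-%d') for x in date]'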


It says.... ValueError: invalid literal for float(): 2011-04-18 But I think '%Y-%m-%d' is correct.
