Adding a column to a dataset in Python

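I want to add my predicted data back to my original dataset in Python. I think I should be using pandas with assign and pd.DataFrame, but after reading all the documentation I still can't work out how to write the code (sorry, I'm new and have only just started learning to code). My code is below; I only need help with the code that adds my predictions back to the dataset. Thanks for your help!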

# Importing the libraries 
import numpy as np 
import matplotlib.pyplot as plt 
import pandas as pd 

# Importing the dataset 
dataset = pd.read_csv('Social_Network_Ads.csv') 
X = dataset.iloc[:, [2, 3]].values 
y = dataset.iloc[:, 4].values 

# Splitting the dataset into the Training set and Test set 
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25,  
random_state = 0) 

# Feature Scaling X_train and X_test 
from sklearn.preprocessing import StandardScaler 
sc = StandardScaler() 
X_train = sc.fit_transform(X_train) 
X_test = sc.transform(X_test) 

# Feature scaling all independent variables used to build the model 
whole_dataset = sc.transform(X) 

# Fitting classifier to the Training set 
# Create your Naive Bayes here 
from sklearn.naive_bayes import GaussianNB 
classifier = GaussianNB() 
classifier.fit(X_train, y_train) 

# Predicting the Test set results 
y_pred = classifier.predict_proba(X_test) 

# Predicting the results for the whole dataset 
y_pred2 = classifier.predict_proba(whole_dataset) 

# Add y_pred2 predictions back to the dataset 
???

Source

2017-06-15 zipline86

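Now that I look at what you are trying to do, I think you have misunderstood what is happening. You split the dataset into training and test data. You then train on the training data and predict on the test data. You then try to assign the result back to the original dataset with all of its rows. For example, you have 400 rows in the dataset but only 100 rows in y_pred, so you can't assign it, because the lengths differ. What you want to do is 'y_pred = classifier.predict_proba(X)' and then assign it back: 'dataset['predict_class_1'], dataset['predict_class_2'] = y_pred[:,0], y_pred[:,1]' – EdChum

Thanks so much, I'll give it a try! :) I changed the code slightly and it now predicts for all 400 rows. I can't upload the data file here, but it can be downloaded from https://www.superdatascience.com/machine-learning/ in the Section 18 Naive Bayes zip file. The csv file is called Social_Network_Ads.csv. I hope I can get it to work :) – zipline86

@EdChum It worked! Thank you! – zipline86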

You can just do dataset['prediction'] = y_pred to add a new column.

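Pandas supports a simple syntax for adding new columns. Here it adds the new column and will probably take a view of the NumPy array returned by sklearn, so it should be nice and fast.

Edit

Looking at your code and data, you have misunderstood what train_test_split does. It splits the original dataset, which has 400 rows, 3/4 to 1/4, so your X train data contains 300 rows and the test data contains 100 rows. You then try to assign the result back to your original dataset, which has 400 rows. Firstly, the number of rows doesn't match. Secondly, what predict_proba returns is a matrix of the predicted probabilities for each class. So what you want to do after training is predict on the original dataset, then sub-select each column and assign them back as two new columns: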

# the classifier was trained on scaled features, so scale X the same way first
y_pred = classifier.predict_proba(sc.transform(X))

Now assign this back:

dataset['predict_class_1'],dataset['predict_class_2'] = y_pred[:,0],y_pred[:,1]
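
The steps above can be sketched end to end as follows. This is only an illustration under stated assumptions: Social_Network_Ads.csv is not available here, so the DataFrame is a made-up stand-in with 400 rows, and the column names feature_a, feature_b and target are hypothetical. The features are scaled with the scaler fitted on the training rows before predicting, because the classifier in the question was trained on scaled data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the real CSV: 400 rows, two features, a 0/1 target.
rng = np.random.default_rng(0)
dataset = pd.DataFrame({
    "feature_a": rng.integers(18, 60, size=400),
    "feature_b": rng.integers(15000, 150000, size=400),
})
dataset["target"] = (dataset["feature_a"] > 40).astype(int)

X = dataset[["feature_a", "feature_b"]].values
y = dataset["target"].values

# 400 rows split 3/4 to 1/4: 300 training rows and 100 test rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
print(X_train.shape, X_test.shape)

# Fit the scaler and the classifier on the training rows only.
sc = StandardScaler()
classifier = GaussianNB()
classifier.fit(sc.fit_transform(X_train), y_train)

# Predict on all 400 rows so the result lines up with the dataset row for row.
y_pred = classifier.predict_proba(sc.transform(X))

# predict_proba returns one column per class, so assign each column separately.
dataset["predict_class_1"] = y_pred[:, 0]
dataset["predict_class_2"] = y_pred[:, 1]
print(dataset.head())
```

Because y_pred has one row per row of dataset, the assignment succeeds, and the two new columns sum to 1 in every row.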

Source

2017-06-15 08:42:25 EdChum

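I tried that, but then I got this error: ValueError: Wrong number of items passed 2, placement implies 1. Any idea why this is happening? Thanks! – zipline86

You need to add your raw data and code to your question so that we can reproduce this – EdChum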
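
One plausible cause of that ValueError, assuming the two-column predict_proba output was assigned to a single column name (the thread does not confirm which line raised it), can be reproduced with made-up numbers. The array proba below is a hypothetical stand-in for the classifier output, and the exact wording of the error message depends on the pandas version.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3]})

# Made-up probabilities shaped like predict_proba output: one column per class.
proba = np.array([[0.9, 0.1], [0.4, 0.6], [0.2, 0.8]])

try:
    # Two columns of values cannot be stored under one column name.
    df["prediction"] = proba
    raised = False
except ValueError as err:
    raised = True
    print("ValueError:", err)

# Selecting one class gives a 1-D array with one value per row, which fits.
df["prediction"] = proba[:, 1]
print(df)
```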

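There are several solutions, and EdChum's answer has already mentioned one. As far as I know, pandas has two other methods for this: df.assign() and df.insert().

Since you haven't provided the data you are using, here is a very simple example.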

import pandas as pd 
import numpy as np 
np.random.seed(1) 
df = pd.DataFrame(np.random.randn(10), columns=['raw']) 
# exponents corrected to match the column names and the output shown below
df = df.assign(cube_raw=df['raw']**3) 
df.insert(1,'square_raw',df['raw']**2) 

df 
        raw  square_raw    cube_raw
0  1.624345    2.638498    4.285832
1 -0.611756    0.374246   -0.228947
2 -0.528172    0.278965   -0.147342
3 -1.072969    1.151262   -1.235268
4  0.865408    0.748930    0.648130
5 -2.301539    5.297080  -12.191435
6  1.744812    3.044368    5.311849
7 -0.761207    0.579436   -0.441071
8  0.319039    0.101786    0.032474
9 -0.249370    0.062186   -0.015507

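Just remember that df.assign() does not work in place, so you should assign its result back to a variable.

Personally, I like df.insert() best, because it lets you specify the position at which the column is inserted (with the loc parameter).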

Source

2017-06-15 09:21:53 CDtoday

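I tried creating df = dataset and then calling df.assign(y_pred), but then I got this TypeError: assign() takes 1 positional argument but 2 were given. Any idea how I can fix this? Thanks! – zipline86

@zipline86 The format of 'df.assign()' should be like 'df.assign(_varname_ = content)'. You may want to check the links in the answer for more details. – CDtoday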
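
A minimal sketch of the keyword form described in the comment above, applied to class probabilities. The array proba and the column names are made up for illustration; in the question's code the array would be the output of predict_proba.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3]})

# Made-up probabilities shaped like predict_proba output: one column per class.
proba = np.array([[0.9, 0.1], [0.4, 0.6], [0.2, 0.8]])

# assign() takes column_name=values keywords and returns a new DataFrame,
# so the result is bound back to df.
df = df.assign(predict_class_1=proba[:, 0], predict_class_2=proba[:, 1])

# insert() changes df in place; loc=0 places the new column first.
df.insert(0, "predicted_label", proba.argmax(axis=1))

print(df.columns.tolist())
```

Passing the array positionally, as in df.assign(y_pred), fails because assign() accepts only keyword arguments: each keyword becomes the name of a new column.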
