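How do I make a (yes/no or 1-0) decision with a random forest?
This is the data set from the Kaggle Titanic competition (train and test csv files). Each file has the passengers' features, such as ID, sex, age, and so on. The train file has a "Survived" column with 0 and 1 values. The test file lacks the Survived column because it has to be predicted. Here is my simple code, which uses a random forest to give me a starting benchmark: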
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
import random
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_curve, auc
train=pd.read_csv('train.csv')
test=pd.read_csv('test.csv')
train['Type']='Train' #Create a flag for Train and Test Data set
test['Type']='Test'
fullData = pd.concat([train,test],axis=0) #Combined both Train and Test Data set
ID_col = ['PassengerId']
target_col = ["Survived"]
cat_cols = ['Name','Ticket','Sex','Cabin','Embarked']
num_cols= ['Pclass','Age','SibSp','Parch','Fare']
other_col=['Type'] #Test and Train Data set identifier
num_cat_cols = num_cols+cat_cols # Combined numerical and Categorical variables
for var in num_cat_cols:
    if fullData[var].isnull().any():
        fullData[var+'_NA'] = fullData[var].isnull()*1
#Impute numerical missing values with mean
#(no inplace=True here: with it, fillna returns None and the assignment would not store the filled values)
fullData[num_cols] = fullData[num_cols].fillna(fullData[num_cols].mean())
#Impute categorical missing values with -9999
fullData[cat_cols] = fullData[cat_cols].fillna(value = -9999)
#create label encoders for categorical features
for var in cat_cols:
    number = LabelEncoder()
    fullData[var] = number.fit_transform(fullData[var].astype('str'))
#.copy() so that columns added later do not trigger SettingWithCopyWarning
train=fullData[fullData['Type']=='Train'].copy()
test=fullData[fullData['Type']=='Test'].copy()
train['is_train'] = np.random.uniform(0, 1, len(train)) <= .75
Train, Validate = train[train['is_train']==True], train[train['is_train']==False]
features=list(set(list(fullData.columns))-set(ID_col)-set(target_col)-set(other_col))
x_train = Train[list(features)].values
y_train = Train["Survived"].values
x_validate = Validate[list(features)].values
y_validate = Validate["Survived"].values
x_test=test[list(features)].values
Train[list(features)]
#*************************
from sklearn import tree
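random.seed(100) #note: this seeds Python's random module only, not NumPy or scikit-learn
rf = RandomForestClassifier(n_estimators=1000, random_state=100) #random_state makes the forest reproducible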
rf.fit(x_train, y_train)
status = rf.predict_proba(x_validate)
fpr, tpr, _ = roc_curve(y_validate, status[:,1]) #metrics. added by me
roc_auc = auc(fpr, tpr)
print(roc_auc)
final_status = rf.predict_proba(x_test)
test["Survived2"]=final_status[:,1]
test['my prediction']=np.where(test.Survived2 > 0.6, 1, 0)
test
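As you can see, final_status gives the probability of survival. I would like to know how to get a yes/no (1 or 0) answer from it. The simplest thing I could think of is to say that if the probability is greater than 0.6 the person survived, and died otherwise (the 'my prediction' column), but once I submitted the results, the predictions were not good at all.
I would appreciate any insights. Thanks.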
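For reference, here is a minimal sketch of the thresholding step described above. It runs on synthetic data from `make_classification` rather than the Titanic files, and the cutoff grid and the name `best_threshold` are illustrative assumptions, not part of the original code. It shows two ways to get 0/1 labels: calling `predict()` directly, and cutting the class-1 probability at a threshold compared on held-out data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Titanic features (assumption: not the real data).
X, y = make_classification(n_samples=600, n_features=8, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(X_train, y_train)

# Option 1: predict() returns hard 0/1 labels directly. It picks the class
# with the higher averaged probability, which for two classes amounts to
# a 0.5 cutoff.
labels_direct = rf.predict(X_val)

# Option 2: threshold the class-1 probability yourself, as in the question.
proba = rf.predict_proba(X_val)[:, 1]
labels_060 = (proba > 0.6).astype(int)

# Compare a few candidate cutoffs on the held-out validation set
# (hypothetical grid) and keep the one with the best accuracy.
thresholds = np.arange(0.30, 0.71, 0.05)
scores = [accuracy_score(y_val, (proba > t).astype(int)) for t in thresholds]
best_threshold = thresholds[int(np.argmax(scores))]

print("accuracy with predict():", accuracy_score(y_val, labels_direct))
print("accuracy with 0.6 cutoff:", accuracy_score(y_val, labels_060))
print("best cutoff on validation data:", best_threshold)
```

Accuracy is used for the comparison because the Titanic competition scores submissions on the share of correct 0/1 predictions; a cutoff chosen this way is only as reliable as the validation split it was tuned on.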
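Could you provide us with 'test.csv' and 'train.csv' so that we can run the code? –
Eric, I have already uploaded them. Please see the first line of my post. Just download them and the code will run. Thanks – user3709260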