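I am trying to run a logit regression from statsmodels in Python inside a for loop. On each iteration I append one row from the test data to my training DataFrame, rerun the regression, and store the results. A strange error is preventing me from testing my logit regression classifier.
Interestingly, the test data is not being appended correctly (I think this is what leads to the KeyError: 0 I get, but I welcome your opinion). I have tried importing two versions of the test data: one with the same labels as the training data, and another with no declared labels.
Here is my code: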
import pandas as pd
import numpy as np
import statsmodels.api as sm
import datetime
df_train = pd.read_csv('Adult-Incomes/train-labelled-final-variables-condensed-coded-countries-removed-unlabelled-income-to-the-left-relabelled-copy.csv')
print('Training set')
print(df_train.head(15))
train_cols = df_train.columns[1:]
logit = sm.Logit(df_train['Income'], df_train[train_cols])
result = logit.fit()
print("ODDS RATIO")
print(result.params)
print("RESULTS SUMMARY")
print(result.summary())
print("CONFIDENCE INTERVAL")
print(result.conf_int())
# append test data
print("PREDICTION PROCESS")
print("READING TEST DATA")
df_test = pd.read_csv('Adult-Incomes/test-final-variables-cleaned-coded-copy-relabelled.csv')
print("TEST DATA READ COMPLETE")
iteration_time = []
iteration_result = []
iteration_params = []
iteration_conf_int = []
df_train.to_pickle('train_iteration.pickle')
print(df_test.head())
print("Loop begins")
for row in range(0,len(df_test)):
    start_time = datetime.datetime.now()
    print("Loop iteration ", row, " in ", len(df_test), " rows")
    df_train = pd.read_pickle('train_iteration.pickle')
    print("pickle read")
    df_train.append(df_test[row])
    print("row ", row, " appended")
    train_cols = df_train.columns[1:]
    print("X variables extracted in new DataFrame")
    logit = sm.Logit(df_train['Income'], df_train[train_cols])
    print("Def logit reg eqn")
    result = logit.fit()
    print("fit logit reg eqn")
    iteration_result[row] = result.summary()
    print("logit result summary stored in array")
    iteration_params[row] = result.params
    print("logit params stored in array")
    iteration_conf_int[row] = result.conf_int()
    print("logit conf_int stored in array")
    df_train.to_pickle('train_iteration.pickle')
    print("exported to pickle")
    end_time = datetime.datetime.now()
    time_diff = start_time - end_time
    print("time for this iteration is ", time_diff)
    iteration_time[row] = time_diff
    print("ending iteration, starting next iteration of loop...")
print("Loop ends")
pd.DataFrame(iteration_result)
pd.DataFrame(iteration_time)
print (iteration_result.head())
print (iteration_time.head())
It prints up to this point:
Loop iteration 0 in 15060 rows
pickle read
But then it raises KeyError: 0
What am I doing wrong here? The version with labels:
   Income  Age  Workclass  Education  Marital_Status  Occupation  \
0       0    1          4          7               4           6
1       0    1          4          9               2           4
2       1    1          6         12               2          10
3       1    1          4         10               2           6
4       0    1          4          6               4           7
   Relationship  Race  Sex  Capital_gain  Capital_loss  Hours_per_week
0             3     2    0             0             0              40
1             0     4    0             0             0              50
2             0     4    0             0             0              40
3             0     2    0          7688             0              40
4             1     4    0             0             0              30
The version of the test data without labels:
   0  1  4   7  4.1   6  3  2  0.1   0.2  0.3  40
0  0  1  4   9    2   4  0  4    0     0    0  50
1  1  1  6  12    2  10  0  4    0     0    0  40
2  1  1  4  10    2   6  0  2    0  7688    0  40
3  0  1  4   6    4   7  1  4    0     0    0  30
4  1  2  2  15    2   9  0  4    0  3103    0  32
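In both cases, whether I use labelled or unlabelled training data with the version of the test data whose labels match the training data, I still get the same error at the same point.
Can anyone guide me on how to proceed?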
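Separately from the `KeyError`, two other statements in the posted loop look likely to fail or mislead once the append works: assigning to `iteration_result[row]` on an empty list raises `IndexError`, and `start_time - end_time` yields a negative duration. A minimal sketch in plain Python; `fake_fit` is a hypothetical placeholder for the statsmodels fit, not a real API.

```python
import datetime

def fake_fit(row):
    # Hypothetical stand-in for logit.fit(); returns a dummy result.
    return {"row": row}

iteration_result = []
iteration_time = []

for row in range(3):
    start_time = datetime.datetime.now()
    result = fake_fit(row)
    # An empty list has no index `row` yet, so append instead of assigning.
    iteration_result.append(result)
    end_time = datetime.datetime.now()
    # End minus start gives the elapsed time with the expected sign.
    iteration_time.append(end_time - start_time)

print(len(iteration_result), iteration_time)
```

The same pattern would apply to `iteration_params` and `iteration_conf_int`.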
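Update: here is the full error message (the first three lines are output from the print statements; the error starts on the fourth line):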
Loop begins
Loop iteration 0 in 15060 rows
pickle read
Traceback (most recent call last):
File "<ipython-input-10-1f56d5243e43>", line 1, in <module>
runfile('/media/deepak/Laniakea/Projects/Training/SPYDER/classifier/classifier_test2.py', wdir='/media/deepak/Laniakea/Projects/Training/SPYDER/classifier')
File "/usr/local/lib/python3.5/dist-packages/spyder/utils/site/sitecustomize.py", line 866, in runfile
execfile(filename, namespace)
File "/usr/local/lib/python3.5/dist-packages/spyder/utils/site/sitecustomize.py", line 102, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "/media/deepak/Laniakea/Projects/Training/SPYDER/classifier/classifier_test2.py", line 64, in <module>
df_train.append(df_test[row])
File "/usr/local/lib/python3.5/dist-packages/pandas/core/frame.py", line 2059, in __getitem__
return self._getitem_column(key)
File "/usr/local/lib/python3.5/dist-packages/pandas/core/frame.py", line 2066, in _getitem_column
return self._get_item_cache(key)
File "/usr/local/lib/python3.5/dist-packages/pandas/core/generic.py", line 1386, in _get_item_cache
values = self._data.get(item)
File "/usr/local/lib/python3.5/dist-packages/pandas/core/internals.py", line 3541, in get
loc = self.items.get_loc(item)
File "/usr/local/lib/python3.5/dist-packages/pandas/indexes/base.py", line 2136, in get_loc
return self._engine.get_loc(self._maybe_cast_indexer(key))
File "pandas/index.pyx", line 139, in pandas.index.IndexEngine.get_loc (pandas/index.c:4443)
File "pandas/index.pyx", line 161, in pandas.index.IndexEngine.get_loc (pandas/index.c:4289)
File "pandas/src/hashtable_class_helper.pxi", line 732, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13733)
File "pandas/src/hashtable_class_helper.pxi", line 740, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13687)
KeyError: 0
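The traceback points at `df_test[row]`. On a DataFrame, square brackets with a scalar select a column by label, so `df_test[0]` looks for a column named `0` and raises `KeyError: 0` when none exists. Below is a minimal sketch of a likely fix, assuming the goal is to add the test row at position `row`; the tiny frames are made up for illustration. Note also that `DataFrame.append` returned a new frame rather than modifying in place, and was removed in pandas 2.0, so `pd.concat` is used here.

```python
import pandas as pd

# Made-up stand-ins for the real training and test frames.
df_train = pd.DataFrame({"Income": [0, 1], "Age": [1, 1]})
df_test = pd.DataFrame({"Income": [1, 0], "Age": [2, 3]})

row = 0

# df_test[row] looks up a COLUMN labelled 0, which does not exist here.
try:
    df_test[row]
    column_lookup_failed = False
except KeyError:
    column_lookup_failed = True

# .iloc[[row]] selects the row at integer position `row` as a one-row DataFrame.
new_row = df_test.iloc[[row]]

# concat returns a new frame, so the result must be assigned back.
df_train = pd.concat([df_train, new_row], ignore_index=True)
print(df_train)
```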
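UPDATE: I get this as the last line printed by the print(df_train.std()) statement, after the standard deviations of all the columns: dtype: float64
So I guess my training DataFrame is being treated as float.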
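The `dtype: float64` line does not by itself show that the frame holds floats: `DataFrame.std()` returns a Series of standard deviations, and that result is floating point even when every column is an integer. A small sketch with made-up data; checking `df.dtypes` shows the column types directly.

```python
import pandas as pd

# Made-up integer-only frame.
df = pd.DataFrame({"Income": [0, 1, 1], "Age": [1, 2, 4]})

col_dtypes = df.dtypes   # dtype of each column in the frame
std_result = df.std()    # Series of per-column standard deviations

print(col_dtypes)
print(std_result)
```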
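I would prefer to start with the labelled data, because in the unlabelled data the first row is being assigned as the header; look at your unlabelled version of the test data. Could you paste the error you get when you use the labelled test data?
Hi, yes, I have added the error message to the question. Take a look.
This error occurs because, in the unlabelled test set, the first row is being read as the column headers. Could you try the append with the labelled test set and let us know the error? Also, when you load the labelled test set, please check that 'header = True' is present.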
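To illustrate the comment's point: with its defaults, `pd.read_csv` treats the first line of the file as column names, so a file without a header row loses its first record and gets data values as labels. A sketch, assuming the unlabelled file has the same column order as the labelled data; the in-memory CSV text is made up from rows shown in the question. As far as I know, `header` takes a row number or `None` rather than `True`, so `header=None` with `names=` is used for the unlabelled case.

```python
import io
import pandas as pd

cols = ["Income", "Age", "Workclass", "Education", "Marital_Status", "Occupation",
        "Relationship", "Race", "Sex", "Capital_gain", "Capital_loss", "Hours_per_week"]

# Two data rows and no header line.
csv_text = "0,1,4,7,4,6,3,2,0,0,0,40\n0,1,4,9,2,4,0,4,0,0,0,50\n"

# Default: the first data row is consumed as the header.
df_default = pd.read_csv(io.StringIO(csv_text))

# header=None keeps every row as data; names= supplies the labels.
df_named = pd.read_csv(io.StringIO(csv_text), header=None, names=cols)

print(df_default.columns.tolist())
print(df_named.head())
```

With matching column names in both frames, the appended test row lines up with the training columns instead of creating new ones.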