2016-11-29 45 views

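I am trying to run a logit regression from statsmodels in Python inside a for loop. On each iteration I append one row from the test data to my training DataFrame, re-run the regression, and store the results. A strange error is preventing me from testing my logit regression classifier.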

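Interestingly, the test data is not being appended correctly (I think this is what causes the KeyError: 0 I get, but I welcome your opinions). I have tried importing two versions of the test data: one with the same labels (column headers) as the training data, and one with no labels declared.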

Here is my code:

import pandas as pd 
import numpy as np 
import statsmodels.api as sm 
import datetime 

df_train = pd.read_csv('Adult-Incomes/train-labelled-final-variables-condensed-coded-countries-removed-unlabelled-income-to-the-left-relabelled-copy.csv') 
print('Training set') 
print(df_train.head(15)) 

train_cols = df_train.columns[1:] 
logit = sm.Logit(df_train['Income'], df_train[train_cols]) 
result = logit.fit() 

print("ODDS RATIO") 
print(result.params) 
print("RESULTS SUMMARY") 
print(result.summary()) 
print("CONFIDENCE INTERVAL") 
print(result.conf_int()) 

# append test data

print("PREDICTION PROCESS") 
print("READING TEST DATA") 
df_test = pd.read_csv('Adult-Incomes/test-final-variables-cleaned-coded-copy-relabelled.csv') 
print("TEST DATA READ COMPLETE") 

iteration_time = [] 
iteration_result = [] 
iteration_params = [] 
iteration_conf_int = [] 

df_train.to_pickle('train_iteration.pickle') 
print(df_test.head()) 

print("Loop begins") 

for row in range(0,len(df_test)): 
    start_time = datetime.datetime.now() 
    print("Loop iteration ", row, " in ", len(df_test), " rows") 

    df_train = pd.read_pickle('train_iteration.pickle') 
    print("pickle read") 
    df_train.append(df_test[row]) 
    print("row ", row, " appended") 
    train_cols = df_train.columns[1:] 
    print("X variables extracted in new DataFrame") 
    logit = sm.Logit(df_train['Income'], df_train[train_cols]) 
    print("Def logit reg eqn") 
    result = logit.fit() 
    print("fit logit reg eqn") 
    iteration_result[row] = result.summary() 
    print("logit result summary stored in array") 
    iteration_params[row] = result.params 
    print("logit params stored in array") 
    iteration_conf_int[row] = result.conf_int() 
    print("logit conf_int stored in array") 

    df_train.to_pickle('train_iteration.pickle') 
    print("exported to pickle") 

    end_time = datetime.datetime.now() 
    time_diff = start_time - end_time 
    print("time for this iteration is ", time_diff) 
    iteration_time[row] = time_diff 
    print("ending iteration, starting next iteration of loop...") 

print("Loop ends") 

pd.DataFrame(iteration_result) 
pd.DataFrame(iteration_time) 
print (iteration_result.head()) 
print (iteration_time.head()) 

It prints up to this point:

Loop iteration 0 in 15060 rows 
pickle read 

But then it raises KeyError: 0

What am I doing wrong here? The version of the data with labels:

   Income  Age  Workclass  Education  Marital_Status  Occupation  \
0       0    1          4          7               4           6
1       0    1          4          9               2           4
2       1    1          6         12               2          10
3       1    1          4         10               2           6
4       0    1          4          6               4           7

   Relationship  Race  Sex  Capital_gain  Capital_loss  Hours_per_week
0             3     2    0             0             0              40
1             0     4    0             0             0              50
2             0     4    0             0             0              40
3             0     2    0          7688             0              40
4             1     4    0             0             0              30

The version of the test data without labels:

   0  1  4   7  4.1   6  3  2  0.1   0.2  0.3  40
0  0  1  4   9    2   4  0  4    0     0    0  50
1  1  1  6  12    2  10  0  4    0     0    0  40
2  1  1  4  10    2   6  0  2    0  7688    0  40
3  0  1  4   6    4   7  1  4    0     0    0  30
4  1  2  2  15    2   9  0  4    0  3103    0  32

In both cases, whether I use the unlabelled test data or the version whose labels match the training data, I still get the same error at the same point.

Can anyone guide me on how to proceed?

Update: here is the full error message (the first three lines are printed output; the error starts on the fourth line):

Loop begins 
Loop iteration 0 in 15060 rows 
pickle read 
Traceback (most recent call last): 

    File "<ipython-input-10-1f56d5243e43>", line 1, in <module> 
    runfile('/media/deepak/Laniakea/Projects/Training/SPYDER/classifier/classifier_test2.py', wdir='/media/deepak/Laniakea/Projects/Training/SPYDER/classifier') 

    File "/usr/local/lib/python3.5/dist-packages/spyder/utils/site/sitecustomize.py", line 866, in runfile 
    execfile(filename, namespace) 

    File "/usr/local/lib/python3.5/dist-packages/spyder/utils/site/sitecustomize.py", line 102, in execfile 
    exec(compile(f.read(), filename, 'exec'), namespace) 

    File "/media/deepak/Laniakea/Projects/Training/SPYDER/classifier/classifier_test2.py", line 64, in <module> 
    df_train.append(df_test[row]) 

    File "/usr/local/lib/python3.5/dist-packages/pandas/core/frame.py", line 2059, in __getitem__ 
    return self._getitem_column(key) 

    File "/usr/local/lib/python3.5/dist-packages/pandas/core/frame.py", line 2066, in _getitem_column 
    return self._get_item_cache(key) 

    File "/usr/local/lib/python3.5/dist-packages/pandas/core/generic.py", line 1386, in _get_item_cache 
    values = self._data.get(item) 

    File "/usr/local/lib/python3.5/dist-packages/pandas/core/internals.py", line 3541, in get 
    loc = self.items.get_loc(item) 

    File "/usr/local/lib/python3.5/dist-packages/pandas/indexes/base.py", line 2136, in get_loc 
    return self._engine.get_loc(self._maybe_cast_indexer(key)) 

    File "pandas/index.pyx", line 139, in pandas.index.IndexEngine.get_loc (pandas/index.c:4443) 

    File "pandas/index.pyx", line 161, in pandas.index.IndexEngine.get_loc (pandas/index.c:4289) 

    File "pandas/src/hashtable_class_helper.pxi", line 732, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13733) 

    File "pandas/src/hashtable_class_helper.pxi", line 740, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13687) 

KeyError: 0 

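UPDATE: The last line printed by a print(df_train.std()) statement, after the standard deviation of every column, is dtype: float64. So I guess my training DataFrame is being treated as float.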


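I would rather start with the labelled data, because in the unlabelled data the first row is being assigned as the header; look at your unlabelled version of the test data. Can you paste the error you get when you use the labelled test data?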


Hi, yes, I have added the error message to the question. Take a look.


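This error occurs because, in the unlabelled test set, the first row is being read as the column headers. Can you try the append with the labelled test set and let us know the error? Also, when you load the labelled test set, check that 'header = True' is present.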
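The header problem described in the comment above can be reproduced on a small scale. This is only a sketch: the inline CSV text stands in for the unlabelled test file, and the `columns` list is copied from the training data shown in the question, on the assumption that both files share the same column order. Note that the `header` argument of `pd.read_csv` takes a row number or `None` rather than a boolean, so `header=None` together with `names=` is used here.

```python
import io
import pandas as pd

# Column names copied from the training data shown in the question.
columns = ["Income", "Age", "Workclass", "Education", "Marital_Status",
           "Occupation", "Relationship", "Race", "Sex", "Capital_gain",
           "Capital_loss", "Hours_per_week"]

# Stand-in for the unlabelled test file: two data rows, no header row.
raw = "0,1,4,7,4,6,3,2,0,0,0,40\n0,1,4,9,2,4,0,4,0,0,0,50\n"

# Default behaviour: the first data row is consumed as the header.
wrong = pd.read_csv(io.StringIO(raw))

# header=None keeps every row as data; names= supplies the labels.
right = pd.read_csv(io.StringIO(raw), header=None, names=columns)

print(len(wrong), list(wrong.columns)[:5])
print(len(right), list(right.columns)[:5])
```

The first frame loses a row and ends up with column names such as '4.1', which is the same pattern visible in the unlabelled test data in the question.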

Answer

1

I think I've got it... Instead of the following code:

df_train.append(df_test[row]) 
print("row ", row, " appended") 

rewrite it as:

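# append() returns a new DataFrame, so the result must be assigned back;
# ignore_index=True renumbers the rows, so no separate reset_index() is needed
df_train = df_train.append(df_test.iloc[row], ignore_index=True)
print("row ", row, " appended")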

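Let me know if this serves the purpose... It is important that the index is reset each time. Just one thing: if your test set is fairly large, retraining on every data point seen in the test set will be a computational disaster...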
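Why `df_test[row]` fails while `df_test.iloc[row]` works can be seen on a tiny frame. This is a sketch with made-up data, not the asker's files. It uses `pd.concat` rather than `append`, because `DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0; on the older pandas in the traceback, `append` with the result assigned back behaves the same way.

```python
import pandas as pd

# Made-up miniature frames standing in for df_train and df_test.
train = pd.DataFrame({"Income": [0, 1], "Age": [1, 1]})
test = pd.DataFrame({"Income": [1, 0], "Age": [2, 1]})

# test[0] looks for a COLUMN named 0, which does not exist here.
try:
    test[0]
    lookup_failed = False
except KeyError:
    lookup_failed = True

# test.iloc[[0]] selects the first ROW by position, kept as a one-row DataFrame.
first_row = test.iloc[[0]]

# concat returns a new frame; ignore_index=True renumbers the rows from 0.
train = pd.concat([train, first_row], ignore_index=True)

print(lookup_failed)
print(train)
```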

One piece of advice aside from this: if you really do want to train in near real time, try using batches or chunks of the test set...
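The batching idea could look roughly like the sketch below. The function name `refit_in_chunks`, the `chunk_size` parameter, and the `fit` callback are all hypothetical names introduced for illustration; `fit` is a placeholder for whatever model fit you run. Results are collected with `list.append`, since assigning to `results[i]` on an empty list would raise an IndexError.

```python
import pandas as pd

def refit_in_chunks(train, test, chunk_size, fit):
    """Append `test` to `train` chunk by chunk, calling `fit` after each chunk.

    `fit` is any callable that takes the grown training frame and returns
    a result object; it is a placeholder for the actual model fit.
    """
    results = []
    for start in range(0, len(test), chunk_size):
        chunk = test.iloc[start:start + chunk_size]
        train = pd.concat([train, chunk], ignore_index=True)
        # list.append grows the list one result per chunk
        results.append(fit(train))
    return train, results

# Made-up miniature data standing in for the real training and test sets.
train = pd.DataFrame({"Income": [0, 1, 0], "Age": [1, 2, 1]})
test = pd.DataFrame({"Income": [1, 0, 1, 1, 0], "Age": [2, 1, 2, 2, 1]})

# Dummy fit that only records the training size, to keep the sketch self-contained.
grown, sizes = refit_in_chunks(train, test, chunk_size=2, fit=len)
print(sizes)
```

With statsmodels, `fit` could be something like `lambda df: sm.Logit(df['Income'], df[df.columns[1:]]).fit(disp=0)`, mirroring the model set-up in the question. With a chunk size of 100, the 15060-row test set would need about 151 refits instead of 15060.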