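My current dataset contains about 28,000 observations and 35 features. My X matrix contains the first 34 features, and my y matrix contains the last (35th) feature, which is labelled HighLowMobility in the code below. I have built a neural network to classify high versus low mobility, but because of missing data points the accuracy of my algorithm is only 12%. The problem is that some of my features are missing a large number of data points. One way I worked around this was to fill the missing values with the column mean. That raised the accuracy to 56%, but I don't like the idea of using the mean for missing values. I am looking for another way to handle the missing values in my dataset.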
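import pandas as pd

# Load the data into a DataFrame
X = pd.read_csv('raw_data_for_edits.csv')
# Impute the missing values with the column means (numeric columns only)
X = X.fillna(X.mean(numeric_only=True))
# Drop the categorical columns
X = X.drop(['county_name', 'statename', 'stateabbrv'], axis=1)
# Collect the output in the y variable
y = X['HighLowMobility']
# Remove the target from X so it is not used as a feature
X = X.drop('HighLowMobility', axis=1)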
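Before choosing an imputation method, it may help to see how much is missing in each feature. The snippet below is only a rough sketch: the small `raw` frame is a made-up stand-in for the real file, and the 0.4 threshold is an arbitrary example value, not a recommendation.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real CSV, which is too large to reproduce here.
raw = pd.DataFrame({
    "perm_res_p25_kr24": [45.3, 42.6, 48.3, 42.6],
    "perm_res_p25_c1823": [np.nan, 29.7, np.nan, 38.3],
    "perm_res_p25_kr26": [40.3, 39.7, 44.7, np.nan],
})

# Fraction of missing entries per column, largest first.
missing_share = raw.isna().mean().sort_values(ascending=False)
print(missing_share)

# Columns whose missing fraction exceeds the example threshold of 0.4.
mostly_missing = missing_share[missing_share > 0.4].index.tolist()
print(mostly_missing)
```

Columns that are mostly empty might be better dropped than imputed, while columns with only a few gaps are more plausible candidates for interpolation.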
I can't copy and paste my entire dataset because it is too large, but here are the first 12 rows and 15 features:
birthcohort countyfipscode county_name cty_pop2000 statename state_id stateabbrv perm_res_p25_kr24 perm_res_p75_kr24 perm_res_p25_c1823 perm_res_p75_c1823 perm_res_p25_c19 perm_res_p75_c19 perm_res_p25_kr26 perm_res_p75_kr26
1980 1001 Autauga 43671 Alabama 1 AL 45.29939 60.7061 20.79255 66.0626 40.33072 61.38815
1981 1001 Autauga 43671 Alabama 1 AL 42.61835 63.21074 29.72325 75.26598 18.54342 54.94438 39.72811 65.40214
1982 1001 Autauga 43671 Alabama 1 AL 48.26985 62.34378 38.06422 72.25443 21.53552 59.08011 44.65976 63.69386
1983 1001 Autauga 43671 Alabama 1 AL 42.63371 56.42043 38.25876 80.4664 15.57722 57.13945 40.6005 61.02879
1984 1001 Autauga 43671 Alabama 1 AL 44.01634 62.27992 38.12383 73.74701 23.0881 55.17943 43.34503 62.40761
1985 1001 Autauga 43671 Alabama 1 AL 45.71784 61.31874 40.93386 83.06611 25.66557 72.2912 42.42057 62.00612
1986 1001 Autauga 43671 Alabama 1 AL 47.92037 59.65535 47.48409 72.49103 28.89066 63.85233 42.06915 59.60703
1987 1001 Autauga 43671 Alabama 1 AL 48.31079 54.04203 53.19901 84.53795 35.28359 71.83407
1988 1001 Autauga 43671 Alabama 1 AL 47.98552 59.42001 52.89273 85.28442 30.55523 67.43595
1980 1003 Baldwin 140415 Alabama 1 AL 42.46106 51.41415 19.86316 58.6601 41.89684 55.88935
1981 1003 Baldwin 140415 Alabama 1 AL 43.00288 55.10138 35.59233 76.98567 11.48056 40.79744 42.46521 57.31494
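Notice how the feature "perm_res_p25_c1823" has missing values. This becomes a problem for the accuracy of my algorithm. So what should I do about the missing values? I have read a bit about interpolation; is that what I should use? If so, how would I code it?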
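For reference, here is a minimal sketch of what interpolation could look like for data shaped like the sample above. It assumes each county (`countyfipscode`) forms its own series ordered by `birthcohort`, so gaps are filled from neighbouring cohorts of the same county only. The toy numbers and the helper name `interpolate_within_county` are made up for illustration and are not taken from the real file.

```python
import numpy as np
import pandas as pd

# Toy frame shaped like the sample above; the values are invented for illustration.
df = pd.DataFrame({
    "countyfipscode": [1001, 1001, 1001, 1001, 1003, 1003, 1003],
    "birthcohort": [1980, 1981, 1982, 1983, 1980, 1981, 1982],
    "perm_res_p25_c1823": [np.nan, 30.0, np.nan, 38.0, 20.0, np.nan, 36.0],
})


def interpolate_within_county(frame, columns):
    """Linearly interpolate each column along birthcohort, one county at a time."""
    out = frame.sort_values(["countyfipscode", "birthcohort"]).copy()
    out[columns] = out.groupby("countyfipscode")[columns].transform(
        # limit_direction="both" also fills gaps at the start or end of a
        # county's series, using the nearest observed value.
        lambda s: s.interpolate(method="linear", limit_direction="both")
    )
    return out


filled = interpolate_within_county(df, ["perm_res_p25_c1823"])
print(filled)
```

Note that `method="linear"` treats the rows as equally spaced, which is reasonable when cohorts are consecutive years. A county with no observed value at all for a column would stay NaN and would still need another strategy, such as a median fill or dropping the column.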