2014-10-22 28 views

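I'm running a logistic regression and am having trouble using Patsy's API to prepare the data once it is larger than a small sample. How do I use Patsy's API to prepare a large dataset?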

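Using the dmatrices function directly on a DataFrame leaves me with this abrupt error (note that after hitting this on my laptop I spun up an EC2 instance with 300 GB of RAM and got the same error):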

Traceback (most recent call last): 
File "My_File.py", line 22, in <module> 
    df, return_type="dataframe") 
File "/root/anaconda/lib/python2.7/site-packages/patsy/highlevel.py", line 297, in dmatrices 
NA_action, return_type) 
File "/root/anaconda/lib/python2.7/site-packages/patsy/highlevel.py", line 156, in do_highlevel_design 
return_type=return_type) 
File "/root/anaconda/lib/python2.7/site-packages/patsy/build.py", line 989, in build_design_matrices 
results.append(builder._build(evaluator_to_values, dtype)) 
File "/root/anaconda/lib/python2.7/site-packages/patsy/build.py", line 821, in _build 
m = DesignMatrix(np.empty((num_rows, self.total_columns), dtype=dtype), 
MemoryError 

So I combed through Patsy's docs and found this gem:

patsy.incr_dbuilder(formula_like, data_iter_maker, eval_env=0) 
    Construct a design matrix builder incrementally from a large data set. 

However, the method is sparsely documented, and the source code is mostly comments.

I have arrived at this code:

def iter_maker(): 
    with open("test.tsv", "r") as f: 
        reader = csv.DictReader(f, delimiter="\t") 
        for row in reader: 
            yield row 


y, dta = incr_dbuilders("s ~ C(x) + C(y):C(rgh) + \ 
C(z):C(f) + C(r):C(p) + C(q):C(w) + \ 
C(zr):C(rt) + C(ff):C(djjj) + C(hh):C(tt) + \ 
C(bb):lat + C(jj):lng + C(ee):C(bb) + C(qq):C(uu)", 
     iter_maker) 

df = dmatrix(dta, {}, 0, "drop", return_type="dataframe") 

but I receive PatsyError: Error evaluating factor: NameError: name 'ff' is not defined

This is thrown because _try_incr_builders (called from dmatrix) returns None on line 151 of highlevel.py.

What is the correct way to use these Patsy functions to prepare my data? Any examples or guidance you have would be helpful.

Answer


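y and dta are DesignInfo objects: they encode all the information needed to take a row of your data frame and convert it into a row of your design matrix. They do not, however, contain any actual data. To get a piece of your design matrix, you have to give them a piece of your data. To use them, you'll want to do something like: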

for data_chunk in iter_maker(): 
    y_chunk, design_chunk = dmatrices((y, dta), data_chunk, 
                                      NA_action="drop", return_type="dataframe") 
    # do something with y_chunk and design_chunk 
    # ... 
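
One detail the loop above leaves open is what each data_chunk should look like. The sketch below assumes that Patsy wants each chunk to be a dict-like object mapping column names to sequences of values, rather than the single-row dicts that csv.DictReader yields in the question, and that the iterator factory can be called more than once. The names `make_chunked_iter_maker`, `numeric_columns`, and `chunk_size` are hypothetical and are not part of Patsy. It is written for Python 3 and uses only the standard library.

```python
import csv
from itertools import islice


def make_chunked_iter_maker(path, numeric_columns=(), chunk_size=10000):
    """Return a zero-argument callable that yields column-oriented chunks.

    Each chunk is a dict mapping a column name to a list of values and
    holds at most `chunk_size` rows. Columns named in `numeric_columns`
    are converted to float; every other column stays a string.
    """
    numeric = set(numeric_columns)

    def iter_maker():
        # Reopen the file on every call so the data can be scanned repeatedly.
        with open(path, "r", newline="") as f:
            reader = csv.DictReader(f, delimiter="\t")
            while True:
                # Pull the next block of rows without loading the whole file.
                rows = list(islice(reader, chunk_size))
                if not rows:
                    break
                # Pivot the row dicts into one list per column.
                yield {
                    name: [float(r[name]) if name in numeric else r[name]
                           for r in rows]
                    for name in reader.fieldnames
                }

    return iter_maker
```

A maker built this way, for example `iter_maker = make_chunked_iter_maker("test.tsv", numeric_columns=["s", "lat", "lng"])`, could presumably be passed to incr_dbuilders in place of the original iter_maker and then reused in the loop above. Treating s, lat, and lng as the numeric columns is a guess based on the formula in the question; values read from a TSV file are strings, and string columns would likely be treated as categorical.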