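I'm running a logistic regression and am having trouble preparing my data with Patsy's API once the data is larger than a small sample. How can I use Patsy's API to prepare a large dataset?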
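Using the dmatrices function directly on a data frame, I am left with this abrupt error (note that I spun up an EC2 instance with 300GB of RAM after hitting this on my laptop, and got the same error):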
Traceback (most recent call last):
  File "My_File.py", line 22, in <module>
    df, return_type="dataframe")
  File "/root/anaconda/lib/python2.7/site-packages/patsy/highlevel.py", line 297, in dmatrices
    NA_action, return_type)
  File "/root/anaconda/lib/python2.7/site-packages/patsy/highlevel.py", line 156, in do_highlevel_design
    return_type=return_type)
  File "/root/anaconda/lib/python2.7/site-packages/patsy/build.py", line 989, in build_design_matrices
    results.append(builder._build(evaluator_to_values, dtype))
  File "/root/anaconda/lib/python2.7/site-packages/patsy/build.py", line 821, in _build
    m = DesignMatrix(np.empty((num_rows, self.total_columns), dtype=dtype),
MemoryError
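The last frame of the traceback shows Patsy allocating a dense array of shape (num_rows, total_columns), so a rough size estimate can explain why even 300GB is not enough. The sketch below is only back-of-the-envelope arithmetic: the row count and level counts are hypothetical, the 8-byte item size assumes a float64 dtype, and the product of level counts is an upper bound on the columns a term like C(a):C(b) contributes.

```python
def interaction_columns(levels_a, levels_b):
    # Upper bound on the columns produced by an interaction of two
    # categorical factors: one column per combination of levels.
    return levels_a * levels_b


def dense_design_matrix_bytes(num_rows, num_columns, itemsize=8):
    # Bytes needed for a dense num_rows x num_columns array
    # (itemsize=8 assumes float64).
    return num_rows * num_columns * itemsize


# Hypothetical figures, not taken from the question's data.
rows = 10 ** 7
cols = interaction_columns(50, 200)
needed = dense_design_matrix_bytes(rows, cols)
print("columns:", cols)
print("GiB needed: %.1f" % (needed / 2.0 ** 30))
```

With numbers of this order a single interaction term already asks for several hundred GiB, which would match a MemoryError that does not go away on a larger machine.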
So I combed through Patsy's docs and found this gem:
patsy.incr_dbuilder(formula_like, data_iter_maker, eval_env=0)
Construct a design matrix builder incrementally from a large data set.
However, the method is sparsely documented, and the source code is mostly comments.
This is the code I have arrived at:
import csv

from patsy import dmatrix, incr_dbuilders


def iter_maker():
    # Yield one row at a time as a dict keyed by column name.
    with open("test.tsv", "r") as f:
        reader = csv.DictReader(f, delimiter="\t")
        for row in reader:
            yield row


y, dta = incr_dbuilders("s ~ C(x) + C(y):C(rgh) + \
    C(z):C(f) + C(r):C(p) + C(q):C(w) + \
    C(zr):C(rt) + C(ff):C(djjj) + C(hh):C(tt) + \
    C(bb):lat + C(jj):lng + C(ee):C(bb) + C(qq):C(uu)",
    iter_maker)
# This is the call that fails: the data argument is an empty dict.
df = dmatrix(dta, {}, 0, "drop", return_type="dataframe")
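However, this fails with PatsyError: Error evaluating factor: NameError: name 'ff' is not defined
This is thrown because _try_incr_builders (called from dmatrix) returns None on line 151 of highlevel.py.
What is the correct way to use these Patsy functions to prepare my data? Any examples or guidance you might have would be helpful.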
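One direction that may be worth trying, sketched here under assumptions rather than offered as a confirmed fix: incr_dbuilders returns a pair of design-description objects, not data, so they are normally handed to patsy.build_design_matrices together with a chunk of real data instead of being passed to dmatrix with an empty dict. In the sketch, the inline TSV string, the shortened formula, and the names y_info and x_info are hypothetical, and pandas' chunked read_csv stands in for the row-by-row csv reader.

```python
import io

import numpy as np
import pandas as pd
from patsy import build_design_matrices, incr_dbuilders

# Hypothetical stand-in for a large tab-separated file such as test.tsv.
TSV = "s\tx\tlat\n1\ta\t0.5\n0\tb\t1.5\n1\ta\t2.5\n0\tc\t3.5\n"


def iter_maker():
    # Zero-argument callable; Patsy may call it more than once, so every
    # call must start a fresh pass. Each item is a DataFrame chunk.
    return pd.read_csv(io.StringIO(TSV), sep="\t", chunksize=2)


# Pass 1: learn the columns (including every level of x) without
# allocating the full design matrix.
y_info, x_info = incr_dbuilders("s ~ C(x) + lat", iter_maker)

# Pass 2: build one small design matrix per chunk.
y_parts, x_parts = [], []
for chunk in iter_maker():
    y_chunk, x_chunk = build_design_matrices([y_info, x_info], chunk)
    y_parts.append(np.asarray(y_chunk))
    x_parts.append(np.asarray(x_chunk))

# Stacked here only to show the result; with real data each chunk would
# be consumed by the fitting step and then discarded.
y_all = np.vstack(y_parts)
X_all = np.vstack(x_parts)
print(x_info.column_names)
print(X_all.shape)
```

Because the levels of x are collected over the whole first pass, every chunk gets the same columns even when a chunk is missing some levels. The stacked arrays would still be too large for the real data, so this only helps if the model is fitted chunk by chunk.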