2015-10-16

Memory error when doing machine learning in Python with pandas

I am trying to do machine learning training/testing by sampling 100,000 rows from a larger DataFrame. It has worked as expected with random samples of 30,000-60,000 rows, but when I increase the sample to 100,000+ it gives me a memory error.

# coding=utf-8 
import pandas as pd 
from pandas import DataFrame, Series 
import numpy as np 
import nltk 
import re 
import random 
from random import randint 
import csv 
import dask.dataframe as dd 
import sys 
reload(sys) 
sys.setdefaultencoding('utf-8') 

from sklearn.linear_model import LogisticRegression 
from sklearn.feature_extraction import DictVectorizer 
from sklearn.preprocessing import Imputer 

lr = LogisticRegression() 
dv = DictVectorizer() 
imp = Imputer(missing_values='NaN', strategy='most_frequent', axis=0) 

# Get csv file into data frame 
data = pd.read_csv("file.csv", header=0, encoding="utf-8") 
df = DataFrame(data) 

# Random sampling a smaller dataframe for debugging 
rows = random.sample(df.index, 100000) 
df = df.ix[rows] # Warning!!!! overwriting original df 

# Assign X and y variables 
X = df.raw_name.values 
y = df.ethnicity2.values 

# Feature extraction functions 
def feature_full_last_name(nameString): 
    try: 
        last_name = nameString.rsplit(None, 1)[-1] 
        if len(last_name) > 1:  # reject last names with only one character 
            return last_name 
        else: 
            return '?' 
    except:  # non-string or missing values fall back to the placeholder 
        return '?' 

# Transform format of X variables, and spit out a numpy array for all features 
my_dict = [{'last-name': feature_full_last_name(i)} for i in X] 

all_dict = my_dict 

newX = dv.fit_transform(all_dict).toarray() 

# Separate the training and testing data sets 
half_cut = int(len(df)/2.0)*-1 
X_train = newX[:half_cut] 
X_test = newX[half_cut:] 
y_train = y[:half_cut] 
y_test = y[half_cut:] 

# Fitting X and y into model, using training data 
lr.fit(X_train, y_train) 

# Making predictions using trained data 
y_train_predictions = lr.predict(X_train) 
y_test_predictions = lr.predict(X_test) 

print (y_train_predictions == y_train).sum().astype(float)/(y_train.shape[0]) 
print (y_test_predictions == y_test).sum().astype(float)/(y_test.shape[0]) 

Error message:

Traceback (most recent call last): 
  File "C:\Users\Dropbox\Python_Exercises\_Scraping\BeautifulSoup\FamilySearch.org\FamSearch_Analysis\MachineLearning\FamSearch_LogReg_GOOD8.py", line 93, in <module> 
    newX = dv.fit_transform(all_dict).toarray() 
  File "E:\Program Files Extra\Python27\lib\site-packages\scipy\sparse\compressed.py", line 942, in toarray 
    return self.tocoo(copy=False).toarray(order=order, out=out) 
  File "E:\Program Files Extra\Python27\lib\site-packages\scipy\sparse\coo.py", line 274, in toarray 
    B = self._process_toarray_args(order, out) 
  File "E:\Program Files Extra\Python27\lib\site-packages\scipy\sparse\base.py", line 793, in _process_toarray_args 
    return np.zeros(self.shape, dtype=self.dtype, order=order) 
MemoryError 

How much memory do you have? –


I have 16.0 GB of memory. My Python is 2.7.6 [MSC v.1500 64 bit (AMD64)] on win32 – KubiK888


How many columns are in the data, and what are the dtypes? How large is the raw data read from the csv, and then the 100k-row sample? That original DataFrame still exists in memory, so you may want to delete it before doing the analysis. In fact, the raw data read from the csv probably still exists too. Delete it. – Alexander
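
For illustration, a sketch of that cleanup against the question's code (the sample name is hypothetical, and this assumes nothing else still references the full frames): 

import random 
import pandas as pd 

data = pd.read_csv("file.csv", header=0, encoding="utf-8") 
df = pd.DataFrame(data) 

# Take the 100k-row sample first, then drop the references to the 
# full frames so their memory can be reclaimed before vectorizing 
rows = random.sample(df.index, 100000) 
sample = df.ix[rows] 
del data, df 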

Answer


This looks wrong:

newX = dv.fit_transform(all_dict).toarray() 

because almost all estimators in scikit-learn support sparse datasets, but you are trying to make a dense array out of a sparse one. Of course it consumes a huge amount of memory: DictVectorizer one-hot encodes the last names, so if the 100,000 sampled rows contain, say, tens of thousands of distinct last names, the dense float64 array needs rows × columns × 8 bytes, i.e. tens of gigabytes. You need to avoid the todense() and toarray() methods in your code.
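
For example, a minimal sketch of the fix, reusing all_dict, df, and y from the question's script: 

from sklearn.feature_extraction import DictVectorizer 
from sklearn.linear_model import LogisticRegression 

dv = DictVectorizer() 
lr = LogisticRegression() 

# fit_transform returns a scipy.sparse CSR matrix; LogisticRegression 
# accepts sparse input directly, so it is never converted to dense 
newX = dv.fit_transform(all_dict) 

# The train/test split works unchanged: row-slicing a CSR matrix keeps it sparse 
half_cut = int(len(df)/2.0)*-1 
X_train = newX[:half_cut] 
X_test = newX[half_cut:] 
y_train = y[:half_cut] 
y_test = y[half_cut:] 

lr.fit(X_train, y_train) 

Since each row here has exactly one non-zero entry (its last-name column), the sparse matrix stores on the order of 100,000 values rather than a 100,000 × n_features dense array.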


I don't know how else to prepare the data in a format that the ML classifier can be trained on; please advise. – KubiK888


@KubiK888, just remove '.toarray()' from that line; I think everything else will still work. –


That did it, thanks – KubiK888