Memory error when doing machine learning in Python with pandas

I am trying to train/test a machine-learning model by sampling 100,000 rows from a larger DataFrame. Random samples of 30,000-60,000 rows produce the expected output, but increasing the sample to 100,000+ rows raises a MemoryError.
# coding=utf-8
import pandas as pd
from pandas import DataFrame, Series
import numpy as np
import nltk
import re
import random
from random import randint
import csv
import dask.dataframe as dd
import sys
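# Python 2-only workaround: force the default str/unicode codec to UTF-8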
reload(sys)
sys.setdefaultencoding('utf-8')
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import Imputer
lr = LogisticRegression()
dv = DictVectorizer()
imp = Imputer(missing_values='NaN', strategy='most_frequent', axis=0)
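# (note: imp is never used below; the nltk, re, csv and dask imports are unused too)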
# Get csv file into data frame
data = pd.read_csv("file.csv", header=0, encoding="utf-8")
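# (note: read_csv already returns a DataFrame, so the wrapping below is
# redundant and keeps two references to the full data alive)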
df = DataFrame(data)
# Random sampling a smaller dataframe for debugging
rows = random.sample(df.index, 100000)
df = df.ix[rows] # Warning!!!! overwriting original df
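# (.ix is deprecated in later pandas versions; df.loc[rows], or
# df.sample(n=100000) on pandas >= 0.16.1, does the same job)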
# Assign X and y variables
X = df.raw_name.values
y = df.ethnicity2.values
# Feature extraction functions
def feature_full_last_name(nameString):
    try:
        last_name = nameString.rsplit(None, 1)[-1]
        if len(last_name) > 1:  # reject names with only 1 character
            return last_name
        else:
            return '?'
    except:
        return '?'
# Transform format of X variables, and spit out a numpy array for all features
my_dict = [{'last-name': feature_full_last_name(i)} for i in X]
all_dict = my_dict
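# NOTE: fit_transform returns a scipy.sparse matrix; calling .toarray() below
# materializes the full dense n_samples x n_features float64 array in one allocation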
newX = dv.fit_transform(all_dict).toarray()
# Separate the training and testing data sets
half_cut = int(len(df)/2.0)*-1
X_train = newX[:half_cut]
X_test = newX[half_cut:]
y_train = y[:half_cut]
y_test = y[half_cut:]
# Fitting X and y into model, using training data
lr.fit(X_train, y_train)
# Making predictions using trained data
y_train_predictions = lr.predict(X_train)
y_test_predictions = lr.predict(X_test)
print (y_train_predictions == y_train).sum().astype(float)/(y_train.shape[0])
print (y_test_predictions == y_test).sum().astype(float)/(y_test.shape[0])
Error message:
Traceback (most recent call last):
File "C:\Users\Dropbox\Python_Exercises\_Scraping\BeautifulSoup\FamilySearch.org\FamSearch_Analysis\MachineLearning\FamSearch_LogReg_GOOD8.py", line 93, in <module>
newX = dv.fit_transform(all_dict).toarray()
File "E:\Program Files Extra\Python27\lib\site-packages\scipy\sparse\compressed.py", line 942, in toarray
return self.tocoo(copy=False).toarray(order=order, out=out)
File "E:\Program Files Extra\Python27\lib\site-packages\scipy\sparse\coo.py", line 274, in toarray
B = self._process_toarray_args(order, out)
File "E:\Program Files Extra\Python27\lib\site-packages\scipy\sparse\base.py", line 793, in _process_toarray_args
return np.zeros(self.shape, dtype=self.dtype, order=order)
MemoryError
How much memory do you have? –
I have 16.0 GB of RAM. My Python is 2.7.6 [MSC v.1500 64 bit (AMD64)] on win32. – KubiK888
How many columns does the data have, and what are the dtypes? How large is the original DataFrame read from the csv, and how large is the 100k-row sample? That original DataFrame still exists in memory, so you may want to delete it before running the analysis; in fact, data itself probably still exists too. Delete it. – Alexander
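A minimal sketch of the sparse route (illustrative only, not code from the thread; it assumes the same file.csv with raw_name and ethnicity2 columns): DictVectorizer.fit_transform already returns a scipy.sparse matrix, LogisticRegression.fit accepts sparse input directly, and skipping .toarray() avoids exactly the np.zeros allocation that the traceback shows failing.

# Sketch only -- the same pipeline, but the feature matrix stays sparse throughout.
import random

import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("file.csv", header=0, encoding="utf-8")  # already a DataFrame

rows = random.sample(list(df.index), 100000)
df = df.loc[rows]                      # .loc instead of the deprecated .ix

X = df.raw_name.values
y = df.ethnicity2.values

def feature_full_last_name(name_string):
    try:
        last_name = name_string.rsplit(None, 1)[-1]
        return last_name if len(last_name) > 1 else '?'
    except AttributeError:             # non-string entries fall back to '?'
        return '?'

dv = DictVectorizer()
# fit_transform returns a scipy.sparse CSR matrix; not calling .toarray()
# avoids allocating the dense 100,000 x n_features float64 array
newX = dv.fit_transform([{'last-name': feature_full_last_name(i)} for i in X])

half_cut = int(len(df) / 2.0)
X_train, X_test = newX[:-half_cut], newX[-half_cut:]
y_train, y_test = y[:-half_cut], y[-half_cut:]

lr = LogisticRegression()
lr.fit(X_train, y_train)               # sparse input is supported here
print((lr.predict(X_train) == y_train).mean())
print((lr.predict(X_test) == y_test).mean())

Keeping the matrix sparse removes the n_samples x n_features dense allocation entirely; deleting the leftover intermediate DataFrame, as Alexander suggests, frees additional memory but does not change that dominant cost.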