I need to process more than 1,000,000 text records. I am using CountVectorizer to transform my data with the code below. Vectorization in sklearn seems to be very expensive. Why is that?
TEXT = [data[i].values()[3] for i in range(len(data))]  # these are the text records
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(min_df=1)
X = vectorizer.fit_transform(TEXT)
X_list = X.toarray().tolist()
When I run this code, it raises a MemoryError. The text records I have are mostly short (about 100 words each), so the vectorization seems very expensive.
UPDATE
I added more constraints to the CountVectorizer, but I still get the MemoryError. The length of feature_names is 2391.
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(min_df=0.003, max_df=3.05, lowercase=True, stop_words='english')
X = vectorizer.fit_transform(TEXT)
feature_names = vectorizer.get_feature_names()
X_tolist = X.toarray().tolist()
Traceback (most recent call last):
File "nlp2.py", line 42, in <module>
X_tolist = X.toarray().tolist()
File "/opt/conda/lib/python2.7/site-packages/scipy/sparse/compressed.py", line 940, in toarray
return self.tocoo(copy=False).toarray(order=order, out=out)
File "/opt/conda/lib/python2.7/site-packages/scipy/sparse/coo.py", line 250, in toarray
B = self._process_toarray_args(order, out)
File "/opt/conda/lib/python2.7/site-packages/scipy/sparse/base.py", line 817, in _process_toarray_args
return np.zeros(self.shape, dtype=self.dtype, order=order)
MemoryError
Why does this happen, and how can I solve it? Thanks!!
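For a sense of scale, X.toarray() asks NumPy to allocate a fully dense matrix of shape (n_samples, n_features), which is what the np.zeros call in the traceback is doing. A rough back-of-the-envelope sketch, using the 1,000,000 records and 2391 features mentioned above and CountVectorizer's default 8-byte int64 dtype:

n_docs = 1000000      # number of text records in the post
n_features = 2391     # len(feature_names) reported in the update
bytes_per_cell = 8    # CountVectorizer defaults to int64
dense_bytes = n_docs * n_features * bytes_per_cell
print(dense_bytes / float(1024 ** 3))  # ~17.8 GiB for the dense array alone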
Could you give us access to your dataset? Also, which line raises the MemoryError? Can you give us the traceback? – bpachev
Thanks bpachev. I don't know how to give you access to the dataset because it is on a secure remote server. The MemoryError only appears when I execute X_list = X.toarray().tolist(). I was told to set both min_df and max_df; I had only set min_df. – achimneyswallow
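One way to avoid that dense conversion, sketched below under the assumption that the downstream code can consume a sparse matrix or only needs individual rows: keep X as the sparse CSR matrix returned by fit_transform (most scikit-learn estimators accept it as-is) and densify at most one row at a time. The tiny corpus here is only a stand-in for the real TEXT list, which lives on the remote server.

from sklearn.feature_extraction.text import CountVectorizer

# Stand-in for the real TEXT list from the post.
TEXT = ["the cat sat on the mat", "the dog ate my homework"]

vectorizer = CountVectorizer(min_df=1, lowercase=True, stop_words='english')
X = vectorizer.fit_transform(TEXT)  # scipy.sparse CSR matrix; memory grows with the non-zeros only

# Densify a single row when a plain Python list is really needed,
# instead of calling X.toarray().tolist() on the whole matrix:
row_as_list = X[0].toarray().ravel().tolist()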