Python - sklearn - ValueError: empty vocabulary

I am trying to recreate a project that was done previously, and I am having trouble with the CountVectorizer function. Below is the code related to the problem.

from __future__ import division 
import nltk, textmining, pprint, re, os.path 
#import numpy as np 
from nltk.corpus import gutenberg 
import fileinput 

list = ["carmilla.txt", "pirate-caribbee.txt", "rider-sage.txt"] 

for l in list: 
    f = open(l) 
    raw1 = f.read() 
    print "<-----Here goes nothing" 
    head = raw1[:680] 
    foot = raw1[157560:176380] 
    content = raw1[680:157560] 
    print "Done---->" 

content=[re.sub(r'[\']', '', text)for text in content] 
content=[re.sub(r'[^\w\s\.]', ' ', text) for text in content] 

print content 

propernouns = [] 
for story in content: 
    propernouns = propernouns+re.findall(r'Mr.[\s][\w]+', story) 
    propernouns = propernouns+re.findall(r'Mrs.[\s][\w]+', story) 
    propernouns = propernouns+re.findall(r'Ms.[\s][\w]+', story) 
    propernouns = propernouns+re.findall(r'Miss.[\s][\w]+', story) 

propernouns = set(propernouns) 
print "\nNumber of proper nouns: " + str(len(propernouns)) 
print "\nExamples from our list of proper nouns: "+str(sorted(propernouns)) 

#Strip all of the above out of text 
for word in propernouns: 
    content = [re.sub(" "+word+" "," ",story) for story in content] 

import string 
content = [story.translate(string.maketrans("",""), "_.")] 

print "\n[2] -----Carmilla Text-----" 
print content 

#Prepare a list of stopwords 
f1 = open('stopwords.txt', 'r') 
f2 = open('stopwords2.txt', 'w') 
for line in f1: 
    f2.write(line.replace('\n', ' ')) 
    f1.close() 
    f2.close() 

stopfile = open('stopwords2.txt') 

print "Examples of stopwords: " 
print stopfile.read() 

from sklearn.feature_extraction.text import CountVectorizer 
cv = CountVectorizer(stop_words = stopfile , min_df=1) 
stories_tdm = cv.fit_transform(content).toarray()

The script does not run to completion, and I get these errors:

Traceback (most recent call last): 
    File "C:\Users\mnate_000\workspace\de.vogella.python.third\src\TestFile_EDIT.py", line 84, in <module> 
    stories_tdm = cv.fit_transform(content).toarray() 
    File "C:\Users\mnate_000\Anaconda\lib\site-packages\sklearn\feature_extraction\text.py", line 780, in fit_transform 
    vocabulary, X = self._count_vocab(raw_documents, self.fixed_vocabulary) 
    File "C:\Users\mnate_000\Anaconda\lib\site-packages\sklearn\feature_extraction\text.py", line 727, in _count_vocab 
    raise ValueError("empty vocabulary; perhaps the documents only" 
**ValueError: empty vocabulary; perhaps the documents only contain stop words**

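I am not sure where to go from here. I tried replacing `content` with another file as a test, and that confirmed the stopfile is not being used. I cannot seem to get it to run properly. Has anyone else run into this problem? Am I missing something simple?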

2014-03-31 Dillon

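Remember to close your files properly. The f.close() call is missing, and neither f1.close() nor f2.close() should be indented inside the loop.

I think this may solve your problem.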

for l in list: 
    f = open(l) 
    raw1 = f.read() 
    print "<-----Here goes nothing" 
    head = raw1[:680] 
    foot = raw1[157560:176380] 
    content = raw1[680:157560] 
    print "Done---->" 
    f.close()

...

#Prepare a list of stopwords 
f1 = open('stopwords.txt', 'r') 
f2 = open('stopwords2.txt', 'w') 
for line in f1: 
    f2.write(line.replace('\n', ' ')) 
f1.close() 
f2.close()
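
As an alternative to calling close() by hand, a `with` block closes each file when the block exits, which sidesteps the indentation mistake altogether. The following is only a minimal sketch in Python 3 syntax (the post itself uses Python 2); the function name `flatten_stopwords` is hypothetical and the file names are supplied by the caller.

```python
def flatten_stopwords(src_path, dst_path):
    """Copy src_path to dst_path with every newline replaced by a space."""
    # Both files are closed when the with block exits, even if an error occurs.
    with open(src_path, "r") as src, open(dst_path, "w") as dst:
        for line in src:
            dst.write(line.replace("\n", " "))
```

It could be called as `flatten_stopwords("stopwords.txt", "stopwords2.txt")`.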

Edit: I see two problems:

One is this line: content = [story.translate(string.maketrans("",""), "_.0123456789")]

No story variable exists at this indentation level, so please clarify what is intended here.
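
If the intent was to strip underscores, periods, and digits from every document, one possible reading is a list comprehension that binds `story` to each element of `content`. This is only a sketch under that assumption, written in Python 3 syntax (where `str.maketrans` takes the place of the Python 2 `string.maketrans` call). The helper name `strip_chars` and the sample strings are illustrative, and the sketch presumes `content` is a list of whole-document strings.

```python
def strip_chars(documents, chars="_.0123456789"):
    """Return a new list with every character in chars deleted from each document."""
    # The third argument of str.maketrans lists the characters to delete.
    table = str.maketrans("", "", chars)
    return [story.translate(table) for story in documents]


# Invented sample documents standing in for the real stories.
content = ["Mr_Smith arrived in 1872.", "No digits here"]
content = strip_chars(content)
print(content)
```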

The other problem is that stop_words may be a string, a list, or None. If it is a string, the only supported value is 'english'. In your case, however, you are passing a file handle:

stopfile = open('stopwords2.txt') 
#... 
cv = CountVectorizer(stop_words = stopfile , min_df=1)
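
A short sketch of why a file handle is not interchangeable with a list of words: the question's script calls stopfile.read() before passing the handle on, and a stream that has been read to the end has no lines left to iterate over. Here `io.StringIO` is an assumed stand-in for the handle returned by open('stopwords2.txt'), and the three words are invented sample data.

```python
import io

# io.StringIO stands in for the handle returned by open('stopwords2.txt').
stopfile = io.StringIO("the and of ")

text = stopfile.read()      # consumes the whole stream, as the print statement does
leftover = list(stopfile)   # iterating afterwards yields no lines

stoplist = text.split()     # an actual list of strings
print(leftover, stoplist)
```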

What you should do is turn all of the text in the stop-word file into a list of strings. Replace this:

#Prepare a list of stopwords 
f1 = open('stopwords.txt', 'r') 
f2 = open('stopwords2.txt', 'w') 
for line in f1: 
    f2.write(line.replace('\n', ' ')) 
    f1.close() 
    f2.close() 

stopfile = open('stopwords2.txt') 

print "Examples of stopwords: " 
print stopfile.read() 

from sklearn.feature_extraction.text import CountVectorizer 
cv = CountVectorizer(stop_words = stopfile , min_df=1)

With this:

#Prepare a list of stopwords 
f1 = open('stopwords.txt', 'r') 
stoplist = [] 
for line in f1: 
    nextlist = line.replace('\n', ' ').split() 
    stoplist.extend(nextlist) 
f1.close() 

print "Examples of stopwords: " 
print stoplist 


from sklearn.feature_extraction.text import CountVectorizer 
cv = CountVectorizer(stop_words = stoplist, min_df=1)
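
For a self-contained check of the same idea, here is a small sketch that passes a plain list of strings as `stop_words`. It assumes scikit-learn is installed and uses Python 3 syntax; the two sample documents and the short stop list are invented for illustration and stand in for the stories and `stopwords.txt`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative stand-ins for the real stories and the stop-word file.
documents = [
    "the castle stood on the hill",
    "the rider crossed the hill at dawn",
]
stoplist = "the on at".split()  # a plain list of strings, not a file handle

cv = CountVectorizer(stop_words=stoplist, min_df=1)
stories_tdm = cv.fit_transform(documents).toarray()

# vocabulary_ maps each kept term to its column index in the matrix.
vocab = sorted(cv.vocabulary_)
print(vocab)
print(stories_tdm.shape)
```

The stop words never appear in `vocab`, and each row of `stories_tdm` holds the term counts for one document.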

2014-03-31 19:41:32 AndyG

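I added the `f.close()` call and adjusted the indentation; thanks for catching both. However, I am still running into the same problem. – Dillon

@Dillon: Can you tell me what `content = [story.translate(string.maketrans("",""), "_.0123456789")]` is supposed to do? That is, where does the `story` variable come from? I don't see a `story` variable at that indentation level. – AndyG

@Dillon: See my edited post. – AndyG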
