How do I write the output of vectorizer.get_feature_names() to disk?

I am using sklearn and Python to vectorize the following text:

https://gist.github.com/adolfo255/be2bc75327e288d4d090659e231fa487

My code looks like this:

#!/usr/bin/env python 
# -*- coding: utf-8 
from sklearn.feature_extraction.text import TfidfVectorizer 
import pandas as pd 
f = open('text.txt') 
corpus= [] 
for line in f: 
     corpus.append(line), 
print(corpus)  
vectorizer = TfidfVectorizer(min_df=1,ngram_range=(1, 5),analyzer='char') 
X = vectorizer.fit_transform(corpus) 
idf = vectorizer.idf_ 
#print dict(zip(vectorizer.get_feature_names(), idf)) 
print (vectorizer.get_feature_names()) 
output= vectorizer.get_feature_names() 
target = open("output.txt", 'w') 
for line in output: 
    target.write(line), 
target.close() 
print(target)

Everything works until I try to write the output. I want to write to disk what the last print call shows, that is:

print (vectorizer.get_feature_names())

I tried the following:

output= vectorizer.get_feature_names() 
target = open("output.txt", 'w') 
for line in output: 
    target.write(line), 
target.close() 
print(target)

But this approach did not work. I got:

'ascii' codec can't encode character u'\xfa' in position 4: ordinal not in range(128) 
UnicodeEncodeError Traceback (most recent call last) 
main.py in <module>() 
    16 target = open("output.txt", 'w') 
    17 for line in output: 
---> 18  target.write(line), 
    19 target.close() 
    20 print(target) 
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfa' in position 4: ordinal not in range(128) 
File written 
output.txt

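I would appreciate any suggestions on how to achieve this, since I need to analyze the output afterwards. The problem is related to encoding, but I do not know how to fix it.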


2016-05-09 neo33

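To convert your str (a.k.a. unicode) objects into the bytes that are written to a file, Python has to encode them with some encoding. For some reason (either your system's default encoding or code you did not paste), Python is using the ASCII encoding, which cannot represent some of the code points in your objects.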
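To make the failure concrete, here is a minimal sketch of the encoding step that write performs behind the scenes. The sample string is hypothetical; it was chosen only because it contains u'\xfa' (ú), the character named in the traceback.

```python
# Hypothetical sample: a character n-gram containing U+00FA (the letter u with acute).
text = u"ning\xfa"

# ASCII only covers code points 0-127, so encoding this string fails.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError as exc:
    ascii_ok = False
    print("ASCII failed:", exc)

# UTF-8 can represent every Unicode code point, so the same string encodes fine.
utf8_bytes = text.encode("utf-8")
print(repr(utf8_bytes))
```

The same UnicodeEncodeError appears whenever a file object falls back to ASCII, which is why the fix is to choose the encoding explicitly rather than to change the data.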

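Since the print(...) calls appear without any from __future__ import print_function, I am assuming this is Python 3. One fix is to make sure that the encoding used to write the file is UTF-8. On my system, that is the default: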

>>> import locale 
>>> locale.getpreferredencoding(False) 
'UTF-8'

So on my machine the code you pasted works fine. You can specify the encoding in your open calls to override the default, e.g. open('text.txt', encoding='utf-8') and open('output.txt', 'w', encoding='utf-8') (docs).
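As a rough illustration of the Python 3 approach, the sketch below writes a list of names with an explicit encoding and reads it back. The feature_names list is a made-up stand-in for the result of vectorizer.get_feature_names(), and the temporary output path is an assumption so the example does not touch your real files.

```python
import os
import tempfile

# Made-up stand-in for vectorizer.get_feature_names().
feature_names = [u"n", u"ni", u"\xfa", u"g\xfa", u"ning\xfa"]

# Hypothetical output location inside a fresh temporary directory.
out_path = os.path.join(tempfile.mkdtemp(), "output.txt")

# Passing encoding explicitly overrides the locale default (ASCII on some systems).
with open(out_path, "w", encoding="utf-8") as target:
    for name in feature_names:
        target.write(name + "\n")  # one name per line; the original loop wrote no separator

# Read the file back with the same encoding to confirm nothing was lost.
with open(out_path, encoding="utf-8") as source:
    read_back = source.read().splitlines()

print(read_back == feature_names)
```

Writing one name per line is a choice made here for readability; it only round-trips cleanly if no feature name itself contains a line break.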

A good reference for understanding these issues is the Unicode HOWTO.

If you are actually using Python 2, then you will probably need to use codecs.open, as described in the Unicode HOWTO for Python 2.
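For the Python 2 case, a minimal sketch with codecs.open might look like the following. The feature_names list and the temporary path are illustrative assumptions; the snippet sticks to syntax that both Python 2.7 and Python 3 accept.

```python
import codecs
import os
import tempfile

# Made-up stand-in for vectorizer.get_feature_names().
feature_names = [u"n", u"ni", u"\xfa", u"g\xfa"]

# Hypothetical output location inside a fresh temporary directory.
out_path = os.path.join(tempfile.mkdtemp(), "output.txt")

# codecs.open returns a wrapper that encodes unicode to UTF-8 bytes on write.
with codecs.open(out_path, mode="w", encoding="utf-8") as target:
    for name in feature_names:
        target.write(name + u"\n")

# The same wrapper decodes UTF-8 bytes back to unicode on read.
with codecs.open(out_path, mode="r", encoding="utf-8") as source:
    read_back = [line.rstrip(u"\n") for line in source]

print(read_back == feature_names)
```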


2016-05-09 18:17:22

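Hello, sorry about the version: I am working with Python 2.7.10. When I add the line f = open('text.txt', encoding='utf-8'), I get the following error: 'encoding' is an invalid keyword argument for this function. I do not know whether this is related to the version I am using. – neo33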

Yes, in Python 2.7.10 'open' does not accept an 'encoding' argument; that is why the HOWTO recommends using 'codecs.open' instead, as I mentioned. –

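Since this code is written for Python 2.7.10, the problem was solved by using codecs, together with the chardet library to check the encoding of the data, as follows: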

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Python 2.7: codecs.open decodes on read and encodes on write.
import codecs
import chardet
from sklearn.feature_extraction.text import TfidfVectorizer
# Ask chardet which encoding the raw bytes of the input appear to use.
rawdata = open('text.txt', "r").read()
print(chardet.detect(rawdata))
# Read the corpus as unicode, one document per line.
f = codecs.open('text.txt', encoding='utf-8')
print type(f)
corpus = []
for line in f:
    corpus.append(line)
f.close()
vectorizer = TfidfVectorizer(min_df=1, ngram_range=(1, 5), analyzer='char')
X = vectorizer.fit_transform(corpus)
idf = vectorizer.idf_
print(vectorizer.get_feature_names())
output = vectorizer.get_feature_names()
print type(output)
# Write the feature names to disk, encoded as UTF-8.
target = codecs.open('output.txt', encoding='utf-8', mode='w+')
for line in output:
    target.write(line)
target.close()
print(target)
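One caveat about the loop above: it writes the names back to back with no separator, and character n-grams can themselves contain spaces or line breaks, so the individual names may be hard to recover for later analysis. A possible alternative, assuming JSON is an acceptable file format, is sketched below in Python 3 syntax. The feature_names list and the features.json path are made up for illustration.

```python
import json
import os
import tempfile

# Made-up stand-in for vectorizer.get_feature_names(); note the space and newline n-grams.
feature_names = [u" ", u"a\n", u"\xfa", u"g\xfa", u"ning\xfa"]

# Hypothetical output location inside a fresh temporary directory.
out_path = os.path.join(tempfile.mkdtemp(), "features.json")

# ensure_ascii=False keeps non-ASCII characters readable in the file,
# so the file itself must be opened with a Unicode-capable encoding.
with open(out_path, "w", encoding="utf-8") as target:
    json.dump(feature_names, target, ensure_ascii=False)

# Loading the file restores the exact list, including whitespace-only names.
with open(out_path, encoding="utf-8") as source:
    restored = json.load(source)

print(restored == feature_names)
```

JSON escapes the line break inside a name, so every feature comes back exactly as it was written.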


2016-05-09 19:53:37 neo33
