2013-11-02

import csv
import io

def openFile(fileName):
    try:
        trainFile = io.open(fileName, "r", encoding="utf-8")
    except IOError as e:
        print("File could not be opened: {}".format(e))
    else:
        trainData = csv.DictReader(trainFile)
        print trainData
        return trainData

def computeTFIDF(trainData):
    bodyList = []
    print "Inside computeTFIDF"
    for row in trainData:
        for key, value in row.iteritems():
            print key, unicode(value, "utf-8", "ignore")
    print "Done"
    return

if __name__ == "__main__":
    print "Main"
    trainData = openFile("../Data/TrainSample.csv")
    print "File Opened"
    computeTFIDF(trainData)

Error: DictReader and UnicodeError

Traceback (most recent call last): 
    File "C:\DebSeal\IUB MS Program\IUB Sem III\Facebook Kaggle Comp\Src\facebookChallenge.py", line 62, in <module> 
    computeTFIDF(trainData) 
    File "C:\DebSeal\IUB MS Program\IUB Sem III\Facebook Kaggle Comp\Src\facebookChallenge.py", line 42, in computeTFIDF 
    for row in trainData: 
    File "C:\Python27\lib\csv.py", line 104, in next 
    row = self.reader.next() 
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 215: ordinal not in range(128) 

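TrainSample.csv: a CSV file with 4 columns (including a header row).
Operating system: Windows 7 64-bit.
Using Python 2.x.

I don't know what is going wrong here. I told it to ignore encoding errors, but it still throws the same error.

I think the error is thrown before control even reaches the decoding step.

Can anyone tell me where I am going wrong?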

Answer

The Python 2 csv module does not handle Unicode input.
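
To illustrate the failure mode: the traceback suggests that the reader was handed a unicode line (io.open with an encoding yields unicode text) and tried to convert it to a byte string with the default ASCII codec. The sketch below is not the csv module's internal code; it only reproduces the same kind of error by encoding a line containing U+201C as ASCII. The sample text is made up.

```python
# A made-up line containing the curly quote U+201C from the traceback.
line = u'\u201cquoted\u201d,value'

try:
    # Roughly what an implicit unicode -> byte string conversion does
    # when the default codec is ASCII.
    line.encode('ascii')
except UnicodeEncodeError as e:
    message = str(e)
else:
    message = None

print(message)
```

The exception is raised while the row is being read, which matches the asker's suspicion that the "ignore" decoding in the loop body is never reached.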

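Open the file in binary mode and decode after parsing it as CSV. This is safe with the UTF-8 codec, because newlines, delimiters and quote characters are all encoded as a single byte.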
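
A quick way to convince yourself of the single-byte claim is to check it directly. This is only a sketch; the sample string and the helper name multibyte_bytes_are_high are made up for illustration. It checks that the characters the CSV parser cares about are one byte each in UTF-8, and that the bytes of a multi-byte character never fall in the ASCII range, so they cannot be mistaken for a delimiter. UTF-16 is shown as a contrast where this does not hold.

```python
# Characters the CSV parser treats specially (default excel dialect).
csv_specials = u',"\r\n'

# Each of them is exactly one byte in UTF-8.
single_byte = all(len(ch.encode("utf-8")) == 1 for ch in csv_specials)

def multibyte_bytes_are_high(text, encoding="utf-8"):
    """Return True if every byte of every multi-byte character is >= 0x80."""
    for ch in text:
        encoded = bytearray(ch.encode(encoding))  # bytearray yields ints on 2 and 3
        if len(encoded) > 1 and any(b < 0x80 for b in encoded):
            return False
    return True

sample = u'\u201cHello\u201d, world'  # made-up text with curly quotes
utf8_is_safe = multibyte_bytes_are_high(sample)

# Contrast: in UTF-16 a comma takes two bytes, so byte-level parsing is not safe.
utf16_comma_length = len(u','.encode("utf-16-le"))

print(single_byte, utf8_is_safe, utf16_comma_length)
```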

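The csv module documentation includes a UnicodeReader wrapper class in its examples section that will do the decoding for you; it is easily adapted to the DictReader class: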

import csv

class UnicodeDictReader:
    """
    A CSV reader which will iterate over lines in the CSV file "f",
    which is encoded in the given encoding.
    """

    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        self.encoding = encoding
        self.reader = csv.DictReader(f, dialect=dialect, **kwds)

    def next(self):
        row = self.reader.next()
        # Decode each value with the configured encoding (keys stay byte strings).
        return {k: unicode(v, self.encoding) for k, v in row.iteritems()}

    def __iter__(self):
        return self

Use this with the file opened in binary mode:

def openFile(fileName):
    try:
        trainFile = open(fileName, "rb")
    except IOError as e:
        print "File could not be opened: {}".format(e)
    else:
        return UnicodeDictReader(trainFile)
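
One edge case worth keeping in mind: csv.DictReader fills in missing fields of a short row with restval (None by default) and collects surplus fields of a long row as a list under restkey, and calling unicode() on None or on a list raises a TypeError. The helper below is a hypothetical sketch, not part of the csv documentation's example; the name decode_row is made up. It decodes byte-string values and passes the other two cases through.

```python
def decode_row(row, encoding="utf-8"):
    """Decode the byte-string values of one csv.DictReader row to text."""
    decoded = {}
    for key, value in row.items():
        if value is None:
            # Short row: DictReader filled the gap with restval (None).
            decoded[key] = None
        elif isinstance(value, list):
            # Long row: surplus fields are collected in a list under restkey.
            decoded[key] = [item.decode(encoding) for item in value]
        else:
            decoded[key] = value.decode(encoding)
    return decoded

# Made-up row: one UTF-8 value, one missing value, one surplus field.
example = decode_row({"Title": b"\xe2\x80\x9cHi\xe2\x80\x9d", "Body": None, None: [b"extra"]})
```

Inside the wrapper class, the next method could then return decode_row(row, self.encoding) instead of the dict comprehension. As in the class above, the keys are left undecoded.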

Thanks Martijn, I spent so much time just trying to figure this out. Thanks a tonnnnn ........ It solved the problem...... :)

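@thetoolman: Sorry, but adding a workaround for versions before 2.7 really isn't necessary. 2.7 is the de facto Python 2 version these days; people using Python 2.6 or earlier can work out a replacement for the dict comprehension themselves.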