2017-01-17

How to read unicode from a CSV into a DataFrame

When I read a CSV file with pandas.read_csv, I get this string:

'_\xf4\xd6_' 

I cannot normalize it (drop the non-ASCII characters):

>>> '_\xf4\xd6_'.encode("ascii","ignore") 
Traceback (most recent call last): 
    File "<stdin>", line 1, in <module> 
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf4 in position 1: ordinal not in range(128) 

What I want is:

>>> u'_\xf4\xd6_'.encode("ascii","ignore") 
'__' 
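
For context, here is a small sketch of why the first call fails, assuming Python 2 semantics: calling .encode on a byte str first decodes it implicitly with the ascii codec, and that hidden step is what raises UnicodeDecodeError. Decoding explicitly avoids it. The latin-1 codec below is an assumption about the file's encoding; the bytes alone do not say which encoding produced them.

```python
raw = b"_\xf4\xd6_"  # the bytes as they come out of the CSV

# Reproduce the implicit step Python 2 performs before encoding a byte string.
try:
    raw.decode("ascii")
    implicit_decode_ok = True
except UnicodeDecodeError:
    implicit_decode_ok = False

# Decode explicitly (latin-1 is an assumed encoding), then drop non-ASCII.
text = raw.decode("latin-1")
cleaned = text.encode("ascii", "ignore")
```

The snippet is written to behave the same way on Python 2 and Python 3; on Python 3 the result is a bytes object rather than a str.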

IOW, I need to either

  • tell pandas.read_csv to read strings as unicode, or
  • somehow convert str to unicode myself.

How do I do that?

PS. For completeness, here is the code (see Get non-null elements in a pandas DataFrame):

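import re

import pandas as pd

def normalize(s):
    "Clean-up the string: drop non-ASCII, normalize whitespace."
    # Collapse whitespace runs to one space, then drop non-ASCII characters.
    return re.sub(r"\s+"," ",s,flags=re.UNICODE).encode("ascii","ignore")

df = pd.read_csv("foo.csv",low_memory=False)
# stack is a method and must be called before tolist().
my_strings = [normalize(s) for s in df[my_cols].stack().tolist()]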

PPS. I have no control over the contents of the CSV file (i.e., I cannot fix the problem by writing the CSV file correctly).

Answers

Here is an alternative normalize that uses bytearray:

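import re

def drop_non_ascii(s):
    # Python 2: unicode input is encoded to str; byte-string input goes
    # through bytearray and is decoded, so it comes back as unicode.
    if isinstance(s, unicode):
        return s.encode("ascii",errors="ignore")
    return bytearray(s).decode("ascii",errors="ignore")

def normalize(s):
    "Clean-up the string: drop non-ASCII, normalize whitespace."
    return drop_non_ascii(re.sub(r"\s+"," ",s,flags=re.UNICODE))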

Is this the correct solution?
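
Another option that may be worth trying is the encoding parameter of pandas.read_csv, so the strings are decoded while the file is read. This is only a sketch under assumptions: the in-memory bytes stand in for foo.csv, the column name "name" is made up for illustration, and latin-1 is a guess at the file's real encoding.

```python
import io
import re

import pandas as pd


def normalize(s):
    """Collapse whitespace runs, then drop non-ASCII characters."""
    return re.sub(r"\s+", " ", s, flags=re.UNICODE).encode("ascii", "ignore")


# Hypothetical stand-in for foo.csv: one column holding the problem string.
csv_bytes = b"name\n_\xf4\xd6_\n"

# encoding= tells read_csv how to decode the bytes (latin-1 is assumed here).
df = pd.read_csv(io.BytesIO(csv_bytes), encoding="latin-1")

cleaned = [normalize(s) for s in df["name"].tolist()]
```

If the file is really in another encoding (cp1252, say), only the encoding argument should need to change; normalize stays the same.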