2017-06-27 26 views
1

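I've looked at several answers, including this one, but none of them seems to answer my question. How do I handle non-ASCII characters coming from a CSV when using json.loads in Python?

Here are some example lines from the CSV: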

_id category 
ObjectId(56266da778d34fdc048b470b) [{"group":"Home","id":"53cea0be763f4a6f4a8b459e","name":"Cleaning Services","name_singular":"Cleaning Service"}] 
ObjectId(56266e0c78d34f22058b46de) [{"group":"Local","id":"5637a1b178d34f20158b464f","name":"Balloon Dí©cor","name_singular":"Balloon Dí©cor"}] 

Here is my code:

import csv
import sys

from sys import argv
import json


def ReadCSV(csvfile):
    with open('newCSVFile.csv', 'wb') as g:
        filewriter = csv.writer(g)  # , delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)

        with open(csvfile, 'rb') as f:
            reader = csv.reader(f)  # create reader object
            next(reader)  # skip first row

            for row in reader:  # go through all the rows
                listForExport = []  # initialize list that will have two items: id and list of categories

                # ID section
                vendorId = str(row[0])  # pull the raw vendor id out of the first column of the csv
                vendorId = vendorId[9:33]  # slice to remove the ObjectId label and parentheses
                listForExport.append(vendorId)  # add vendor ID as the first item in the list


                # categories section
                tempCatList = []  # temporary list of categories for the second item in listForExport

                # this is line 41, where the error originates
                categories = json.loads(row[1]) #create's a dict with the categoreis from a given row

                for names in categories:  # loop through the categories and read the key 'name'

                    print names['name']

Here is what I get:

Cleaning Services 
Traceback (most recent call last): 
    File "csvtesting.py", line 57, in <module> 
    ReadCSV(csvfile) 
    File "csvtesting.py", line 41, in ReadCSV 
    categories = json.loads(row[1]) #create's a dict with the categoreis from a given row 
    File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/__init__.py", line 338, in loads 
    return _default_decoder.decode(s)
    File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.py", line 366, in decode 
    obj, end = self.raw_decode(s, idx=_w(s, 0).end()) 
    File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.py", line 382, in raw_decode 
    obj, end = self.scan_once(s, idx) 
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 9-10: invalid continuation byte 

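So the code prints the first category, Cleaning Services, but fails as soon as it hits the non-ASCII characters.

How do I handle this? I'm happy to drop any non-ASCII items.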

+0

Have you tried `your_string.encode('unicode_escape').decode('utf-8','ignore')`? –

+0

No. Where in the code would I put that? – dwstein

+0

I think in this case `your_string` is just `names['name']`. –

Answers

1

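Since you open the input csv file in rb mode, I assume you are using Python 2.x. The good news is that the csv part is not the problem, because the csv reader reads plain bytes without trying to interpret them. The json module, however, insists on decoding the text to unicode, and uses utf8 by default. Since your input file is not utf8-encoded, that raises a UnicodeDecodeError.

Latin1 has a nice property: the Unicode value of any byte is simply the value of that byte, so decoding is guaranteed to succeed on any input. Whether the result makes sense depends on whether the actual encoding really is Latin1...

So you could simply do: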

categories = json.loads(row[1], encoding="Latin1") 
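To illustrate the Latin1 property described above, here is a minimal sketch that decodes explicitly before parsing instead of passing `encoding=` to `json.loads`. The sample bytes are an assumption made up for the demonstration (they are not taken from your file), and the snippet is written so that it should run on both Python 2.7 and Python 3:

```python
import json

# Every possible byte value decodes under Latin1 to the code point
# with the same number, so this decode can never raise.
all_bytes = bytes(bytearray(range(256)))
decoded = all_bytes.decode("latin-1")
assert [ord(ch) for ch in decoded] == list(range(256))

# Hypothetical row content: bytes 0xED 0xA9 are not valid UTF-8 here,
# but they are perfectly decodable as Latin1.
raw = b'[{"group":"Local","name":"Balloon D\xed\xa9cor"}]'

text = raw.decode("latin-1")      # bytes -> unicode, cannot fail
categories = json.loads(text)     # parse the already-decoded text
names = [item["name"] for item in categories]
```

Decoding first keeps the encoding decision in one visible place, which may be easier to change later if the file turns out to use a different encoding.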

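Alternatively, if you want to ignore the non-ASCII characters, you can first convert the byte string to unicode while ignoring errors, and only then load the JSON: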

categories = json.loads(row[1].decode(errors='ignore'))  # ignore all non ascii characters
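As a rough sketch of the "ignore" approach, the helper below names the ASCII codec explicitly rather than relying on the default. The function name `loads_ascii_only` and the sample bytes are hypothetical, introduced only for this illustration:

```python
import json


def loads_ascii_only(raw_bytes):
    """Drop every non-ASCII byte, then parse what is left as JSON."""
    text = raw_bytes.decode("ascii", "ignore")  # undecodable bytes are skipped
    return json.loads(text)


# Hypothetical row content with two non-ASCII bytes inside the name.
raw = b'[{"group":"Local","name":"Balloon D\xed\xa9cor"}]'

categories = loads_ascii_only(raw)
names = [item["name"] for item in categories]
```

Note that this silently changes the data: the offending bytes are removed, so the name comes back shortened rather than corrected.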
+0

Excellent!! Thanks for your help. Because those characters cause problems later in the code, I went with 'ignore'. – dwstein

0

Most likely your csv content contains some non-ASCII characters.

import re

def remove_unicode(text):
    if not text:
        return text

    if isinstance(text, str):
        text = str(text.decode('ascii', 'ignore'))
    else:
        text = text.encode('ascii', 'ignore')

    remove_ctrl_chars_regex = re.compile(r'[^\x20-\x7e]')

    return remove_ctrl_chars_regex.sub('', text)

...
vendorId = remove_unicode(row[0])
...
categories = json.loads(remove_unicode(row[1]))
+0

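Tried it. Now I get the following error: UnicodeDecodeError: 'utf8' codec can't decode bytes in position 67-68: invalid continuation byte, pointing at the line 'for row in reader:'. – dwstein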

+0

I think your csv contains some other characters besides unicode, so why not remove them all? – hspandher

+0

I'd be happy to. How do I do that? – dwstein
