How do I handle non-ASCII characters from a CSV when using json.loads in Python?

I've looked at some answers, including this one, but none seem to answer my question.

Here are some example lines from the CSV:

_id category 
ObjectId(56266da778d34fdc048b470b) [{"group":"Home","id":"53cea0be763f4a6f4a8b459e","name":"Cleaning Services","name_singular":"Cleaning Service"}] 
ObjectId(56266e0c78d34f22058b46de) [{"group":"Local","id":"5637a1b178d34f20158b464f","name":"Balloon Dí©cor","name_singular":"Balloon Dí©cor"}]

Here is my code:

import csv
import sys

from sys import argv
import json


def ReadCSV(csvfile):
    with open('newCSVFile.csv', 'wb') as g:
        filewriter = csv.writer(g)  # , delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)

        with open(csvfile, 'rb') as f:
            reader = csv.reader(f)  # create reader object
            next(reader)  # skip first row

            for row in reader:  # go through all the rows
                listForExport = []  # initialize list that will have two items: id and list of categories

                # ID section
                vendorId = str(row[0])  # pull the raw vendor id out of the first column of the csv
                vendorId = vendorId[9:33]  # slice to remove the ObjectId label and parentheses
                listForExport.append(vendorId)  # add vendor ID as first item in list

                # categories section
                tempCatList = []  # temporary list of categories for the second item in listForExport

                # this is line 41, where the error stems from
                categories = json.loads(row[1]) #create's a dict with the categoreis from a given row

                for names in categories:  # loop through the category names using the key 'name'
                    print names['name']

Here is what I get:

Cleaning Services 
Traceback (most recent call last):
  File "csvtesting.py", line 57, in <module>
    ReadCSV(csvfile)
  File "csvtesting.py", line 41, in ReadCSV
    categories = json.loads(row[1]) #create's a dict with the categoreis from a given row
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/__init__.py", line 338, in loads
    return _default_decoder.decode(s)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.py", line 366, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.py", line 382, in raw_decode
    obj, end = self.scan_once(s, idx)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 9-10: invalid continuation byte

So the code prints the first category, Cleaning Services, but fails as soon as it hits the non-ASCII characters.

How do I deal with this? I'm happy to remove any non-ASCII items.

2017-06-27 dwstein

Have you tried `your_string.encode('unicode_escape').decode('utf-8', 'ignore')`?

No. Where in the code would I put that? – dwstein

I think in this case `your_string` is just `names['name']`.

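Since you open the input CSV file in rb mode, I assume you're using Python 2.x. The good news is that the csv part is fine, because the csv reader reads plain bytes without trying to interpret them. The json module, however, insists on decoding the text to unicode, and it uses utf8 by default. Since your input file is not utf8 encoded, that raises a UnicodeDecodeError.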
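
The failure can be reproduced without the CSV file at all. The sketch below is written for Python 3.6 or later (an assumption, since the question uses Python 2), where `json.loads` also accepts bytes and decodes them as UTF-8. The sample bytes and the `try_parse` helper are hypothetical: the real bytes in the asker's file are unknown, so a Latin-1 encoded "é" (byte 0xE9) stands in for them.

```python
import json

# Hypothetical sample: "Décor" with é stored as the single Latin-1 byte 0xE9,
# which is not valid UTF-8 when followed by an ASCII letter.
raw = b'[{"name":"Balloon D\xe9cor"}]'


def try_parse(data):
    """Return the parsed JSON, or the UnicodeDecodeError raised while decoding."""
    try:
        return json.loads(data)
    except UnicodeDecodeError as exc:
        return exc


result = try_parse(raw)
print(type(result).__name__)
```

Pure ASCII input such as `b'[1]'` goes through the same helper without any error, which matches the first row printing fine in the question.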

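Latin1 has a nice property: the Unicode value of any byte is simply the value of that byte, so you are guaranteed to be able to decode anything. Whether the result makes sense depends on whether the actual encoding is Latin1...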
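
That property is easy to check directly. This is a small Python 3 sketch (the variable names are illustrative only) that decodes all 256 possible byte values and compares each character's code point with the original byte.

```python
# Every possible byte value, 0x00 through 0xFF.
all_bytes = bytes(range(256))

# Decoding as Latin-1 never fails, whatever the bytes are.
text = all_bytes.decode('latin-1')

# Each character's code point equals the value of the byte it came from.
code_points = [ord(ch) for ch in text]
print(code_points == list(range(256)))

# Encoding back to Latin-1 restores the original bytes exactly.
round_trip = text.encode('latin-1')
print(round_trip == all_bytes)
```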

So you could simply do:

categories = json.loads(row[1], encoding="Latin1")
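
The `encoding` keyword is a Python 2 feature of `json.loads`; Python 3 ignores it, and Python 3.9 removed it. A rough Python 3 equivalent, assuming the cell really is Latin-1 encoded, is to decode the bytes yourself before parsing. The sample bytes below are hypothetical.

```python
import json

# Hypothetical cell content, with é stored as the Latin-1 byte 0xE9.
raw = b'[{"group":"Local","name":"Balloon D\xe9cor"}]'

# Decode explicitly, then hand json.loads a text string.
categories = json.loads(raw.decode('latin-1'))

names = [category['name'] for category in categories]
print(names)
```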

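Alternatively, if you want to ignore non-ASCII characters, you can first convert the byte string to unicode while ignoring errors, and only then load the JSON: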

categories = json.loads(row[1].decode(errors='ignore'))  # ignore all non-ASCII characters
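
In Python 2, `decode` defaults to ASCII, which is why the line above drops non-ASCII bytes. In Python 3 the default is UTF-8, so the codec has to be named explicitly to get the same effect. A minimal Python 3 sketch with hypothetical sample bytes:

```python
import json

# Hypothetical cell content, with one non-ASCII byte (0xE9).
raw = b'[{"name":"Balloon D\xe9cor"}]'

# Decode as ASCII and silently drop every byte that is not ASCII.
cleaned = raw.decode('ascii', errors='ignore')

categories = json.loads(cleaned)
print(categories[0]['name'])
```

The offending character is removed rather than replaced, so the name loses a letter instead of gaining a placeholder.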

2017-06-27 13:33:01

Excellent!! Thanks for the help. Since these characters would cause problems later in the code, I went with 'ignore'. – dwstein

Most likely your CSV content contains some non-ASCII characters.

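import re

def remove_unicode(text):
    if not text:
        return text

    # Python 2: str is a byte string, so decode it and drop non-ASCII bytes
    if isinstance(text, str):
        text = str(text.decode('ascii', 'ignore'))
    else:
        # unicode object: encode to ASCII and drop whatever cannot be encoded
        text = text.encode('ascii', 'ignore')

    # strip everything outside printable ASCII, control characters included
    remove_ctrl_chars_regex = re.compile(r'[^\x20-\x7e]')

    return remove_ctrl_chars_regex.sub('', text)

...
vendorId = remove_unicode(row[0])
...
categories = json.loads(remove_unicode(row[1]))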
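
The helper above relies on Python 2's `str` being a byte string. For comparison, here is a sketch of the same idea for Python 3, where bytes and text are separate types. The name `remove_non_ascii` is made up for this example and is not part of the answer.

```python
import re

# Matches anything outside printable ASCII (space through tilde).
_NON_PRINTABLE_ASCII = re.compile(r'[^\x20-\x7e]')


def remove_non_ascii(text):
    """Return text as a str containing only printable ASCII characters."""
    if not text:
        return text

    # Bytes are decoded first; bytes that are not ASCII are dropped.
    if isinstance(text, bytes):
        text = text.decode('ascii', 'ignore')

    # Remove remaining non-ASCII characters and control characters.
    return _NON_PRINTABLE_ASCII.sub('', text)


print(remove_non_ascii(b'Balloon D\xe9cor'))
print(remove_non_ascii('Balloon D\u00e9cor\n'))
```

Note that the pattern also removes tabs and newlines, which is harmless for a single CSV cell but would matter if it were applied to a whole file.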

2017-06-27 12:34:33 hspandher

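Tried it. Now I get the following error: UnicodeDecodeError: 'utf8' codec can't decode bytes in position 67-68: invalid continuation byte, pointing at the line `for row in reader:`. – dwstein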

I guess your CSV contains some other characters besides unicode ones. Why not remove them all? – hspandher

I'd be happy to. How do I do that? – dwstein
