2017-03-08 28 views

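What is the best practice in Python 3 for skipping non-ASCII characters in mixed-encoding text? I was able to import a text file into an Elasticsearch index on my local machine.

Despite using a virtual environment, the production machine is a nightmare, because I keep getting errors like this: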

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 79: ordinal not in range(128) 

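I am using Python 3, and I personally had fewer problems with Python 2; maybe this is just the frustration of having wasted a few hours.

I don't understand why I cannot strip or handle the non-ASCII characters.

I tried importing: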

from unidecode import unidecode 
def remove_non_ascii(text): 
    return unidecode(unicode(text, encoding = "utf-8")) 

With Python 2, no success.

Back to Python 3:

import string 
printable = set(string.printable) 

''.join(filter(lambda x: x in printable, 'mixed non ascii string'))

No success.

import codecs 
with codecs.open(path, encoding='utf8') as f: 
.... 

No success.
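
For reference, in Python 3 the built-in `open` takes the same `encoding` argument plus an `errors` policy, so undecodable bytes can be replaced or dropped instead of raising. A minimal sketch, assuming the file is mostly UTF-8 with a few stray bytes; `read_lines_lenient` and the sample data are hypothetical and not part of the original script:

```python
import os
import tempfile

def read_lines_lenient(path, encoding="utf-8", errors="replace"):
    """Yield lines decoded with an explicit encoding and error policy."""
    # errors="replace" turns each undecodable byte into U+FFFD;
    # errors="ignore" would silently drop it instead.
    with open(path, encoding=encoding, errors=errors) as f:
        for line in f:
            yield line.rstrip("\n")

# Hypothetical sample: valid UTF-8 (0xc2 0xa0) followed by a stray Latin-1 byte (0xe9).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"1\tsrc\tname\tcaf\xc2\xa0\xe9 au lait\n")

lines = list(read_lines_lenient(tmp.name))
os.unlink(tmp.name)
print(lines)
```

Passing `encoding` explicitly means the result no longer depends on the machine's locale settings.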

I tried:

# -*- coding: utf-8 -*- 

No success.

https://docs.python.org/2/library/unicodedata.html#unicodedata.normalize

No success...
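
For context, `unicodedata.normalize` does not remove anything by itself; it only decomposes characters, and a separate step has to discard what is left outside ASCII. A minimal sketch of that combination, assuming the input is already a decoded `str`; the helper name `to_ascii` is hypothetical:

```python
import unicodedata

def to_ascii(text):
    """Best-effort ASCII folding: decompose, then drop what is not ASCII."""
    # NFKD splits an accented letter into base letter + combining mark,
    # e.g. U+00E9 becomes "e" + U+0301; the mark is dropped by the encode step.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

# "caf\u00e9" is "cafe" with an acute accent, "\u00a0" is a no-break space.
print(to_ascii("caf\u00e9\u00a0cr\u00e8me"))
```

Characters without an ASCII-compatible decomposition (CJK text, for example) are simply removed by this approach.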

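None of the above seems able to strip or handle the non-ASCII characters, which is very frustrating. This is the code I run, followed by the error it produces: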

with open(path) as f:
    for line in f:
        line = line.replace('\n', '')
        el = line.split('\t')
        print(el)
        _id = el[0]
        _source = el[1]
        _name = el[2]
        # _description = ''.join(filter(lambda x: x in printable, el[-1]))
        #
        _description = remove_non_ascii(el[-1])
        print(_id, _source, _name, _description, setTipe(_source))
        action = {
            "_index": _indexName,
            "_type": setTipe(_source),
            "_id": _source,
            "_source": {
                "name": _name,
                "description": _description
            }
        }
        helpers.bulk(es, [action])

    File "<stdin>", line 22, in <module> 
    File "/usr/local/lib/python2.7/dist-packages/elasticsearch/helpers/__init__.py", line 194, in bulk 
    for ok, item in streaming_bulk(client, actions, **kwargs): 
    File "/usr/local/lib/python2.7/dist-packages/elasticsearch/helpers/__init__.py", line 162, in streaming_bulk 
    for result in _process_bulk_chunk(client, bulk_actions, raise_on_exception, raise_on_error, **kwargs): 
    File "/usr/local/lib/python2.7/dist-packages/elasticsearch/helpers/__init__.py", line 87, in _process_bulk_chunk 
    resp = client.bulk('\n'.join(bulk_actions) + '\n', **kwargs) 
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 79: ordinal not in range(128) 

I would like a "definitive" practice for handling Python 3 encoding issues - I use the same script on different machines and get different results...
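
One likely reason for per-machine differences: in Python 3, `open(path)` without `encoding=` typically falls back to the locale's preferred encoding, which can be ASCII on a server with a bare `C` locale. The traceback above also points into `/usr/local/lib/python2.7`, which suggests the script may be running under Python 2 on that machine. A small diagnostic sketch to compare machines; `encoding_report` is a hypothetical helper:

```python
import locale
import sys

def encoding_report():
    """Collect the settings that decide how text is decoded by default."""
    return {
        "python": "%d.%d" % sys.version_info[:2],
        # Usually what open() uses when no encoding= argument is given.
        "preferred": locale.getpreferredencoding(False),
        # Used when printing to the terminal; may be None if stdout is redirected.
        "stdout": getattr(sys.stdout, "encoding", None),
        "filesystem": sys.getfilesystemencoding(),
    }

for key, value in encoding_report().items():
    print(key, "=", value)
```

Running this on both machines and comparing the output shows whether the interpreter version or the locale differs.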

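Providing an actual example that reproduces the problem you are trying to solve makes it easier to help. See [How to Ask](https://stackoverflow.com/help/how-to-ask) and [How to create a Minimal, Complete, and Verifiable example](https://stackoverflow.com/help/mcve).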

Answers

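ASCII characters are the code points 0-127 (0-255 is the Latin-1 range), so keep only characters below 128:

def remove_non_ascii(text):
    # Keep only characters whose code point is in the ASCII range (0-127).
    return "".join(character for character in text if ord(character) < 128)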
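
Another way to drop non-ASCII characters in Python 3 is to let the codec do the filtering. ASCII covers code points 0-127, so encoding to `ascii` with `errors="ignore"` discards everything else. A minimal sketch, assuming the input is already a `str`; `strip_non_ascii` is a hypothetical name:

```python
def strip_non_ascii(text):
    """Drop every character outside the ASCII range (0-127)."""
    # Encoding with errors="ignore" skips characters ASCII cannot represent;
    # decoding afterwards turns the remaining bytes back into str.
    return text.encode("ascii", errors="ignore").decode("ascii")

# "\u00ef" is i with diaeresis, "\u2013" is an en dash; both are removed.
print(strip_non_ascii("na\u00efve \u2013 ok"))
```

If the data arrives as `bytes`, it has to be decoded first (for example with `raw.decode("utf-8", errors="ignore")`) before a function like this can be applied.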