2015-11-04 54 views
-1

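Removing non-ASCII characters from a file using Python

I have a scenario where a log file sent for analysis contains some non-ASCII characters, which end up breaking one of the analysis tools that I have no control over. So I decided to clean the log myself and came up with the function below. It does the job, except that I skip the whole line whenever I see one of these characters. I also tried going through the line character by character (see the commented-out code) so that only those characters are removed and the actual ASCII characters are kept, but I could not get it to work. Is there a reason why the commented-out logic fails, and any suggestion or solution for fixing it?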

Sample line:

1:02:54.934/174573 ENQÎNULSUBáyNULEOT/29/abcdefghijg

Function to read the file and skip the offending lines:

def readlogfile(self, abs_file_name):
    """
    Reads and skip the non-ascii chars line from the attached log file and populate the list self.data_bytes
    abs_file_name file name should be absolute path
    """
    try:
        infile = open(abs_file_name, 'rb')
        for line in infile:
            try:
                line.decode('ascii')
                self._data_bytes.append(line)
            except UnicodeDecodeError as e:
                # print line + "Invalid line skipped in " + abs_file_name
                print line
                continue
            # while 1: #code that didn't work to remove just the non-ascii chars
            #     char = infile.read(1)   # read characters from file
            #     if not char or ord(char) > 127 or ord(char) < 0:
            #         continue
            #     else:
            #         sys.stdout.write(char)
            #         #sys.stdout.write('{}'.format(ord(char)))
            #         #print "%s ord = %d" % (char, ord(char))
            #         self._data_bytes.append(char)
    finally:
        infile.close()

This person's original code should work for you: http://stackoverflow.com/questions/33511317/removing-non-ascii-characters-from-file-text/33511747#33511747

Answers

1

decode takes a second argument that tells it how to handle bad characters. https://docs.python.org/2/library/stdtypes.html#string-methods

Try this:

>>> "1:02:54.934/174573ENQÎNULSUBáyNULEOT/29/abcdefghijg".decode("ascii", "ignore")
u'1:02:54.934/174573ENQNULSUByNULEOT/29/abcdefghijg'
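To make the effect of that second argument more concrete, here is a small sketch comparing three of the standard error handlers. It is written for Python 3 (where the input has to be a bytes object), and the `raw` value is a made-up sample that assumes the log is Latin-1 encoded, not the real log line.

```python
# Python 3 sketch; `raw` is a hypothetical sample, assumed Latin-1 encoded.
raw = "02:54 ENQ\u00ceNUL\u00e1y/29".encode("latin-1")

# "strict" (the default) raises on the first byte above 0x7F.
try:
    raw.decode("ascii")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# "ignore" silently drops every undecodable byte.
ignored = raw.decode("ascii", "ignore")

# "replace" leaves a visible U+FFFD marker where each bad byte was.
replaced = raw.decode("ascii", "replace")

print(strict_failed)
print(ignored)
print(replaced)
```

"ignore" is the one that matches the question; "replace" can be handy while debugging because it shows where something was removed.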

Your code can then be simplified to something like this:

def readlogfile(self, abs_file_name):
    """
    Reads the attached log file, strips the non-ascii chars from each line and populates the list self._data_bytes
    abs_file_name file name should be absolute path
    """
    with open(abs_file_name, 'rb') as infile:
        for line in infile:
            # "ignore" drops undecodable bytes instead of raising
            self._data_bytes.append(line.decode("ascii", "ignore"))
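For anyone who wants to try this outside the class, a standalone sketch of the same idea follows. The function name `read_ascii_lines` and the temporary file are made up for illustration, and the code targets Python 3. It encodes back to ASCII after decoding so the caller still gets bytes, which may or may not be what your downstream tool expects.

```python
import os
import tempfile


def read_ascii_lines(abs_file_name):
    """Return the file's lines as bytes with every non-ASCII byte removed."""
    cleaned = []
    with open(abs_file_name, "rb") as infile:
        for line in infile:
            # Decode with "ignore" to drop bytes above 0x7F, then encode
            # back so the caller still receives bytes.
            cleaned.append(line.decode("ascii", "ignore").encode("ascii"))
    return cleaned


# Usage with a throwaway file holding one clean and one dirty line.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"plain line\nbad \xce\xe1 line\n")
    path = tmp.name
try:
    lines = read_ascii_lines(path)
finally:
    os.remove(path)
print(lines)
```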

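Can you suggest how to copy the actual text with the special characters? I believe some other characters were lost when I copied it, and the parser is still breaking. @Dave_750 – Guruprasad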


You could also try line.decode("ascii", "ignore").encode("ascii") if it is still being picky.
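One possible reason the parser still breaks, and this is only a guess from the sample line: ENQ, NUL, SUB and EOT are themselves ASCII control characters (0x05, 0x00, 0x1A, 0x04). If they appear as raw bytes in the log, an ASCII decode with "ignore" leaves them in place. The Python 3 sketch below deletes them as well. The name `strip_unprintable`, the choice to keep tab, line feed and carriage return, and the sample bytes are all assumptions for illustration.

```python
# Bytes to delete: everything above 0x7F, DEL, and the ASCII control range,
# except tab, line feed and carriage return (an assumed whitelist).
KEEP_CONTROLS = b"\t\n\r"
DELETE = bytes(
    b for b in range(256)
    if b > 0x7F or b == 0x7F or (b < 0x20 and b not in KEEP_CONTROLS)
)


def strip_unprintable(line):
    """Remove non-ASCII bytes and most control bytes from a bytes line."""
    # With table=None, bytes.translate only deletes the listed bytes.
    return line.translate(None, DELETE)


# Hypothetical line with raw ENQ, NUL, SUB, EOT and two Latin-1 bytes.
sample = b"174573 \x05\xce\x00\x1a\xe1y\x00\x04/29/abcdefghijg\n"
print(strip_unprintable(sample))
```

If the parser is happy after this, the control bytes were the culprit rather than the non-ASCII ones.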

0

I think this is a reasonable way to handle the offending lines on a character-by-character basis:

import codecs

class MyClass(object):
    def __init__(self):
        self._data_bytes = []

    def readlogfile(self, abs_file_name):
        """
        Reads and skips the non-ascii chars line from the attached log file and
        populate the list self.data_bytes abs_file_name file name should be
        absolute path
        """
        with codecs.open(abs_file_name, 'r', encoding='utf-8') as infile:
            for line in infile:
                try:
                    line.decode('ascii')
                except UnicodeError as e:
                    ascii_chars = []
                    for char in line:
                        try:
                            char.decode('ascii')
                        except UnicodeError as e2:
                            continue # ignore non-ascii characters
                        else:
                            ascii_chars.append(char)
                    line = ''.join(ascii_chars)
                self._data_bytes.append(str(line))
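As for why the commented-out loop in the question did not work, one thing stands out: `infile.read(1)` returns an empty string at end of file, and the `if not char or ...: continue` branch then loops forever instead of stopping. Mixing `infile.read(1)` with `for line in infile` on the same file object is also unreliable, since iteration reads ahead into a buffer. Below is a minimal Python 3 sketch of a read-one-byte-at-a-time loop that does terminate. The name `ascii_only` is made up, and `io.BytesIO` stands in for a file opened in 'rb' mode.

```python
import io


def ascii_only(stream):
    """Read a binary stream one byte at a time, keeping bytes 0..127."""
    kept = []
    while True:
        char = stream.read(1)
        if not char:
            # Empty read means end of file. The commented-out loop used
            # `continue` here, so it could never terminate.
            break
        if ord(char) > 127:
            continue  # drop the non-ASCII byte
        kept.append(char)
    return b"".join(kept)


# io.BytesIO stands in for a file opened with mode "rb".
demo = io.BytesIO(b"ENQ\xceNUL\xe1y\n")
print(ascii_only(demo))
```

Reading byte by byte is slow on large logs, so the per-line decode with "ignore" is probably still the better default.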