Removing special characters in Python

I am trying to remove special characters from a log file. Here are two sample lines:

2016.04.03 23:54:28.257;:;213.210.213.316;:;PDL3_SGW2;:;5F6DBA-093E-0D4D9C-00000001-01;:;userId;:;;:;1000;:;http://live.skysat.tv/cmdc/services?region=253&lang=swe&count=250&sort=%2blogicalChannelNumber;:;101;:;0;:;250;:;;:; 
2016.04.03 23:54:28.258;:;781.69.243.363;:;PDL3_SGW2;:;;:;userId;:;;:;1001;:;http://live.skysat.tv/cmdc/services?region=253&lang=swe&count=250&sort=%2blogicalChannelNumber;:;101;:;0;:;1;:;0x40001;:;Invalid credentials

This is the output after removing the special characters:

2016.04.03 23 54 48.957 213.210.213.316 PDL3_SGW2 5F6DB03A 093E 0D414D9C 1 1 userId 1000 http live.skysat.tv cmdc services region 25351 lang swe count 250 sort 2blogicalChannelNumber 101 0 250                  

2016.04.03 23 54 48.958 781.69.243.363 PDL3_SGW2 userId 1001 http live.skysat.tv cmdc services region 25351 lang swe count 250 sort 2blogicalChannelNumber 101 0 1 0xDC40001 Invalid credentials

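As you can see in the second line of the output, "userId" ends up in column [6] instead of column [11], because the data for columns [06] to [10] is missing from the log file. I want to handle this and write out all columns, even when the log file has no data for them.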

The output should look like this:

2016.04.03 23 54 48.957 213.210.213.316 PDL3_SGW2 5F6DB03A 093E 0D414D9C 1 1 userId 1000 http live.skysat.tv cmdc services region 25351 lang swe count 250 sort 2blogicalChannelNumber 101 0 250                  

2016.04.03 23 54 48.958 781.69.243.363 PDL3_SGW2           userId 1001 http live.skysat.tv cmdc services region 25351 lang swe count 250 sort 2blogicalChannelNumber 101 0 1 0xDC40001 Invalid credentials

Here is part of my code:

new_str = re.sub(r'[- - [ "/: ; & ? = % ~ + \n \]]', ' ', line) 
text = new_str.rstrip().split() 
writer.writerow(text)


2016-05-11 jingle_maria

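It looks like you are using ';:;' as the column separator. If so, you should use 'split()' to break the string into fields, and then use 'str.join()' or 'str.format()' to format your output.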

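@AustinHastings, thank you very much for your feedback. I am not using ;:; as a separator, because the file has no specific separator. I build the whole string from scratch. That is why I use re.sub to replace all special characters with ' ' and then split the result.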

This works for the two lines you posted:

import re 

lines = ["2016.04.03 23:54:28.257;:;213.210.213.316;:;PDL3_SGW2;:;5F6DBA-093E-0D4D9C-00000001-01;:;userId;:;;:;1000;:;http://live.skysat.tv/cmdc/services?region=253&lang=swe&count=250&sort=%2blogicalChannelNumber;:;101;:;0;:;250;:;;:;", 
     "2016.04.03 23:54:28.258;:;781.69.243.363;:;PDL3_SGW2;:;;:;userId;:;;:;1001;:;http://live.skysat.tv/cmdc/services?region=253&lang=swe&count=250&sort=%2blogicalChannelNumber;:;101;:;0;:;1;:;0x40001;:;Invalid credentials"] 

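def adjust_columns(list_of_lines): 
    # Pad every entry to the width of the widest entry in its column.
    widest = [max(len(el) for el in column) for column in zip(*list_of_lines)] 
    return [ " ".join("{{:<{}s}}".format(widest[i]).format(e) 
      for i,e in enumerate(line)) for line in list_of_lines ] 

r = re.compile('[ /:;&?=%~+-]') 
# Split each line into fields on ';:;', then each field into subfields.
list_of_lines = [[r.split(el) for el in line.split(';:;')] for line in lines] 
# Align the subfields of a column only if every row has the same number of them;
# otherwise just join the subfields with single spaces.
list_of_columns = [ adjust_columns(col) 
        if all(len(el) == len(col[0]) for el in col) 
        else [" ".join(el) for el in col] 
        for col in zip(*list_of_lines) ] 
# Transpose back to rows and align the columns themselves.
text = "\n".join(adjust_columns(list(zip(*list_of_columns)))) 
print(text)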

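This assumes that ;:; is always the field separator. The code splits each line into fields, then splits each field again on the special characters. If every field in a column contains the same number of special characters, the subfields in that column are padded to a common width and joined with spaces. The last step adjusts the width of each column.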

One problem may be that you cannot process the input line by line, because you have to find the longest entry for each column.
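If the log is too large to hold in memory, one possible workaround is to read the file twice: once to measure the column widths and once to write the padded rows. The following is only a rough sketch under the same assumption that ;:; separates the fields. The names clean_fields and align_file and the file paths are hypothetical, and unlike the code above it collapses runs of special characters into a single space.

```python
import re

# Runs of these characters inside a field are replaced by one space (an assumption).
SPECIAL = re.compile(r"[ /:;&?=%~+-]+")

def clean_fields(line):
    # Split on the field separator first so that empty fields are kept,
    # then clean the special characters inside each field.
    return [SPECIAL.sub(" ", f).strip() for f in line.rstrip("\n").split(";:;")]

def align_file(src_path, dst_path):
    # Pass 1: record the widest entry seen in every column.
    widths = []
    with open(src_path) as src:
        for line in src:
            for i, field in enumerate(clean_fields(line)):
                if i == len(widths):
                    widths.append(0)
                widths[i] = max(widths[i], len(field))
    # Pass 2: write every row padded to those widths.
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            padded = (f.ljust(w) for f, w in zip(clean_fields(line), widths))
            dst.write(" ".join(padded).rstrip() + "\n")
```

Only the list of column widths is kept between the two passes, so memory use does not grow with the number of lines.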

If you do not need the subfields to be aligned (as in your example), you can use this simpler code:

r = re.compile('[ /:;&?=%~+-]') 
list_of_lines = [[" ".join(r.split(el)) for el in line.split(';:;')] for line in lines] 
text = "\n".join(adjust_columns(list_of_lines))


2016-05-11 21:59:13 tim

Thanks! I will try to integrate and test it now.

>>> from pprint import pprint

Let's simulate the data file using a list of strings...

>>> lines = [ 
    '2016.04.03 23:54:28.257;:;213.210.213.316;:;PDL3_SGW2;:;5F6DBA-093E-0D4D9C-00000001-01;:;userId;:;;:;1000;:;http://live.skysat.tv/cmdc/services?region=253&lang=swe&count=250&sort=%2blogicalChannelNumber;:;101;:;0;:;250;:;;:;', 
    '2016.04.03 23:54:28.258;:;781.69.243.363;:;PDL3_SGW2;:;;:;userId;:;;:;1001;:;http://live.skysat.tv/cmdc/services?region=253&lang=swe&count=250&sort=%2blogicalChannelNumber;:;101;:;0;:;1;:;0x40001;:;Invalid credentials']

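According to the official documentation, you can use the string method S.split(sep) to return a list of the words in S, using sep as the delimiter string (emphasis mine).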
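To illustrate why this matters here, a small sketch: with an explicit separator, split() keeps an empty field as '', whereas replacing every special character with a space and then splitting on whitespace silently drops it. The shortened record below is made up for illustration.

```python
import re

# Hypothetical shortened record: the second field is empty.
line = "PDL3_SGW2;:;;:;userId;:;1001"

# Explicit separator: the empty field survives as ''.
fields = line.split(";:;")
print(fields)     # ['PDL3_SGW2', '', 'userId', '1001']

# Special characters replaced by spaces, then split on whitespace:
# the empty field disappears and the later columns shift left.
collapsed = re.sub(r"[;:]", " ", line).split()
print(collapsed)  # ['PDL3_SGW2', 'userId', '1001']
```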

In your case the delimiter is the string ';:;', so you can now do

>>> data = [line.split(';:;') for line in lines]

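data is a list of lists; each sublist holds the fields of one line, and the fields that are missing from your file appear as empty strings.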

>>> pprint(data) 
[['2016.04.03 23:54:28.257', 
    '213.210.213.316', 
    'PDL3_SGW2', 
    '5F6DBA-093E-0D4D9C-00000001-01', 
    'userId', 
    '', 
    '1000', 
    'http://live.skysat.tv/cmdc/services?region=253&lang=swe&count=250&sort=%2blogicalChannelNumber', 
    '101', 
    '0', 
    '250', 
    '', 
    ''], 
['2016.04.03 23:54:28.258', 
    '781.69.243.363', 
    'PDL3_SGW2', 
    '', 
    'userId', 
    '', 
    '1001', 
    'http://live.skysat.tv/cmdc/services?region=253&lang=swe&count=250&sort=%2blogicalChannelNumber', 
    '101', 
    '0', 
    '1', 
    '0x40001', 
    'Invalid credentials']]

Now you can loop over data and output each group of fields in whatever way you like,

>>> for record in data: output(record) 
>>>

and that's all.

P.S. output() is a function that you have to define yourself, according to your needs.
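For instance, output() could clean each field separately and pass the resulting fixed-length list to a csv.writer, so that a missing field becomes an empty cell instead of shifting the later columns. This is just one possible sketch; the regular expression, the helper clean() and the choice of sys.stdout as the destination are assumptions, not part of the question.

```python
import csv
import re
import sys

# Runs of these characters inside a field are replaced by one space (an assumption).
SPECIAL = re.compile(r"[ /:;&?=%~+-]+")
writer = csv.writer(sys.stdout)

def clean(record):
    # Clean each field on its own; empty fields stay as '' and keep their position.
    return [SPECIAL.sub(" ", field).strip() for field in record]

def output(record):
    # One CSV row per record, always with the same number of cells.
    writer.writerow(clean(record))

# Shortened version of the second sample line, split on ';:;' as above.
record = "2016.04.03 23:54:28.258;:;781.69.243.363;:;PDL3_SGW2;:;;:;userId".split(";:;")
output(record)
```

Because the split on ';:;' happens before any cleaning, the fourth cell stays empty and "userId" remains in the fifth position.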


2016-05-11 22:13:18 gboffi

Thank you for your time and effort, but I still do not get the result. Please look at the output in my question. I need to split the whole string, not just on ;:;.

You can loop over data and output each group of fields in whatever way you like, e.g. '>>> for record in data: output(record)' – gboffi
