Using the re module, build a regular expression:
import re

regex = r"""
(                # Capture in group #1
    "[\w\s]+"    # Three sequences of quoted letters and white space characters
    \s+          # each followed by one or more white space characters
    "[\w\s]+"
    \s+
    "[\w\s]+"
    \s+
)
"(\d{10,})"      # Match a quoted run of at least 10 digits into group #2
(\^\^\s+\.\s+)   # Match two literal circumflex characters, whitespace and a period
                 # into group #3 (unescaped, ^ would be a start-of-line anchor)
(.*)             # Followed by anything at all into group #4
"""
COMPILED_REGEX = re.compile(regex, re.VERBOSE)
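To see what each group captures, the pattern can be tried on a single line. This is only a sketch: the sample line is made up for illustration (the real input format is an assumption), and the pattern is written compactly, without re.VERBOSE and with the circumflexes escaped, so the snippet runs on its own.

```python
import re

# Compact form of the pattern described above (no re.VERBOSE).
PATTERN = re.compile(
    r'("[\w\s]+"\s+"[\w\s]+"\s+"[\w\s]+"\s+)'  # group 1: three quoted fields
    r'"(\d{10,})"'                              # group 2: the quoted number
    r'(\^\^\s+\.\s+)'                           # group 3: ^^, whitespace, period
    r'(.*)'                                     # group 4: rest of the line
)

# Hypothetical sample line; the real data may look different.
sample = '"alpha one" "beta" "gamma" "3000000000"^^ . trailing text'

match = PATTERN.search(sample)
if match:
    for index in range(1, 5):
        print(index, repr(match.group(index)))
```

If the loop prints nothing, the line did not match, which usually means the assumed layout of the data is off.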
Next, we need to define a callback function to handle the replacement, since re.RegexObject.sub (re.Pattern.sub in Python 3) accepts a callback:
def replace_callback(matches):
    full_line = matches.group(0)
    number_text = matches.group(2)
    number_of_interest = int(number_text, base=10)
    if number_of_interest > 2147483647:
        # str.replace needs strings, so replace the digit text, not the int
        return full_line.replace(number_text, number_text[:3])
    else:
        return full_line
Then find and replace:
fixed_data = COMPILED_REGEX.sub(replace_callback, YOUR_DATA)
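One thing to watch: str.replace in the callback substitutes every occurrence of the digits inside the matched text, so a quoted field that happens to contain the same digits would be shortened too. A minimal sketch of a variant that rebuilds the line from the captured groups and so only touches group 2 is shown here. The name replace_only_number and the input line are made up for illustration, and the pattern is the compact, escaped form of the one above.

```python
import re

PATTERN = re.compile(
    r'("[\w\s]+"\s+"[\w\s]+"\s+"[\w\s]+"\s+)"(\d{10,})"(\^\^\s+\.\s+)(.*)'
)

INT32_MAX = 2147483647


def replace_only_number(match):
    """Hypothetical variant of replace_callback that edits only group 2."""
    prefix, number_text, separator, rest = match.groups()
    if int(number_text) > INT32_MAX:
        number_text = number_text[:3]  # same truncation as in the callback above
    return f'{prefix}"{number_text}"{separator}{rest}'


# Made-up input: the first field repeats the digits of the number.
line = '"3000000000" "b" "c" "3000000000"^^ . rest'
print(PATTERN.sub(replace_only_number, line))
```

Here only the fourth quoted value is shortened; the first field keeps its digits.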
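If you have a terabyte of data, you probably don't want to do this in memory. Instead, open the file and iterate over it, replacing the data line by line and writing it out to another file (there are undoubtedly ways to speed this up, but they would make the gist of the technique harder to follow):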
# Given the above
def process_data():
    with open("path/to/your/file") as data_file, \
            open("path/to/output/file", "w") as output_file:
        for line in data_file:
            fixed_data = COMPILED_REGEX.sub(replace_callback, line)
            output_file.write(fixed_data)
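To try the streaming approach without touching real data, it can be run against a temporary file. In this sketch the function takes the paths as parameters (an assumption; the version above hard-codes them), and process_file, shorten and the two sample lines are hypothetical names and data chosen for the demonstration.

```python
import os
import re
import tempfile

PATTERN = re.compile(
    r'("[\w\s]+"\s+"[\w\s]+"\s+"[\w\s]+"\s+)"(\d{10,})"(\^\^\s+\.\s+)(.*)'
)


def shorten(match):
    """Keep the first three digits of numbers above the 32-bit signed maximum."""
    number_text = match.group(2)
    if int(number_text) > 2147483647:
        return match.group(0).replace(number_text, number_text[:3])
    return match.group(0)


def process_file(input_path, output_path):
    """Stream input_path to output_path one line at a time."""
    with open(input_path) as data_file, open(output_path, "w") as output_file:
        for line in data_file:
            output_file.write(PATTERN.sub(shorten, line))


with tempfile.TemporaryDirectory() as workdir:
    source = os.path.join(workdir, "in.txt")
    target = os.path.join(workdir, "out.txt")
    with open(source, "w") as handle:
        handle.write('"a" "b" "c" "3000000000"^^ . big\n')    # above the limit
        handle.write('"a" "b" "c" "1000000000"^^ . small\n')  # below the limit
    process_file(source, target)
    with open(target) as handle:
        result = handle.read()

print(result)
```

Because only one line is held in memory at a time, the same loop works for files far larger than available RAM.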