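(Python) How to reduce the running time when parsing a large dataset
I need to read, parse, and merge two huge text files as input, and then write the result to a new file.
There is also one additional file that is used for the parsing.
Briefly, each of the two text files has about 100 million lines and 3 columns.
First, I read the two files and, for each key, write the matching values from both files to the new file.
If a key has no matching value in one of the input files, 0.0 is inserted in that position of the row.
To make the parsing more efficient, I created another input file, which is the union of the first column (the keys) of the two text files, as shown below.
I tested this code with small input files (10,000 lines) and it worked well. I started running it on the huge dataset two days ago, but unfortunately it is still running.
How can I reduce the running time and parse the files efficiently?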
1st_infile.txt
MARCH2_MARCH2 2.3 0.1
MARCH2_MARC2 -0.2 0
MARCH2_MARCH5 -0.3 0.3
MARCH2_MARCH6 -1.4 0
MARCH2_MARCH7 0.1 0
MARCH2_SEPT2 -1.0 0
MARCH2_SEPT4 0.8 0
2nd_infile.txt
MARCH2_MARCH2 2.2 0
MARCH2_MARCH2.1 0.2 0
MARCH2_MARCH3 -0.4 0
MARCH2_MARCH5 -0.3 0
MARCH2_MARCH6 -0.6 0
MARCH2_MARCH7 1.2 0
MARCH2_SEPT2 0.2 0
union_file.txt
MARCH2_MARCH2
MARCH2_MARCH2.1
MARCH2_MARC2
MARCH2_MARCH5
MARCH2_MARCH6
MARCH2_MARCH7
MARCH2_SEPT2
MARCH2_SEPT4
MARCH2_MARCH3
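For reference, a key list like the one above can be derived by taking the set union of the first columns of the two inputs. This is only a minimal sketch: the helper names `read_keys` and `write_union` are hypothetical, and the output is sorted, which is an assumption and does not reproduce the row order shown above. It also keeps every key in memory, so at 100 million lines it illustrates the idea rather than a tuned solution.

```python
def read_keys(path):
    """Return the set of first-column keys in a whitespace-separated file."""
    keys = set()
    with open(path) as f:
        for line in f:
            parts = line.split()
            if parts:  # ignore blank lines
                keys.add(parts[0])
    return keys


def write_union(path_a, path_b, out_path):
    """Write every key that occurs in either input, one per line, sorted."""
    union = read_keys(path_a) | read_keys(path_b)
    with open(out_path, 'w') as out:
        for key in sorted(union):  # sorted order is an assumption
            out.write(key + '\n')
    return len(union)
```

Calling `write_union('1st_infile.txt', '2nd_infile.txt', 'union_file.txt')` would write the combined key list and return the number of distinct keys.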
Outfile.txt
MARCH2_MARCH2 2.3 0.1 2.2 0
MARCH2_MARCH2.1 0.0 0.0 0.2 0
MARCH2_MARC2 -0.2 0 0.0 0.0
MARCH2_MARCH5 -0.3 0.3 -0.3 0
MARCH2_MARCH6 -1.4 0 -0.6 0
MARCH2_MARCH7 0.1 0 1.2 0
MARCH2_SEPT2 -1.0 0 0.2 0
MARCH2_SEPT4 0.8 0 0.0 0.0
MARCH2_MARCH3 0.0 0.0 -0.4 0
Python.py
def load(filename):
    """Read 'name value1 value2' lines into a dict keyed by name."""
    ret = {}
    with open(filename) as f:
        for lineno, line in enumerate(f, 1):
            try:
                name, value1, value2 = line.split()
            except ValueError:
                # Report and skip lines that do not have exactly three columns.
                print('Skip invalid line {}:{} {!r}'.format(filename, lineno, line))
                continue
            ret[name] = value1, value2
    return ret

# Both input files are held entirely in memory as dicts.
a, b = load('1st_infile.txt'), load('2nd_infile.txt')
missing = ('0.0', '0.0')  # written when a key is absent from one input
with open('union_file.txt') as f, open('Outfile.txt', 'w') as fout:
    for line in f:
        name = line.strip()
        fout.write('{0:<20} {1[0]:>5} {1[1]:>5} {2[0]:>5} {2[1]:>5}\n'.format(
            name,
            a.get(name, missing),
            b.get(name, missing)
        ))
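
One direction that might reduce the running time is to avoid holding both files in memory at once, since two dicts of roughly 100 million entries each can exceed the available RAM and force the machine to swap. The following is a minimal sketch of a streaming merge, not a tested solution for the real data. It assumes that both inputs have already been sorted by their first column in plain byte order (for ASCII keys this matches Python's string comparison; something like `LC_ALL=C sort -k1,1` could produce it), and that each key appears at most once per file. The names `records` and `merge_sorted` are hypothetical. Under these assumptions the union file is not needed, and the output comes out in sorted key order rather than the order shown in Outfile.txt.

```python
def records(path):
    """Yield (name, (value1, value2)) for each valid three-column line."""
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) == 3:  # silently skip malformed lines
                yield parts[0], (parts[1], parts[2])


def merge_sorted(path_a, path_b, out_path, missing=('0.0', '0.0')):
    """Join two key-sorted files in one pass, one row of each in memory."""
    fmt = '{0:<20} {1[0]:>5} {1[1]:>5} {2[0]:>5} {2[1]:>5}\n'
    it_a, it_b = records(path_a), records(path_b)
    rec_a, rec_b = next(it_a, None), next(it_b, None)
    with open(out_path, 'w') as out:
        while rec_a is not None or rec_b is not None:
            if rec_b is None or (rec_a is not None and rec_a[0] < rec_b[0]):
                # Key occurs only in the first file.
                out.write(fmt.format(rec_a[0], rec_a[1], missing))
                rec_a = next(it_a, None)
            elif rec_a is None or rec_b[0] < rec_a[0]:
                # Key occurs only in the second file.
                out.write(fmt.format(rec_b[0], missing, rec_b[1]))
                rec_b = next(it_b, None)
            else:
                # Same key in both files: write both value pairs.
                out.write(fmt.format(rec_a[0], rec_a[1], rec_b[1]))
                rec_a, rec_b = next(it_a, None), next(it_b, None)
```

With sorted copies of the inputs, `merge_sorted('1st_sorted.txt', '2nd_sorted.txt', 'Outfile.txt')` would write the same row format as the script above; the sorted file names here are placeholders.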