Efficiency of comparing two dictionaries

The following program takes about 22 hours to run on two text files (~10 MB each, roughly 100K lines per file). Can someone point out where my code is inefficient, or suggest a faster approach? The input dictionaries are ordered, and maintaining that order is necessary:
import collections

def uniq(input):
    output = []
    for x in input:
        if x not in output:
            output.append(x)
    return output

Su = {}
with open('Sucrose_rivacombined.txt') as f:
    for line in f:
        (key, val) = line.split('\t')
        Su[key] = val
Su_OD = collections.OrderedDict(Su)
Su_keys = Su_OD.keys()

Et = {}
with open('Ethanol_rivacombined.txt') as g:
    for line in g:
        (key, val) = line.split('\t')
        Et[key] = val
Et_OD = collections.OrderedDict(Et)
Et_keys = Et_OD.keys()

merged_keys = Su_keys + Et_keys
merged_keys = uniq(merged_keys)

d3 = collections.OrderedDict()
output_doc = open("compare.txt", "w+")

for chr_local in merged_keys:
    line_output = chr_local
    if Et.has_key(chr_local):
        line_output = line_output + "\t" + Et[chr_local]
    else:
        line_output = line_output + "\t" + "ND"
    if Su.has_key(chr_local):
        line_output = line_output + "\t" + Su[chr_local]
    else:
        line_output = line_output + "\t" + "ND"
    output_doc.write(line_output + "\n")
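(A note on where the 22 hours likely go: 'x not in output' scans a plain list on every iteration, so 'uniq' does O(n²) comparisons over the ~200K concatenated keys. A minimal order-preserving de-duplication that keeps a set alongside the list for O(1) average membership tests could look like this; 'uniq_fast' and 'seen' are just illustrative names:)

def uniq_fast(seq):
    # A set gives O(1) average-case membership tests, versus the
    # O(n) list scan inside uniq() above; the list preserves order.
    seen = set()
    output = []
    for x in seq:
        if x not in seen:
            seen.add(x)
            output.append(x)
    return output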
The input files look like this (not every key is present in both files):
Su:
chr1:3266359 80.64516129
chr1:3409983 100
chr1:3837894 75.70093458
chr1:3967565 100
chr1:3977957 100
Et:
chr1:3266359 95
chr1:3456683 78
chr1:3837894 54.93395855
chr1:3967565 100
chr1:3976722 23
I would like the output to look like this:
chr1:3266359 80.645 95
chr1:3456683 ND 78
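(For reference, a sketch of the whole merge that builds each OrderedDict directly while reading, so insertion order matches file order, and de-duplicates keys with a set. It targets Python 2.7 to match the has_key calls above, writes the Su column before the Et column as in the sample output, and 'read_ordered' is a hypothetical helper name:)

import collections

def read_ordered(path):
    # Build the OrderedDict while reading so insertion order matches
    # file order; rstrip removes the trailing newline from val.
    d = collections.OrderedDict()
    with open(path) as f:
        for line in f:
            key, val = line.rstrip('\n').split('\t')
            d[key] = val
    return d

Su = read_ordered('Sucrose_rivacombined.txt')
Et = read_ordered('Ethanol_rivacombined.txt')

# Concatenate the key lists and de-duplicate with a set (O(1) lookups).
seen = set()
merged_keys = []
for k in Su.keys() + Et.keys():
    if k not in seen:
        seen.add(k)
        merged_keys.append(k)

with open('compare.txt', 'w') as out:
    for k in merged_keys:
        # dict.get supplies "ND" when a key is missing from one file.
        out.write('%s\t%s\t%s\n' % (k, Su.get(k, 'ND'), Et.get(k, 'ND')))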
Why not profile it on smaller inputs and see where the time is being spent? – NPE
I'm not sure how to do that. I ran it before on files half this size and it only took about 3 hours. CPU usage was 25% and RAM only 1.6 GB with about 6 GB free, so it isn't straining the machine's resources. I just want to know whether I coded something incorrectly that makes it keep re-reading the files unnecessarily. – jobrant
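(For anyone in the same spot, one way to profile under Python 2 without modifying the script, assuming it is saved as compare_script.py; the filename is just a placeholder:)

import cProfile

# Run the script under the profiler and print a report sorted by
# cumulative time, showing which calls dominate the runtime.
cProfile.run("execfile('compare_script.py')", sort='cumulative')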
Have you verified that this actually works? Because 'Su' is a regular dictionary, the file's ordering is already lost by the time you convert it to 'Su_OD'. You probably want to create the ordered dictionaries up front? –
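(A small demonstration of that ordering point, under Python 2 where a plain dict does not remember insertion order; the keys are taken from the sample data above:)

import collections

keys = ['chr1:3967565', 'chr1:3266359', 'chr1:3837894']

d = {}
for k in keys:
    d[k] = 'x'
# The OrderedDict inherits whatever arbitrary order the plain dict
# happens to iterate in, not the order the keys were inserted.
print collections.OrderedDict(d).keys()

od = collections.OrderedDict()
for k in keys:
    od[k] = 'x'
# Building the OrderedDict directly preserves insertion (file) order.
print od.keys()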