2014-02-25 39 views
-1

我有一個包含大約177071007項的列表。 我試圖執行以下操作 a)獲取列表中唯一項目的第一個和最後一個發生。 b)發生的次數。蟒蛇中的巨大列表的時間複雜度2.7

def parse_data(file, op_file_test): 
    ins = csv.reader(open(file, 'rb'), delimiter = '\t') 
    pc = list() 
    rd = list() 
    deltas = list() 
    reoccurance = list() 
    try: 
     for row in ins: 
      pc.append(int(row[0])) 
      rd.append(int(row[1])) 
    except: 
     print row 
     pass 

    unique_pc = set(pc) 
    unique_pc = list(unique_pc) 
    print "closing file" 

    #takes a long time from here! 
    for a in range(0, len(unique_pc)): 
     index_first_occurance = pc.index(unique_pc[a]) 
     index_last_occurance = len(pc) - 1 - pc[::-1].index(unique_pc[a]) 
     delta_rd = rd[index_last_occurance] - rd[index_first_occurance] 
     deltas.append(int(delta_rd)) 
     reoccurance.append(pc.count(unique_pc[a])) 
     print unique_pc[a] , delta_rd, reoccurance[a] 

    print "printing to file" 
    map_file = open(op_file_test,'a') 
    for a in range(0, len(unique_pc)): 
     print >>map_file, "%d, %d, %d" % (unique_pc[a], deltas[a], reoccurance) 
    map_file.close() 

然而,複雜度是按照O(n)的順序。 是否有可能讓for循環「跑得快」,我的意思是,你認爲屈服會讓它快嗎?或者還有其他方法嗎?不幸的是,我沒有numpy

回答

1

嘗試以下操作:

from collections import defaultdict 

# Keep a dictionary of our rd and pc values, with the value as a list of the line numbers each occurs on 
# e.g. {'10': [1, 45, 79]} 
pc_elements = defaultdict(list) 
rd_elements = defaultdict(list) 

with open(file, 'rb') as f: 
    line_number = 0 
    csvin = csv.reader(f, delimiter='\t') 
    for row in csvin: 
     try: 
      pc_elements[int(row[0])].append(line_number) 
      rd_elements[int(row[1])].append(line_number) 
      line_number += 1 
     except ValueError: 
      print("Not a number") 
      print(row) 
      line_number += 1 
      continue 

for pc, indexes in pc_elements.iteritems(): 
    print("pc {0} appears {1} times. First on row {2}, last on row {3}".format(
     pc, 
     len(indexes), 
     indexes[0], 
     indexes[-1] 
    )) 

這是通過創建一個字典,用pc值作爲重點,並出現作爲值列表中讀取的TSV時。由於字典的性質,密鑰必須是唯一的,因此我們避免使用setlist值僅用於保留髮生密鑰的行。

例子:

pc_elements = {10: [4, 10, 18, 101], 8: [3, 12, 13]} 

將輸出:

"pc 10 appears 4 times. First on row 4, last on row 101" 
"pc 8 appears 3 times. First on row 3, last on row 13" 
+0

如何繼續不同? :) – pistal

+0

@pistal --' continue'移到循環的下一個迭代(不在其後面運行任何代碼),而'pass'將在下一次迭代之前運行任何代碼。一個很好的答案與示例[here。](http://stackoverflow.com/a/9484008/1401034) – Ewan

0

嘗試替換列表的字典,在字典中查找是很多比長列表更快。

這可能是這樣的:

def parse_data(file, op_file_test): 
    ins = csv.reader(open(file, 'rb'), delimiter = '\t') 

    # Dict of pc -> [rd first occurence, rd last occurence, list of occurences] 
    occurences = {} 

    for i in range(0, len(ins)): 
    row = ins[i] 
    try: 
     pc = int(row[0]) 
     rd = int(row[1]) 
    except: 
     print row 
     continue 

    if pc not in occurences: 
     occurences[pc] = [rd, rd, i] 
    else: 
     occurences[pc][1] = rd 
     occurences[pc].append(i) 

    # (Remove the sorted is you don't need them sorted but need them faster) 
    for value in sorted(occurences.keys()): 
    print "value: %d, delta: %d, occurences: %s" % (
     value, occurences[value][1] - occurences[value][0], 
     ", ".join(occurences[value][2:]) 
+0

心靈拍攝的原型? – pistal

0

正如你從掃描輸入文件的項目,把項目變成一個collections.defaultdict(list)其中關鍵的是項目和值是occurence指數列表。讀取文件需要一定的時間,建立這個數據結構和固定的時間來得到一個項目的第一個和最後一個出現索引,並且用一個固定的時間來獲取一個項目的出現次數。

下面是它如何工作

mydict = collections.defaultdict(list) 
for item, index in itemfilereader: # O(n) 
    mydict[item].append(index) 

# first occurrence of item, O(1) 
mydict[item][0] 

# last occurrence of item, O(1) 
mydict[item][-1] 

# number of occurrences of item, O(1) 
len(mydict[item]) 
+0

什麼是你的itemfilereader? – pistal

+0

itemfilereader是你的csvreader加上一些東西,會給你一個項目發生的索引(行號)。記住'枚舉'。 – IceArdor

+0

你是不是指'ins'? – pistal

0

也許這是值得chaning使用的數據結構。我會使用一個字典,它使用pc作爲關鍵字,並將出現的值作爲值。

lookup = dict{} 
counter = 0 
for line in ins: 
    values = lookup.setdefault(int(line[0]),[]) 
    values.append(tuple(counter,int(line[1]))) 
    counter += 1 

for key, val in lookup.iteritems(): 
    value_of_first_occurence = lookup[key][1][1] 
    value_of_last_occurence = lookup[key][-1][1] 
    first_occurence = lookup[key][1][0] 
    last_occurence = lookup[key][-1][0] 
    value = lookup[key][0] 
+0

你能解釋一下代碼的作用嗎? – pistal

+0

對不起,發佈了錯誤的代碼。現在它已更新。我讀取每一行,並將'pc'作爲密鑰存儲在一個字典中,以及包含由csv reader('rd')返回的值以及出現的嵌套列表的列表。 – Denis

+0

它有什麼作用? – pistal