2017-02-15 50 views
0

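More efficient way to do this search algorithm?

I'm wondering whether there is a better way to write this algorithm. I find I need to do this type of operation fairly often, and the way I currently do it takes hours, because I believe it counts as an O(n^2) algorithm. I've attached it below.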

import csv 

with open("location1", 'r') as main: 
    csvMain = csv.reader(main) 
    mainList = list(csvMain) 

with open("location2", 'r') as anno: 
    csvAnno = csv.reader(anno) 
    annoList = list(csvAnno) 

tempList = [] 
output = [] 

for full in mainList: 
    geneName = full[2].lower() 
    for annot in annoList: 
        if geneName == annot[2].lower(): 
            # Build a fresh row for every match; reusing one shared list 
            # would keep growing it and append the same object repeatedly. 
            tempList = list(full) 
            tempList.extend(annot[3:7]) # annotation columns 3 to 6 
            output.append(tempList) 

with open("location3", 'w') as final: 
    a = csv.writer(final, delimiter=',') 
    a.writerows(output) 

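I have two CSV files with 15,000 entries each. I want to compare one column from each file and, where the values match, append the trailing columns of the second CSV's row to the end of the first CSV's row. Any help would be greatly appreciated!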

Thanks!

+0

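Pro: works with the standard library and has no external dependencies. Con: pandas can do this more easily and faster (as described below), both the comparison and the append (I think it would take 3 or 4 lines of code). – Kelvin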

Answers

2

Something like this should be more efficient:

import csv 
from collections import defaultdict 

with open("location1", 'r') as main: 
    csvMain = csv.reader(main) 
    mainList = list(csvMain) 

with open("location2", 'r') as anno: 
    csvAnno = csv.reader(anno) 
    annoList = list(csvAnno) 

output = [] 
annoMap = defaultdict(list) 

for annot in annoList: 
    tempList = annot[3:] # adapt this to the needed columns 
    annoMap[annot[2].lower()].append(tempList) # store these columns in the map under the key column of interest 

for full in mainList: 
    geneName = full[2].lower() 
    if geneName in annoMap: # check if matching column exists 
        output.extend(annoMap[geneName]) 

with open("location3", 'w') as final: 
    a = csv.writer(final, delimiter=',') 
    a.writerows(output) 

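This is more efficient because you only need to iterate over each list once. A dictionary lookup is O(1) on average, so you essentially get a linear algorithm.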
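To make the difference concrete, here is a minimal sketch that runs both strategies on tiny in-memory data and checks that they produce the same rows. The sample data and the helper names (`read_rows`, `join_nested`, `join_indexed`) are made up for illustration, and it assumes the key sits in column 2 with every following column appended, as in the code above.

```python
import csv
import io
from collections import defaultdict

# Hypothetical stand-ins for the two CSV files (made-up data, no header row).
MAIN_CSV = "1,chr1,BRCA1\n2,chr2,tp53\n3,chr3,MYC\n"
ANNO_CSV = "x,y,brca1,repair,0.9\nx,y,TP53,suppressor,0.8\n"


def read_rows(text):
    """Parse CSV text into a list of rows (lists of strings)."""
    return list(csv.reader(io.StringIO(text)))


def join_nested(main_rows, anno_rows, key=2):
    """Quadratic version: scan every annotation row for every main row."""
    out = []
    for full in main_rows:
        for annot in anno_rows:
            if full[key].lower() == annot[key].lower():
                out.append(full + annot[key + 1:])
    return out


def join_indexed(main_rows, anno_rows, key=2):
    """Linear version: index the annotations once, then look each key up."""
    index = defaultdict(list)
    for annot in anno_rows:
        index[annot[key].lower()].append(annot[key + 1:])
    out = []
    for full in main_rows:
        # .get() avoids creating empty entries for keys that have no match
        for extra in index.get(full[key].lower(), []):
            out.append(full + extra)
    return out


main_rows = read_rows(MAIN_CSV)
anno_rows = read_rows(ANNO_CSV)
print(join_indexed(main_rows, anno_rows) == join_nested(main_rows, anno_rows))
```

With n main rows and m annotation rows, the nested version does about n*m comparisons, while the indexed version does about n + m dictionary operations. For 15,000 rows per file that is roughly 225 million comparisons versus 30,000 lookups.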

+2

It might help if you explained *why* your change makes it more efficient. – Paul

+0

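Beautiful! This is almost perfect, although it only writes the mapped values to the file. Combining the *full* variable and *annoMap[geneName]* into one long string is a simple fix. Thanks very much! –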

1

A simple way is to use a library like pandas. Its built-in functions are very efficient.

You can load each CSV into a DataFrame with pandas.read_csv() and then process it with pandas functions.

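For example, you can use pandas.merge() to join the two DataFrames (that is, your two CSV files) on a specific column, and then drop the columns you don't need.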

If you have some database knowledge, the logic here is very similar to a join.
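A rough sketch of that approach, under some assumptions: the question does not say whether the files have header rows, so the column names here (`gene`, `desc`, `score`, and so on), the sample data, and the helper `merge_on_gene` are all invented for illustration. Since the original comparison lower-cases the gene name, the sketch merges on a temporary lower-cased key column.

```python
import io

import pandas as pd

# Hypothetical stand-ins for the two CSV files (made-up data and headers).
MAIN_CSV = "id,chrom,gene\n1,chr1,BRCA1\n2,chr2,tp53\n3,chr3,MYC\n"
ANNO_CSV = "src,ver,gene,desc,score\nx,y,brca1,repair,0.9\nx,y,TP53,suppressor,0.8\n"


def merge_on_gene(main_text, anno_text):
    """Inner-join two CSV texts on the gene column, ignoring case."""
    main = pd.read_csv(io.StringIO(main_text))
    anno = pd.read_csv(io.StringIO(anno_text))

    # Temporary lower-cased key so "BRCA1" and "brca1" are treated as equal.
    main["key"] = main["gene"].str.lower()
    anno["key"] = anno["gene"].str.lower()

    # Keep only the annotation columns we want to append, then join on the key.
    merged = main.merge(anno[["key", "desc", "score"]], on="key", how="inner")

    # Drop the helper column once the join is done.
    return merged.drop(columns="key")


result = merge_on_gene(MAIN_CSV, ANNO_CSV)
print(result)
```

With real files you would pass the file paths to pandas.read_csv() instead of the in-memory strings, and write the result with `result.to_csv("location3", index=False)`. Using `how="inner"` keeps only the rows that have a match, which mirrors the `if geneName in annoMap` check in the other answer.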

0

Thanks for the help, @limes. This is the final script I used; I thought I'd post it to help others. Thanks again!

import csv 
from collections import defaultdict 

with open("location1", 'r') as main: 
    csvMain = csv.reader(main) 
    mainList = list(csvMain) 

with open("location2", 'r') as anno: 
    csvAnno = csv.reader(anno) 
    annoList = list(csvAnno) 

output = [] 
annoMap = defaultdict(list) 

for annot in annoList: 
    tempList = annot[3:] # adapt this to the needed columns 
    annoMap[annot[2].lower()].append(tempList) # store these columns in the map under the key column of interest 

for full in mainList: 
    geneName = full[2].lower() 
    if geneName in annoMap: # check if matching column exists 
        matches = annoMap[geneName] # renamed from `list` to avoid shadowing the built-in 
        full.extend(matches[0]) # append the columns of the first matching annotation 
        output.append(full) 

with open("location3", 'w') as final: 
    a = csv.writer(final, delimiter=',') 
    a.writerows(output) 