2017-03-16

How can I speed up a Python script that compares values between two lists and ranges?

I have a dataset consisting of two large files:

File1: 
Gen1 1 1 10 
Gen2 1 2 20 
Gen3 2 30 40 

File2: 
A 1 4 
B 1 15 
C 2 2 

Expected output:

Out: 
Gen1 1 1 10 A 1 4 
Gen2 1 2 20 B 1 15 

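Basically, I am trying to find the cases where a File2 row falls inside a File1 row: the code in File2[1] must match File1[1], and the value in File2[2] must lie within the range given by File1[2] and File1[3].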

My code that does this is below:

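# file1 and file2 are the lines of the two files
for i in file1:
    temp = i.split()
    for a in file2:
        temp2 = a.split()
        # split() returns strings, so convert to int before the range check
        if temp[1] == temp2[1] and int(temp2[2]) >= int(temp[2]) and int(temp2[2]) <= int(temp[3]):
            print(i + " " + a + "\n")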

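The code works, but it takes longer than I expected. Is there a simpler or faster way to do this? I suspect there is some clever use of a map or hash that I am missing.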

Thanks!

40 30 doesn't look like a valid range, does it?

Correct, I should fix that! – perot57

With pandas, which uses a compiled backend, this would be a one-liner – maxymoo

Answer

pandas might be a good choice. See this example.

When the files are large, I prefer sqlite over pandas. A pandas DataFrame can be loaded from an sqlite database.

import sqlite3 

file1 = """Gen1 1 1 10 
Gen2 1 2 20 
Gen3 2 30 40""" 

file2 = """A 1 4 
B 1 15 
C 2 2""" 

# your code (fixed) 
print("desired output") 
for i in file1.splitlines():
    temp = i.split()
    for a in file2.splitlines():
        temp2 = a.split()
        if temp[1] == temp2[1] and int(temp2[2]) >= int(temp[2]) and int(temp2[2]) <= int(temp[3]):
            print(i + " " + a)


# Make an in-memory db 
# Set a filename if your files are too big or if you want to reuse this db 
con = sqlite3.connect(":memory:") 
c = con.cursor() 

c.execute("""CREATE TABLE file1 
(
    gene_name text, 
    a integer, 
    b1 integer, 
    b2 integer 
)""") 

for row in file1.splitlines():
    if row:
        c.execute("INSERT INTO file1 (gene_name, a, b1, b2) VALUES (?,?,?,?)", tuple(row.split()))

c.execute("""CREATE TABLE file2 
(
    name text, 
    a integer, 
    b integer 
)""") 

for row in file2.splitlines():
    if row:
        c.execute("INSERT INTO file2 (name, a, b) VALUES (?,?,?)", tuple(row.split()))

# join the two tables
print("sqlite3 output") 
for row in c.execute("""SELECT 
    file1.gene_name, 
    file1.a, 
    file1.b1, 
    file1.b2, 
    file2.name, 
    file2.a, 
    file2.b 
FROM file1 
JOIN file2 ON file1.a = file2.a AND file2.b >= file1.b1 AND file2.b <= file1.b2 
"""): 
    print(row) 

con.close() 

Output (from Python 2, hence the u'' string prefixes):

desired output 
Gen1 1 1 10 A 1 4 
Gen2 1 2 20 A 1 4 
Gen2 1 2 20 B 1 15 
sqlite3 output 
(u'Gen1', 1, 1, 10, u'A', 1, 4) 
(u'Gen2', 1, 2, 20, u'A', 1, 4) 
(u'Gen2', 1, 2, 20, u'B', 1, 15)
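
The question also asks about a map or hash. As a rough sketch of that idea in plain Python (not part of the sqlite approach above), File2 can be grouped by its key column in a dict and each group sorted, so that every File1 row is checked only against File2 rows with the same key, and the range lookup becomes a binary search instead of a full scan. The function name `match_ranges` and the assumption that both files fit in memory as lists of lines are mine, for illustration only.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


def match_ranges(file1_lines, file2_lines):
    """Pair each File1 row with every File2 row whose key matches and
    whose value lies inside the File1 range (inclusive on both ends)."""
    # Hash step: group File2 rows by their key column (second field).
    by_key = defaultdict(list)
    for line in file2_lines:
        fields = line.split()
        if fields:
            by_key[fields[1]].append((int(fields[2]), line.strip()))

    # Sort each group by value and keep a parallel list of the values
    # so that bisect can locate the range boundaries.
    values_by_key = {}
    for key, rows in by_key.items():
        rows.sort()
        values_by_key[key] = [value for value, _ in rows]

    matches = []
    for line in file1_lines:
        fields = line.split()
        if not fields or fields[1] not in by_key:
            continue
        start, end = int(fields[2]), int(fields[3])
        values = values_by_key[fields[1]]
        lo = bisect_left(values, start)   # first value >= start
        hi = bisect_right(values, end)    # one past the last value <= end
        for _, row in by_key[fields[1]][lo:hi]:
            matches.append(line.strip() + " " + row)
    return matches


if __name__ == "__main__":
    file1 = ["Gen1 1 1 10", "Gen2 1 2 20", "Gen3 2 30 40"]
    file2 = ["A 1 4", "B 1 15", "C 2 2"]
    for match in match_ranges(file1, file2):
        print(match)
```

On the sample data this should give the same three pairs as the two approaches above. If the files do not fit in memory, the sqlite version with a file-backed database is probably the safer choice.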