2017-01-02 49 views

Python - How to get rid of lines whose boundary ranges overlap

I have a file whose contents look like this:

1 33725 36725 ENHANCER0002
1 711760 714760 ENHANCER0003
1 724150 727150 ENHANCER0004
1 725455 728455 ENHANCER0005
1 871280 874410 ENHANCER0006
1 874180 877180 ENHANCER0007
1 900540 903540 ENHANCER0008
1 901475 904475 ENHANCER0009
1 910260 913260 ENHANCER00010
1 933355 936355 ENHANCER00011
1 947660 950660 ENHANCER00012
1 1013530 1016530 ENHANCER00013
.
.
.
1 2477030 2480030 ENHANCER00043
1 2478160 2481160 ENHANCER00044
1 2478845 2481845 ENHANCER00045

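The two middle columns are my lower and upper boundaries. In some places, such as lines 3-4 or lines 5-6, the boundaries overlap. I need to reshape the file so that, wherever boundaries overlap, only the lowest lower boundary and the highest upper boundary are printed. I am using Python for this, and here is my code: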

def write_line(chr_no,tmp_l,tmp_h,cnt,filename): 
    filename.write(str(chr_no)+"\t"+str(tmp_l)+"\t"+str(tmp_h)+"\t"+"ENHANCER000"+str(cnt)+"\n") 


inf = open("/home/firat/Desktop/Onder_Lab/Kenan/enhancers_bj.bed","r") 
outf = open("/home/firat/Desktop/deneme_v3.bed","w") 

cnt = 0 
tmp_l=0 
tmp_h=0 

tmp_list = [] 

for line in inf: 
    cnt += 1 
    line = line.split(' ') 
    current_low = line[1] 
    current_high = line[2] 
    previous_low = tmp_l 
    previous_high = tmp_h 
    if (int(current_low) <= int(previous_high)): 
     tmp_list.append(int(current_low)) 
     tmp_list.append(int(current_high)) 
     tmp_list.append(int(previous_low)) 
     tmp_list.append(int(previous_high)) 
     write_line(line[0],min(tmp_list),max(tmp_list),cnt,outf) 
     tmp_l = min(tmp_list) 
     tmp_h = max(tmp_list) 
     tmp_list = [] 
    else: 
     write_line(line[0], previous_low, previous_high, cnt, outf) 
     tmp_l= current_low 
     tmp_h= current_high 

Although my solution seems to work, the output looks like this:

1 27460 30460 ENHANCER0002
1 33725 36725 ENHANCER0003
1 711760 714760 ENHANCER0004
1 724150 728455 ENHANCER0005
1 724150 728455 ENHANCER0006
1 871280 877180 ENHANCER0007
1 871280 877180 ENHANCER0008
1 900540 904475 ENHANCER0009
1 900540 904475 ENHANCER00010
1 910260 913260 ENHANCER00011
1 933355 936355 ENHANCER00012
1 947660 950660 ENHANCER00013
1 1013530 1016530 ENHANCER00014
.
.
.
1 2477030 2481160 ENHANCER00044
1 2477030 2481845 ENHANCER00045
1 2477030 2481845 ENHANCER00046

As you can see, lines are printed more than once wherever boundaries overlap. There are also cases where 3 lines overlap, as at the bottom. The expected result should look like this:

1 27460 30460 ENHANCER0002
1 33725 36725 ENHANCER0003
1 711760 714760 ENHANCER0004
1 724150 728455 ENHANCER0005
1 871280 877180 ENHANCER0006
1 900540 904475 ENHANCER0007
1 910260 913260 ENHANCER0008
.
.
.
1 2477030 2481845 ENHANCER00046

What is wrong with my code, and how can I improve it so that it still works when more than 2 lines overlap?

Answer


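Your code seems overly complicated for a simple task. You don't need four variables - tmp_l, tmp_h, previous_low and previous_high. You don't need to maintain a list of the current overlapping intervals either. All you need to do is keep track of the low and the high of the current group of overlapping intervals.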

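The actual problem with your code, however, is that you call write_line on every iteration. write_line should be called only when the current low is higher than the previous high, which means the previous group of overlapping intervals has ended, and once more after the loop has finished.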

The following code will work:

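cnt = 0
previous_low = 0
previous_high = 0
chr_no = None

for line in inf:  # inf is an open file object, so iterate over it directly
    cnt += 1
    fields = line.split()  # split on any whitespace (spaces or tabs)
    current_low = int(fields[1])
    current_high = int(fields[2])
    if current_low <= previous_high:
        # overlaps the current group: extend its upper boundary if needed
        previous_high = max(previous_high, current_high)
    else:
        # the previous group has ended: write it, then start a new one
        if previous_high > 0:
            write_line(chr_no, previous_low, previous_high, cnt, outf)
        previous_low = current_low
        previous_high = current_high
    chr_no = fields[0]  # remember the chromosome of the group being built

# write the last group of overlapping intervals
if previous_high > 0:
    write_line(chr_no, previous_low, previous_high, cnt, outf)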

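The check if previous_high > 0 is needed so that the default values of previous_low and previous_high (0, 0) are not written out. The extra write_line after the for loop is needed to output the last group of overlapping intervals once the input has ended.

This code will also work when there are more than 2 overlapping intervals.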
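
If you want to check the merging logic without touching any files, the same idea can be sketched as a small generator. This is only an illustration under a few assumptions: merge_overlapping is a name I made up, the rows are assumed to be sorted by chromosome and lower boundary, and the output IDs are simply renumbered from 1 in the same "ENHANCER000" + number style as write_line, which may not be the numbering you want. The sample rows are copied from the input in the question.

```python
def merge_overlapping(intervals):
    """Yield merged (chrom, low, high) groups.

    Assumes the intervals are already sorted by chromosome and lower boundary.
    """
    current = None  # the group being built, as [chrom, low, high]
    for chrom, low, high in intervals:
        if current is not None and chrom == current[0] and low <= current[2]:
            # Overlaps the open group: extend its upper boundary if needed.
            current[2] = max(current[2], high)
        else:
            # A gap (or a new chromosome) closes the open group.
            if current is not None:
                yield tuple(current)
            current = [chrom, low, high]
    if current is not None:
        yield tuple(current)  # flush the last group


# A few rows taken from the sample input in the question.
rows = [
    ("1", 711760, 714760),
    ("1", 724150, 727150),
    ("1", 725455, 728455),
    ("1", 2477030, 2480030),
    ("1", 2478160, 2481160),
    ("1", 2478845, 2481845),
]

# Hypothetical numbering: one consecutive ID per merged group.
for number, (chrom, low, high) in enumerate(merge_overlapping(rows), start=1):
    print(f"{chrom}\t{low}\t{high}\tENHANCER000{number}")
```

With these rows, the pair starting at 724150 collapses into 724150-728455 and the three rows at the bottom collapse into 2477030-2481845, so each group is printed exactly once.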