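Filtering out rows under certain conditions

I want to filter rows, keeping only those whose value matches a value from another file. Any help would be appreciated.

My data looks like this:

File1: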

Group Position Code  Answer c1  c2 c3 c4 
    1  3  s1_60 A  etc etc etc etc 
    2  4  s2_63 T  etc2_ etc2 etc2/ etc2' 
    3  5  s1_23 A  etc3 etc3 etc3* etc3 
    3  51  s7_52 T  etc4 etc4_ etc4 etc4^

File2:

>1 
ATGCGCGCGCGCGATATATTGCTGATATATATGCCTTttaagatcaatat 
tattagccccatgtgttgaagaacaaatctctctgttaaacagaaattgg 
gggggaaaataaacaggggggcaaataattctgactacaattgtatatat 
ggatatattGCGCGCGCGCGAGAGAGAGAGAtgtgttgtagataGACGAG 
>2 
tattagccccatgtgttgaagaacaaatctctctgttaaacagaaattgg 
gggggaaaataaacaggggggcaaataattctgactacaattgtatatat 
ggatatattGCGCGCaaaaaaGAGAGAGAGAGAtgtgttgtagataGACG 
>3 
tattagccccatgtgttgaagaacaaatctctctgttaaacagaaattgg 
gggggaaaataaacaggggggcaaataattctgactacaattgtatatat 
ggatatattGCGCGCGCGccggcgcgcgAGAtgtgttgtagataGACGAG

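'Group' refers to the number after '>' in File2, and 'Position' refers to the position of the letter within that group. I only want to keep the rows whose 'Answer' column matches the letter found in File2 at that position.

So the output should look like this:

newOutput: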

Group Position Code  Answer c1  c2 c3 c4 
    2  4  s2_63 T  etc2_ etc2 etc2/ etc2' 
    3  5  s1_23 A  etc3 etc3 etc3* etc3 
    3  51  s7_52 T  etc4 etc4_ etc4 etc4^

The first row of File1 is not included because it has 'A' rather than 'K'.
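The matching rule can be checked by hand on the sample data. The following is a minimal Python sketch, assuming positions are 1-based and the comparison ignores case. Only the first line of each File2 group is included, and `letter_at` and `keep_row` are hypothetical helper names. In the sample as posted, the third letter of group 1 is 'G', which differs from the 'A' in the first row, so that row is dropped.

```python
# First line of each group from File2, keyed by the number after '>'.
sequences = {
    1: "ATGCGCGCGCGCGATATATTGCTGATATATATGCCTTttaagatcaatat",
    2: "tattagccccatgtgttgaagaacaaatctctctgttaaacagaaattgg",
    3: "tattagccccatgtgttgaagaacaaatctctctgttaaacagaaattgg",
}


def letter_at(sequences, group, position):
    """Return the upper-cased letter at a 1-based position, or None if out of range."""
    seq = sequences.get(group, "")
    if 1 <= position <= len(seq):
        return seq[position - 1].upper()
    return None


def keep_row(sequences, group, position, answer):
    """A row is kept only when its Answer equals the letter found in File2."""
    return letter_at(sequences, group, position) == answer.upper()


print(letter_at(sequences, 1, 3))      # G
print(keep_row(sequences, 1, 3, "A"))  # False
print(keep_row(sequences, 2, 4, "T"))  # True
print(keep_row(sequences, 3, 5, "A"))  # True
```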

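I would appreciate any help. I am thinking of starting with awk or Python. I have never organized data involving multiple files, so this is a bit frustrating for me. Please advise.

Source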

2014-11-03 user3557715

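import csv

# Build a dict mapping each group number to its full, upper-cased sequence.
with open("File2") as infile:
    d = {}
    bases = ''
    group = None
    for line in infile:
        line = line.strip()
        if line.startswith(">"):
            # A new header: store the sequence collected for the previous group.
            if group is not None:
                d[group] = bases.upper()
            group = int(line[1:])
            bases = ''
            continue
        bases += line
    if group is not None:
        d[group] = bases.upper()  # store the last group

# Assumes File1 is tab-delimited with Group, Position, Code, Answer as its first four columns.
with open("File1", newline='') as infile, open('output', 'w', newline='') as outfile:
    reader = csv.reader(infile, delimiter='\t')
    writer = csv.writer(outfile, delimiter='\t')
    writer.writerow(next(reader))  # copy the header row
    for g, pos, code, answer, *rest in reader:
        g = int(g)
        pos = int(pos)
        # Positions are 1-based; keep the row only if the letters match.
        if d[g][pos-1] == answer:
            writer.writerow([g, pos, code, answer] + rest)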

Source

2014-11-03 05:09:21 inspectorG4dget

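It says: d[group] = line.strip() NameError: name 'line' is not defined. What am I doing wrong? – user3557715 2014-11-03 06:18:53

@user3557715: Oops! Sorry about that. Fixed now – inspectorG4dget 2014-11-03 06:23:08

Thanks! I noticed it too. But I have another problem. On "group = int(group[1:].strip())" I think it is also being applied to lines that do not start with ">". It shows something like "ValueError: invalid literal for int() with base 10: 'ALKFEKSSGESDGASHSDG'". Is there any way to apply it only to the lines that start with '>'? – user3557715 2014-11-03 06:27:29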
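One way to restrict the conversion to header lines, sketched here under the assumption that every header is '>' followed by an integer, is to test the line with `startswith` before calling `int()` and to treat every other line as sequence text. `read_groups` is a hypothetical name, and the demo input is made up around the sequence quoted in the error message.

```python
import io


def read_groups(lines):
    """Collect sequence text per group; only '>' lines are parsed as integers."""
    groups = {}
    group = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Only header lines reach int(), so sequence text cannot raise ValueError.
            group = int(line[1:].strip())
            groups[group] = ""
        elif group is not None:
            groups[group] += line.upper()  # sequence line, appended as-is
    return groups


# Hypothetical two-record input; any iterable of lines (such as an open file) works.
demo = io.StringIO(">1\nALKFEKSSGESDGASHSDG\n>2\nalkfe\nkssg\n")
print(read_groups(demo))
# {1: 'ALKFEKSSGESDGASHSDG', 2: 'ALKFEKSSG'}
```

Passing an open file handle instead of the `io.StringIO` object should work the same way, since both yield one line per iteration.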

Here is an AWK solution:

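BEGIN {
    # Indices into the "group_position" key after split()
    GROUP=1;
    BASE=2;
}
# First file (File1): remember the expected letter for each group/position.
# Assumes the three-column layout Group, Position, Answer; use $4 if Answer is the fourth column.
NR == FNR {
    positions[$1"_"$2]=toupper($3)
}

# Second file (File2): concatenate the sequence lines of each group.
NR != FNR {
    if($0 ~ /^>/) {
        group=substr($0, 2, length($0));
        gsub(/[ \t]/, "", group);  # drop stray whitespace after the group number
    } else {
        gsub(" ", "", $0);
        seqs[group]=seqs[group]$0;
    }
}

END {
    print "Group","Position","Answer"
    for(current_group in seqs) {
        for(key in positions) {
            split(key,position,"_");
            if(position[GROUP] == current_group) {
                # Look up the letter in the sequence of the group being checked.
                if(toupper(substr(seqs[current_group],position[BASE],1)) \
                        == positions[key]) {
                    print position[GROUP],
                          position[BASE],
                          positions[key];
                }
            }
        }
    }
}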

awk -f script.awk File1 File2

Output:

Group Position Answer
2 4 T
3 5 A

Position 51 of group 3 appears to be a G, not a T, so my output differs from yours.
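The position-51 lookup can be reproduced with a little index arithmetic. This Python sketch assumes the sequence lines are wrapped at 50 letters, which holds for the first two lines of group 3 in the sample; `locate` is a hypothetical helper name.

```python
def locate(position, width=50):
    """Convert a 1-based position into a 1-based (line number, column) pair."""
    line_index, column_index = divmod(position - 1, width)
    return line_index + 1, column_index + 1


# The three sequence lines of group 3 from File2.
group3_lines = [
    "tattagccccatgtgttgaagaacaaatctctctgttaaacagaaattgg",
    "gggggaaaataaacaggggggcaaataattctgactacaattgtatatat",
    "ggatatattGCGCGCGCGccggcgcgcgAGAtgtgttgtagataGACGAG",
]

line_no, col_no = locate(51)
print(line_no, col_no)                                # 2 1
print(group3_lines[line_no - 1][col_no - 1].upper())  # G
```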

Source

2014-11-03 07:43:26 qwwqwwq

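Sorry, I have extra columns, as you can see from the edited OP. What should I change to reflect these extra columns? – user3557715 2014-11-03 18:34:03

You can make some superficial edits to the code I provided, namely storing the extra columns in the positions associative array and then reading them out again, although changing your question is a bit unfair to me :) – qwwqwwq 2014-11-03 20:06:49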
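The comment above describes an edit to the awk script. As a separate illustration of the same idea, and not the awk change itself, the Python sketch below keeps every column by writing back the original line whenever the row passes. It assumes File1 is whitespace-delimited with Group, Position and Answer in columns 1, 2 and 4; `filter_rows` is a hypothetical name and the sequences are shortened for the example.

```python
def filter_rows(file1_lines, sequences):
    """Return the header plus every row whose Answer matches the sequence letter."""
    kept = []
    header_seen = False
    for line in file1_lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if not header_seen:
            kept.append(line.rstrip())  # first non-blank line is the header
            header_seen = True
            continue
        group, position, answer = int(fields[0]), int(fields[1]), fields[3]
        seq = sequences.get(group, "")
        # Positions are 1-based; rows pointing outside the sequence are dropped.
        if 1 <= position <= len(seq) and seq[position - 1].upper() == answer.upper():
            kept.append(line.rstrip())  # the whole line, extra columns included
    return kept


sequences = {2: "tattag", 3: "tatta"}  # shortened, hypothetical sequences
rows = [
    "Group Position Code  Answer c1  c2 c3 c4",
    "    2  4  s2_63 T  etc2_ etc2 etc2/ etc2'",
    "    3  5  s1_23 A  etc3 etc3 etc3* etc3",
    "    3  51  s7_52 T  etc4 etc4_ etc4 etc4^",
]
for row in filter_rows(rows, sequences):
    print(row)
```

Because the original line is stored rather than rebuilt from fields, the extra columns come through unchanged however many there are.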
