2016-03-08

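awk to update file if value is within range

I have file2 with ~1400 `$5` values in which the part before the `-` is "unknown". I am trying to update those "unknown" values in file1 using the text in `$2` of file2. `$1` in file1 holds a range of numbers that can be used to update "unknown" if `$4` falls within that range. I really don't know where to start, but the awk below is a start, or maybe there is a better way. Thank you :).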

file1

 `$1`   `$2` 
chr6:3224495-3227968 TUBB2B 
chr16:89988417-90002505 TUBB3 

file2

chr16 89985657 89986630 chr16:89985657-89986630 MC1R-2270|gc=63.5 
chr16 89989779 89989898 chr16:89989779-89989898 unknown-2271|gc=73.9 
chr16 89998969 89999097 chr16:89998969-89999097 unknown-2272|gc=57 
chr16 89999866 89999996 chr16:89999866-89999996 unknown-2273|gc=55.4 
chr16 90001127 90002222 chr16:90001127-90002222 unknown-2274|gc=63.9 
chr17 1173848 1174575 chr17:1173848-1174575 BHLHA9-3|gc=78.7 

Desired output (unknown updated to TUBB3 because the `$4` value is within the range of `$1`):

chr16 89985657 89986630 chr16:89985657-89986630 MC1R-2270|gc=63.5 
chr16 89989779 89989898 chr16:89989779-89989898 TUBB3-2271|gc=73.9 
chr16 89998969 89999097 chr16:89998969-89999097 TUBB3-2272|gc=57 
chr16 89999866 89999996 chr16:89999866-89999996 TUBB3-2273|gc=55.4 
chr16 90001127 90002222 chr16:90001127-90002222 TUBB3-2274|gc=63.9 
chr17 1173848 1174575 chr17:1173848-1174575 BHLHA9-3|gc=78.7 

AWK

awk ' 
NR == FNR {min[$1]=$4; next} 
{ 
    for (id in min) 
     if ([id] = $5 && [id]) { 
      print $0, id 
      break 
     } 
} 
' file1 file2 

Edit:

awk -v OFS='\t' 'NR==FNR{split($1,a,/[:-]/) 
         rstart[a[1]]=a[2] 
         rend[a[1]]=a[3] 
         value[a[1]]=$2 
         next} 
$5~/unknown/ && $2>=rstart[$1] && $3<=rend[$1] 
         {sub(/unknown/,value[$1],$5)}1' file1 file2 | 
column -t > output 


chr16 89985657 89986630 chr16:89985657-89986630 MC1R-2270|gc=63.5 
chr16 89989779 89989898 chr16:89989779-89989898 unknown-2271|gc=73.9 
chr16 89989779 89989898 chr16:89989779-89989898 TUBB3-2271|gc=73.9 
chr16 89998969 89999097 chr16:89998969-89999097 unknown-2272|gc=57 
chr16 89998969 89999097 chr16:89998969-89999097 TUBB3-2272|gc=57 
chr16 89999866 89999996 chr16:89999866-89999996 unknown-2273|gc=55.4 
chr16 89999866 89999996 chr16:89999866-89999996 TUBB3-2273|gc=55.4 
chr16 90001127 90002222 chr16:90001127-90002222 unknown-2274|gc=63.9 
chr16 90001127 90002222 chr16:90001127-90002222 TUBB3-2274|gc=63.9 
chr17 1173848 1174575 chr17:1173848-1174575 BHLHA9-3|gc=78.7 

I think you mixed up 'file1' and 'file2' a few times in the text.

Answer


awk to the rescue!

$ awk -v OFS='\t' 'NR==FNR{split($1,a,/[:-]/) 
          rstart[a[1]]=a[2] 
          rend[a[1]]=a[3] 
          value[a[1]]=$2 
          next} 
    $5~/unknown/ && $2>=rstart[$1] && $3<=rend[$1] {
          sub(/unknown/,value[$1],$5)}1' file1 file2 |
    column -t 

chr16 89985657 89986630 chr16:89985657-89986630 MC1R-2270|gc=63.5 
chr16 89989779 89989898 chr16:89989779-89989898 TUBB3-2271|gc=73.9 
chr16 89998969 89999097 chr16:89998969-89999097 TUBB3-2272|gc=57 
chr16 89999866 89999996 chr16:89999866-89999996 TUBB3-2273|gc=55.4 
chr16 90001127 90002222 chr16:90001127-90002222 TUBB3-2274|gc=63.9 
chr17 1173848 1174575 chr17:1173848-1174575 BHLHA9-3|gc=78.7 

The substitution modifies the original spacing of the updated lines, so pipe the output to column -t for tabular format.
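
A note on the duplicated lines in the question's edit: in awk, a pattern with no `{` on the same line is a complete rule whose default action is to print the line. If the pattern and the action end up on separate lines, each matching line is printed once unchanged and once updated. The sketch below reproduces both behaviours on three lines of the sample data. The temporary files and the array names `lo`, `hi`, and `gene` are illustrative choices, not part of the original answer.

```shell
# Three lines of the question's sample data, written to temporary files.
tmp=$(mktemp -d)
cat > "$tmp/file1" <<'EOF'
chr6:3224495-3227968 TUBB2B
chr16:89988417-90002505 TUBB3
EOF
cat > "$tmp/file2" <<'EOF'
chr16 89985657 89986630 chr16:89985657-89986630 MC1R-2270|gc=63.5
chr16 89989779 89989898 chr16:89989779-89989898 unknown-2271|gc=73.9
chr17 1173848 1174575 chr17:1173848-1174575 BHLHA9-3|gc=78.7
EOF

# Pattern and action on separate lines: the pattern alone prints the
# matching line, then sub() runs on every line and 1 prints again.
broken=$(awk -v OFS='\t' '
NR==FNR { split($1, a, /[:-]/); lo[a[1]] = a[2] + 0; hi[a[1]] = a[3] + 0; gene[a[1]] = $2; next }
$5 ~ /unknown/ && $2 >= lo[$1] && $3 <= hi[$1]
{ sub(/unknown/, gene[$1], $5) }
1' "$tmp/file1" "$tmp/file2")

# Opening brace on the pattern line: sub() runs only for matching
# lines, and the trailing 1 prints each line exactly once.
fixed=$(awk -v OFS='\t' '
NR==FNR { split($1, a, /[:-]/); lo[a[1]] = a[2] + 0; hi[a[1]] = a[3] + 0; gene[a[1]] = $2; next }
$5 ~ /unknown/ && $2 >= lo[$1] && $3 <= hi[$1] {
    sub(/unknown/, gene[$1], $5)
}
1' "$tmp/file1" "$tmp/file2")

printf 'separate lines: %s output lines\n' "$(printf '%s\n' "$broken" | grep -c '')"
printf 'same line:      %s output lines\n' "$(printf '%s\n' "$fixed" | grep -c '')"
printf '%s\n' "$fixed"
rm -rf "$tmp"
```

The `+ 0` forces the stored range bounds to be numbers, so the comparisons against `$2` and `$3` are numeric rather than string comparisons.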


'rstart' is not the best choice of variable name; it is too similar to a built-in awk variable: https://www.gnu.org/software/gawk/manual/html_node/Auto_002dset.html#Auto_002dset


Also, you only need one split: 'split($1,a,/[:-]/)'


Thank you very much :). – Chris
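
Following the comments above, here is a minimal sketch of a variant that uses variable names further from awk's built-in `RSTART` and that also copes with several ranges on the same chromosome. The answer's arrays are keyed by chromosome only, so a later file1 line for the same chromosome would overwrite an earlier one. The names `lo`, `hi`, `gene`, and `count` are illustrative, and the `GENE_X` range and the `unknown-9999` line are made-up data added only to exercise the second range; they are not from the question.

```shell
tmp=$(mktemp -d)
cat > "$tmp/file1" <<'EOF'
chr6:3224495-3227968 TUBB2B
chr16:89988417-90002505 TUBB3
chr16:100-200 GENE_X
EOF
cat > "$tmp/file2" <<'EOF'
chr16 89989779 89989898 chr16:89989779-89989898 unknown-2271|gc=73.9
chr16 120 150 chr16:120-150 unknown-9999|gc=50
chr17 1173848 1174575 chr17:1173848-1174575 BHLHA9-3|gc=78.7
EOF

result=$(awk -v OFS='\t' '
NR==FNR {
    # One split yields chromosome, range start, and range end.
    split($1, a, /[:-]/)
    n = ++count[a[1]]            # number of ranges seen for this chromosome
    lo[a[1], n]   = a[2] + 0     # + 0 forces a numeric value
    hi[a[1], n]   = a[3] + 0
    gene[a[1], n] = $2
    next
}
$5 ~ /^unknown/ {
    # Try each stored range for this chromosome; stop at the first hit.
    for (i = 1; i <= count[$1]; i++)
        if ($2 >= lo[$1, i] && $3 <= hi[$1, i]) {
            sub(/^unknown/, gene[$1, i], $5)
            break
        }
}
1' "$tmp/file1" "$tmp/file2")

printf '%s\n' "$result" | column -t
rm -rf "$tmp"
```

Lines whose interval falls in no stored range keep their "unknown" label, which matches the behaviour of the answer's version.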