Remove lines containing 30% or more lowercase letters

I am trying to process some data, but I cannot find a working solution for my problem. I have a file that looks like this:

>ram 
cacacacacacacacacatatacacatacacatacacacacacacacacacacacacaca 
cacacacacacacaca 
>pam 
GAATGTCAAAAAAAAAAAAAAAAActctctct 
>sam 
AATTGGCCAATTGGCAATTCCGGAATTCaattggccaattccggaattccaattccgg 

and many lines more....

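I want to filter out all lines, together with their headers (headers start with >), where the sequence string (the lines that do not start with >) consists of 30% or more lowercase letters. A sequence string can span multiple lines.

So after command XY the output should look like this: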

>pam 
GAATGTCAAAAAAAAAAAAAAAAActctctct

I tried various combinations of a while loop that reads the input file and then hands the work to awk, grep, or sed, but without good results.

Source

2017-02-21 JFS31

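What have you tried, and where did it fail? Show us your effort. – Inian

Also, `bash` is not suited to this, because it can neither compute nor compare floating-point values. You may as well remove the `bash` tag. – Inian

Alternatively: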

awk '{n=length(gensub(/[A-Z]/,"","g"));if(NF && n/length*100 < 30)print a $0;a=RT}' RS='>[a-z]+\n' file

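RS='>[a-z]+\n' - sets the record separator to the line containing '>' and the name
RT - this value is set to whatever the RS above matched
a=RT - saves the previous RT value
n=length(gensub(/[A-Z]/,"","g")); - gets the length of the lowercase characters
if(NF && n/length*100 < 30)print a $0; - checks that we have a value and that the percentage of lowercase letters is less than 30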
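For readers without gawk, the same idea (split on the header line, but keep the matched header, as RT does) can be sketched in Python. This is only an illustration under assumptions: the function name `filter_records` and the `max_lower` parameter are made up here, any line starting with `>` is treated as a header (the one-liner above only matches lowercase names), and newlines are not counted as sequence characters.

```python
import re


def filter_records(text, max_lower=0.30):
    """Keep records whose sequence is less than max_lower lowercase."""
    # A capture group makes re.split keep the separators, much as gawk
    # exposes the text matched by RS in RT.
    parts = re.split(r"(^>.*\n)", text, flags=re.MULTILINE)
    kept = []
    # parts = [text before first header, header1, body1, header2, body2, ...]
    for header, body in zip(parts[1::2], parts[2::2]):
        seq = "".join(body.split())  # drop newlines and stray blanks
        if not seq:
            continue  # header without a sequence: nothing to measure
        lower = sum(ch.islower() for ch in seq)
        if lower / len(seq) < max_lower:
            kept.append(header + body)
    return "".join(kept)
```

Calling `filter_records(open("file").read())` on input shaped like the question's would return the surviving records as a single string.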

Source

2017-02-21 15:10:25 grail

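`gawk` only, but nicely done. Maybe add some explanation. – dawg

Thank you very much. This works, and your description is very good and helped me understand what is going on. – JFS31

Here is an idea: set the record separator to ">" so that each header, together with its sequence lines, is treated as a single record.

Because the input starts with ">", this produces an initial empty record, so we use NR > 1 (record number greater than 1) to keep it out of the calculation.

To get the total character count, we add up the lengths of all lines after the header. To count the lowercase characters, we save the string in another variable and use gsub to replace every lowercase letter with nothing, simply because gsub returns the number of substitutions it made, which is a convenient way of counting them.

Finally, we check the ratio and print or not (adding back the initial ">" when we do print).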

BEGIN { RS = ">" } 

NR > 1 { 
    total_cnt = 0 
    lower_cnt = 0 
    for (i=2; i<=NF; ++i) { 
     total_cnt += length($i) 
     s = $i 
     lower_cnt += gsub(/[a-z]/, "", s) 
    } 
    ratio = lower_cnt/total_cnt 
    if (ratio < 0.3) print ">"$0 
} 


$ awk -f seq.awk seq.txt 
>pam 
GAATGTCAAAAAAAAAAAAAAAAActctctct
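
The counting trick carries over to other languages. As a rough Python sketch of the same steps (not part of the answer above; `filter_fasta` and `threshold` are names invented for this illustration), `re.subn` plays the role of gsub because it also returns the number of substitutions. Like the awk script, it assumes the header contains no whitespace and that ">" never appears inside a record.

```python
import re


def filter_fasta(text, threshold=0.3):
    """Return the records whose lowercase fraction is below threshold."""
    kept = []
    # Splitting on ">" leaves an empty chunk before the first header;
    # [1:] skips it, like the NR > 1 guard.
    for record in text.split(">")[1:]:
        fields = record.split()  # fields[0] is the header name
        total = lower = 0
        for field in fields[1:]:
            total += len(field)
            # re.subn returns (new_string, number_of_substitutions)
            lower += re.subn(r"[a-z]", "", field)[1]
        if total and lower / total < threshold:
            kept.append(">" + record)  # put the separator back
    return "".join(kept)
```

The `total and ...` test also skips a header that has no sequence at all, which would otherwise divide by zero.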

Source

2017-02-21 15:01:42 jas

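Are you including the header in the count? – ceving

Note that the lines starting with '>', the labels, should not be taken into account, because they are not part of the sequence string. –

No, I am not looking at the characters in the header, nor counting them when computing the ratio. (That is why the for loop starts at i = 2.) – jas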

awk '/^>/{b=B;gsub(/[A-Z]/,"",b);
      if(length(b) < length(B) * 0.3) print H "\n" B
      H=$0;B="";next}

    {B=((B != "") ? B "\n" : "") $0}

    END{ b=B;gsub(/[A-Z]/,"",b);
      if(length(b) < length(B) * 0.3) print H "\n" B
     }' YourFile

Quick and dirty; a function would suit the repeated print logic better.

Source
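
To show what factoring the repeated check into a function might look like, here is a hedged Python sketch of the same line-by-line approach. The names `emit_if_mostly_upper` and `filter_stream` are invented for this example, and unlike the awk version it does not count the newlines between sequence lines.

```python
import sys


def emit_if_mostly_upper(header, lines, out, threshold=0.3):
    """Write one record if less than threshold of its sequence is lowercase."""
    seq = "".join(lines)
    if header is None or not seq:
        return  # nothing collected yet, or a header without a sequence
    lower = sum(ch.islower() for ch in seq)
    if lower / len(seq) < threshold:
        out.write(header + "\n" + "\n".join(lines) + "\n")


def filter_stream(src, out=sys.stdout):
    """Read lines from src and write the surviving records to out."""
    header, lines = None, []
    for raw in src:
        line = raw.strip()
        if line.startswith(">"):
            emit_if_mostly_upper(header, lines, out)  # finish previous record
            header, lines = line, []
        elif line:
            lines.append(line)
    emit_if_mostly_upper(header, lines, out)  # flush the last record
```

Because the check lives in one function, the end-of-input case is a single extra call rather than a copy of the condition.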

2017-02-21 15:05:19 NeronLeVelu

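Thanks for your answer. This looks fancy; I will try to work out what is going on there. – JFS31

These days I would not use sed or awk for anything longer than two lines.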

#! /usr/bin/perl
use strict;                           # Force variable declaration.
use warnings;                         # Warn about dangerous language use.

sub filter                            # Declare a subroutine, a function called `filter`.
{
    my ($header, $body) = @_;         # Give the first two function arguments the names header and body.
    return unless length $body;       # Skip records without a sequence (avoids division by zero).
    my $lower = $body =~ tr/a-z//;    # Count the characters a-z; an empty replacement list changes nothing.
    print $header, $body, "\n"        # Print header, body and newline,
        unless $lower / length($body) >= 0.3;   # unless lowercase characters make up 30% or more.
}

my ($header, $body);                  # Declare two variables for header and body.
while (<>) {                          # Loop over all lines from stdin or a file given on the command line.
    if (/^>/) {                       # If the line starts with >,
        filter($header, $body)        # call filter with header and body,
            if defined $header;       # if header is defined, which is not the case at the beginning of the file.
        ($header, $body) = ($_, '');  # Assign the current line to header and an empty string to body.
    } else {
        chomp;                        # Remove the newline at the end of the line.
        $body .= $_;                  # Append the line to body.
    }
}
filter($header, $body)                # Filter the last record,
    if defined $header;               # if the input contained a header at all.

Source

2017-02-21 16:19:04 ceving

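Thanks for your answer, but unfortunately I have never worked with Perl, so I cannot really understand your code. I would prefer a solution in awk or sed that I can understand and modify myself, which suits my needs better. – JFS31

@JFS31 I added some comments. Maybe this is a good example for learning something new. – ceving

That is nice. I will take a look and try to get something out of it :) – JFS31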
