2014-09-27 88 views

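Output the part of a string to the left or right of a match

I have two files; file1 contains substrings of the lines in file2. I want to match file1 against file2 and output the part to the left of the match, not the match itself. I would also like to know how to output the part to the right of the match, again without the match itself. Here is part of my data (these particular strings may not all match; it is only sample data):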

File 1

ACUGUACAGGCCACUGCCUUGC 
CUGCGCAAGCUACUGCCUUGCU 
UGGAAUGUAAAGAAGUAUGUAU 
CGAAUCAUUAUUUGCUGCUCUA 
AUCACAUUGCCAGGGAUUACC 
UUCACAGUGGCUAAGUUCUGC 

File 2

CCAGGCUGAGGUAGUAGUUUGUACAGUUUGAGGGUCUAUGAUACCACCCGGUACAGGAGAUAACUGUACAGGCCACUGCCUUGCCAGG 
CUGGCUGAGGUAGUAGUUUGUGCUGUUGGUCGGGUUGUGACAUUGCCCGCUGUGGAGAUAACUGCGCAAGCUACUGCCUUGCUAG 
GCUUGGGACACAUACUUCUUUAUAUGCCCAUAUGAACCUGCUAAGCUAUGGAAUGUAAAGAAGUAUGUAUUUCAGGC 
CUGUAGCAGCACAUCAUGGUUUACAUACUACAGUCAAGAUGCGAAUCAUUAUUUGCUGCUCUAG 
GGCUGCUUGGGUUCCUGGCAUGCUGAUUUGUGACUUGAGAUUAAAAUCACAUUGCCAGGGAUUACCACGCAACC 

For example:

File 1:

            GCUGUGGAGAUAACUGCGC 

File 2

CUGGCUGAGGUAGUAGUUUGUGCUGUUGGUCGGGUUGUGACAUUGCCCGCUGUGGAGAUAACUGCGCAAGC 

Output

CUGGCUGAGGUAGUAGUUUGUGCUGUUGGUCGGGUUGUGACAUUGCCC 

Answers

1

Here are a couple of ways to keep only the text that comes before the pattern, if the pattern is present:

a <- "GCUGUGGAGAUAACUGCGC" 
b <- "CUGGCUGAGGUAGUAGUUUGUGCUGUUGGUCGGGUUGUGACAUUGCCCGCUGUGGAGAUAACUGCGCAAGC" 

strsplit(b, a)[[1]][1] 
sub(paste0(a, ".*$"), "", b) 
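
For readers working in Python rather than R, the same idea can be sketched with str.partition, which also returns the text to the right of the match that the question asks about. This is only an illustration under the assumption that the pattern is a plain string (not a regular expression); the helper name flanks is made up for this example.

```python
def flanks(text, pattern):
    """Return (left, right) around the first occurrence of pattern, or None if absent."""
    left, found, right = text.partition(pattern)
    if not found:  # partition yields an empty separator when the pattern is missing
        return None
    return left, right


a = "GCUGUGGAGAUAACUGCGC"
b = "CUGGCUGAGGUAGUAGUUUGUGCUGUUGGUCGGGUUGUGACAUUGCCCGCUGUGGAGAUAACUGCGCAAGC"

result = flanks(b, a)
if result is not None:
    left, right = result
    print("left: ", left)
    print("right:", right)
```

Returning None for a missing pattern keeps "no match" distinct from "match at the very start", where the left part is an empty string.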

Now you just need to read the files into R and loop over each pattern. I'm not entirely sure what you are looking for, but here is one idea:

# read the data into two variables, a and b
# in practice you could use readLines() to read each file from disk
a <- readLines(textConnection("ACUGUACAGGCCACUGCCUUGC
CUGCGCAAGCUACUGCCUUGCU
UGGAAUGUAAAGAAGUAUGUAU
CGAAUCAUUAUUUGCUGCUCUA
AUCACAUUGCCAGGGAUUACC
UUCACAGUGGCUAAGUUCUGC"))

b <- readLines(textConnection("CCAGGCUGAGGUAGUAGUUUGUACAGUUUGAGGGUCUAUGAUACCACCCGGUACAGGAGAUAACUGUACAGGCCACUGCCUUGCCAGG
CUGGCUGAGGUAGUAGUUUGUGCUGUUGGUCGGGUUGUGACAUUGCCCGCUGUGGAGAUAACUGCGCAAGCUACUGCCUUGCUAG
GCUUGGGACACAUACUUCUUUAUAUGCCCAUAUGAACCUGCUAAGCUAUGGAAUGUAAAGAAGUAUGUAUUUCAGGC
CUGUAGCAGCACAUCAUGGUUUACAUACUACAGUCAAGAUGCGAAUCAUUAUUUGCUGCUCUAG
GGCUGCUUGGGUUCCUGGCAUGCUGAUUUGUGACUUGAGAUUAAAAUCACAUUGCCAGGGAUUACCACGCAACC"))

Now loop over each value from the first file:

lapply(a, function(x) sapply(strsplit(b, x), "[", 1)) 
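
A rough Python counterpart of this "every pattern against every line" loop might look like the sketch below. It assumes plain-string patterns and uses only the first two sample sequences from each file; the names left_of and table are invented for the example. Unlike the strsplit() version, which returns the whole line when there is no match, this sketch records None so that misses are easy to spot.

```python
patterns = [
    "ACUGUACAGGCCACUGCCUUGC",
    "CUGCGCAAGCUACUGCCUUGCU",
]
lines = [
    "CCAGGCUGAGGUAGUAGUUUGUACAGUUUGAGGGUCUAUGAUACCACCCGGUACAGGAGAUAACUGUACAGGCCACUGCCUUGCCAGG",
    "CUGGCUGAGGUAGUAGUUUGUGCUGUUGGUCGGGUUGUGACAUUGCCCGCUGUGGAGAUAACUGCGCAAGCUACUGCCUUGCUAG",
]


def left_of(text, pattern):
    """Text before the first occurrence of pattern, or None when it does not occur."""
    idx = text.find(pattern)
    return text[:idx] if idx != -1 else None


# One entry per pattern; each entry holds one result per line of the second file.
table = {p: [left_of(line, p) for line in lines] for p in patterns}

for p, results in table.items():
    print(p, results)
```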

@GracieD: Every element of the output is identical. Try: ll = lapply(a, function(i) sapply(strsplit(b, a[i]), "[[", 1)); for (i in 2:length(ll)) print(identical(ll[[i]], ll[[i-1]])) – rnso 2014-09-28 02:04:31

@rnso Thanks. Updated. – GracieD 2014-09-28 04:06:04

1

Opening file handles on strings for testing:

use strict; 
use warnings; 
use autodie; 

open my $fh1, '<', \ "ACUGUACAGGCCACUGCCUUGC\nCUGCGCAAGCUACUGCCUUGCU\nUGGAAUGUAAAGAAGUAUGUAU\nCGAAUCAUUAUUUGCUGCUCUA\nAUCACAUUGCCAGGGAUUACC\nUUCACAGUGGCUAAGUUCUGC\n"; 
open my $fh2, '<', \ "CCAGGCUGAGGUAGUAGUUUGUACAGUUUGAGGGUCUAUGAUACCACCCGGUACAGGAGAUAACUGUACAGGCCACUGCCUUGCCAGG\nCUGGCUGAGGUAGUAGUUUGUGCUGUUGGUCGGGUUGUGACAUUGCCCGCUGUGGAGAUAACUGCGCAAGCUACUGCCUUGCUAG\nGCUUGGGACACAUACUUCUUUAUAUGCCCAUAUGAACCUGCUAAGCUAUGGAAUGUAAAGAAGUAUGUAUUUCAGGC\nCUGUAGCAGCACAUCAUGGUUUACAUACUACAGUCAAGAUGCGAAUCAUUAUUUGCUGCUCUAG\nGGCUGCUUGGGUUCCUGGCAUGCUGAUUUGUGACUUGAGAUUAAAAUCACAUUGCCAGGGAUUACCACGCAACC\n"; 

while (!eof $fh1 && !eof $fh2) { 
    chomp(my $line1 = <$fh1>); 
    chomp(my $line2 = <$fh2>); 

    print join(' ', split /$line1/, $line2, 2), "\n"; 
} 

Output:

CCAGGCUGAGGUAGUAGUUUGUACAGUUUGAGGGUCUAUGAUACCACCCGGUACAGGAGAUA CAGG 
CUGGCUGAGGUAGUAGUUUGUGCUGUUGGUCGGGUUGUGACAUUGCCCGCUGUGGAGAUAA AG 
GCUUGGGACACAUACUUCUUUAUAUGCCCAUAUGAACCUGCUAAGCUA UUCAGGC 
CUGUAGCAGCACAUCAUGGUUUACAUACUACAGUCAAGAUG G 
GGCUGCUUGGGUUCCUGGCAUGCUGAUUUGUGACUUGAGAUUAAA ACGCAACC 
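
The same line-by-line pairing can be sketched in Python, with io.StringIO standing in for the two files in the way the string file handles do above. This is an assumption-laden illustration: it uses only the first two sample lines, treats each pattern as a plain string rather than a regular expression, and the name pairs is invented here.

```python
import io

# In-memory stand-ins for file1 and file2 (first two sample lines only).
fh1 = io.StringIO(
    "ACUGUACAGGCCACUGCCUUGC\n"
    "CUGCGCAAGCUACUGCCUUGCU\n"
)
fh2 = io.StringIO(
    "CCAGGCUGAGGUAGUAGUUUGUACAGUUUGAGGGUCUAUGAUACCACCCGGUACAGGAGAUAACUGUACAGGCCACUGCCUUGCCAGG\n"
    "CUGGCUGAGGUAGUAGUUUGUGCUGUUGGUCGGGUUGUGACAUUGCCCGCUGUGGAGAUAACUGCGCAAGCUACUGCCUUGCUAG\n"
)

pairs = []
# zip stops at the shorter input, like the combined eof test in the loop above.
for pattern, text in zip(fh1, fh2):
    pattern, text = pattern.strip(), text.strip()
    # maxsplit=1 gives [left, right] on a match and [text] when there is none.
    pairs.append(text.split(pattern, 1))

for parts in pairs:
    print(" ".join(parts))
```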
1

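You can also try the Perl code below, which gets the text before the match, the text after the match, and the matched string itself using $PREMATCH ($`), $POSTMATCH ($') and $MATCH ($&):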

Input files:

file1.txt:

ACUGUACAGGCCACUGCCUUGC 
CUGCGCAAGCUACUGCCUUGCU 
UGGAAUGUAAAGAAGUAUGUAU 
CGAAUCAUUAUUUGCUGCUCUA 
AUCACAUUGCCAGGGAUUACC 
UUCACAGUGGCUAAGUUCUGC 

file2.txt:

CCAGGCUGAGGUAGUAGUUUGUACAGUUUGAGGGUCUAUGAUACCACCCGGUACAGGAGAUAACUGUACAGGCCACUGCCUUGCCAGG 
CUGGCUGAGGUAGUAGUUUGUGCUGUUGGUCGGGUUGUGACAUUGCCCGCUGUGGAGAUAACUGCGCAAGCUACUGCCUUGCUAG 
GCUUGGGACACAUACUUCUUUAUAUGCCCAUAUGAACCUGCUAAGCUAUGGAAUGUAAAGAAGUAUGUAUUUCAGGC 
CUGUAGCAGCACAUCAUGGUUUACAUACUACAGUCAAGAUGCGAAUCAUUAUUUGCUGCUCUAG 
GGCUGCUUGGGUUCCUGGCAUGCUGAUUUGUGACUUGAGAUUAAAAUCACAUUGCCAGGGAUUACCACGCAACC 

Code:

use strict;
use warnings;

open my $fh1, '<', "file1.txt" or die "Couldn't open the file file1.txt: $!";
open my $fh2, '<', "file2.txt" or die "Couldn't open the file file2.txt: $!";

# Read both files in parallel, one line from each per iteration
while (!eof $fh1 && !eof $fh2)
{
    chomp(my $line1 = <$fh1>);
    chomp(my $line2 = <$fh2>);

    # \Q...\E quotes any regex metacharacters in the pattern line
    if ($line2 =~ /\Q$line1\E/i)
    {
        print "Prematch: $`\t";    # text before the match
        print "Postmatch: $'\n";   # text after the match
    }
}
close($fh1);
close($fh2);

Output:

Prematch: CCAGGCUGAGGUAGUAGUUUGUACAGUUUGAGGGUCUAUGAUACCACCCGGUACAGGAGAUA Postmatch: CAGG 
Prematch: CUGGCUGAGGUAGUAGUUUGUGCUGUUGGUCGGGUUGUGACAUUGCCCGCUGUGGAGAUAA Postmatch: AG 
Prematch: GCUUGGGACACAUACUUCUUUAUAUGCCCAUAUGAACCUGCUAAGCUA Postmatch: UUCAGGC 
Prematch: CUGUAGCAGCACAUCAUGGUUUACAUACUACAGUCAAGAUG Postmatch: G 
Prematch: GGCUGCUUGGGUUCCUGGCAUGCUGAUUUGUGACUUGAGAUUAAA Postmatch: ACGCAACC
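
For comparison, Python's re module exposes the same three pieces through the match object rather than special variables. The sketch below is a hedged illustration, not a translation of the answer: match_context is a hypothetical helper name, and re.escape is used on the assumption that the pattern should be matched literally.

```python
import re


def match_context(pattern, text):
    """Return (prematch, match, postmatch) for the first literal match, or None."""
    m = re.search(re.escape(pattern), text)
    if m is None:
        return None
    # Slicing around the match span plays the role of $`, $& and $' in Perl.
    return text[:m.start()], m.group(0), text[m.end():]


line1 = "AUCACAUUGCCAGGGAUUACC"
line2 = "GGCUGCUUGGGUUCCUGGCAUGCUGAUUUGUGACUUGAGAUUAAAAUCACAUUGCCAGGGAUUACCACGCAACC"

context = match_context(line1, line2)
if context is not None:
    pre, matched, post = context
    print(f"Prematch: {pre}\tPostmatch: {post}")
```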