如何查找符合條件並返回R中匹配行的所有行？

-1

我有兩個數據框，一個用於SNP，一個用於基因。如何查找符合條件並返回R中匹配行的所有行？

對於每個基因，如果SNP位置在窗口大小內，我想將該行返回到數據框。然後，我想找出該特定SNP與該基因在該窗口中的相關性。我目前使用R.

基因數據幀：

Chr Start End sample1 sample2 sample3 
10 100015109 100015443 2 1 1 
10 100365832 100368960 1 0 2 
10 100486970 100487277 2 1 0

個SNP數據幀：

SNP CHROM POSITION sample1 sample2 sample3 
rs3766180 1 1478153 1 1 2 
rs7540231 1 1506035 2 2 0 
rs2272908 1 1721479 1 1 2 
rs10907187 1 1759054 0 1 2

我有這樣的代碼，到目前爲止，但我不知道如果我做出正確的迭代。我想遍歷這些基因，並檢查哪些snps位於窗口大小內，並找到該snp和該基因之間的r平方。例如，如果snp1的位置位於基因的起始和結束範圍內，則選擇該行，然後在這兩行之間找到r平方。我認爲我的循環是錯誤的，可能有一個更簡單的方法。請幫忙。

snps <- as.matrix(read.table("snps.txt", header=T, sep="\t")) 
genes <- as.matrix(read.table("genes.txt", header=T, sep="\t")) 

#Set upper and lower bounds 
size = 1000000 
window_left = genes$cnvStart - size 
window_right = genes$cnvEnd + size 
snp_pos <- snps$POS 
snp <- snps$ID 


for (s in 1:nrow(snps)){ 
    for(g in 1:nrow(genes)){ 
    if (snp_pos >window_left & snp_pos < window_right){ 
     corr.matrix2 <- (cor(t(s),t(g),use="pairwise.complete.obs", method="pearson")) 
     new_snps <- cbind(snp, snps[,-3]) 
    } 
    } 
}

我希望的輸出是每個選定的snp基因比較的r平方值表。任何想法將不勝感激。

感謝，內華達州

來源

2015-08-20 Nev

一個偏估計，以幫助，與所希望的輸出的最小可重放例子將有助於我理解問題。 – user2673238

好的，謝謝。我編輯了這篇文章。我希望澄清我遇到的麻煩。請讓我知道如果它仍然不清楚。 – Nev

你對「r-squared」有什麼期望？你認爲這意味着什麼？我建議你看[wikipedia]（https://en.wikipedia.org/wiki/Coefficient_of_determination）。你的數據有多大？ – Llopis

好吧還有一些我不清楚。

第一：在沒有數據框SNP的位置是在Genes數據幀的Start和End內 - 我做了，他們就是一個例子。

第二：你想使用該行關聯sample1,2和3下的另一行？

e.i如果你想這些行。

Chr Start End sample1 sample2 sample3 
10 100015109 100015443 2 1 1 <---- THIS ROW? 

SNP CHROM POSITION sample1 sample2 sample3 
rs3766180 1 1478153 1 1 2 <---- AND THIS ROW? 

My understanding is that you want to correlate 2 1 1 with 1 1 2

我有一個工作示例現在：

Genes<-data.frame(Chr=c(10,10,10),Start=c(100015109,100365832,100486970),End=c(100015443,100368960,100487277),sample1=c(2,1,2),sample2=c(1,0,1),sample3=c(1,2,0)) 
SNP <- data.frame(SNP= c("rs3766180","rs7540231","rs2272908"),CHROM=c(1,1,1),POSITION=c(100015200,100365831,100486971),sample1=c(1,2,1),sample2=c(1,2,1),sample3=c(2,0,2)) 

> Genes 
    Chr  Start  End sample1 sample2 sample3 
1 10 100015109 100015443  2  1  1 
2 10 100365832 100368960  1  0  2 
3 10 100486970 100487277  2  1  0 
> SNP 
     SNP CHROM POSITION sample1 sample2 sample3 
1 rs3766180  1 100015200  1  1  2 
2 rs7540231  1 100365831  2  2  0 
3 rs2272908  1 100486971  1  1  2 

CorTestMatrix <- data.frame() 

for (igene in 1:nrow(Genes)) { # for every gene 
     curGeneRow <- Genes[igene ,] # take that row 
     for (isnp in 1:nrow(SNP)) { # for every SNP 
       cursnp <- SNP[isnp ,] # take that row of SNP 
       if (cursnp$POSITION > curGeneRow$Start & curGeneRow$End > cursnp$POSITION) { # is the SNP in the Gene Window= 
         x<-as.numeric(as.vector(cursnp[,4:ncol(cursnp)])) # if you want the row from Position, 
         y<-as.numeric(as.vector(curGeneRow[,4:ncol(curGeneRow)])) # and want the row from End 
         corTest <- cor.test(x,y) # correlate those two 
         CurTestMatrix <- data.frame(GeneChr=curGeneRow$Chr,SNP=levels(droplevels(cursnp$SNP)),test=as.numeric(corTest[3])) 
         # saves some info from both Dataframes and the p Value from the cortest. 
         # you need edit this to get additional data. 
         CorTestMatrix <- smartbind(CorTestMatrix,CurTestMatrix) 

       } 

     } 
} 

> CorTestMatrix 
    GeneChr  SNP  test 
1:1  10 rs3766180 0.6666667 
1:2  10 rs3766180 0.6666667 
2  10 rs2272908 0.3333333

有可能是做這一個更快的方法，而是一個for循環是簡單的編輯和玩。我已經做到這樣，SNP的第一和第三行應該分別在GeneRow 1和3的開始和結束之內，分別等於2次相關性測試。

我會建議，如果需要校正非正常分佈的樣本：

sp_Cov1 <- shapiro.test(x);sp_Cov2 <- shapiro.test(y) # correction for non-normallity 
if(sp_Cov1[2] < 0.05 | sp_Cov2[2] < 0.05) {correlationToUse = 'kendall' 
} else {correlationToUse = 'pearson'} 

corTest <- cor.test(x,y,method=correlationToUse)

爲了避免的p$value

來源

2015-08-20 18:57:21 user2673238

是的，你的理解是正確的。我完全忘了抓住SNP落入這個範圍的例子。對於那個很抱歉。但是，是的，我想將snp df中的那一行與基因df中的那一行相關聯。我明白你的估計偏差是什麼意思。我會給這個嘗試和驗證！謝謝。 – Nev

酷，別的，然後表明答案是正確的：D – user2673238

當然會。剛剛完成了運行。我得到這個錯誤：UseMethod（「droplevels」）中的錯誤：沒有應用於類「NULL」對象的'droplevels'的適用方法我認爲drop級別是基本包的一部分。我不知道爲什麼會出現這個錯誤。 – Nev

我複製你的代碼和評論這

snps <- as.matrix(read.table("snps.txt", header=T, sep="\t")) 
genes <- as.matrix(read.table("genes.txt", header=T, sep="\t"))

這是沒有錯的，但最好是使該類型的文件，如果他們是分開的名稱清晰通過標籤它們是tsv文件（t ab s eDrated f iles）。這樣，您就可以與其他軟件（Microsoft Excel或類似）

#Set upper and lower bounds 
size = 1000000 
window_left = genes$cnvStart - size 
window_right = genes$cnvEnd + size 
snp_pos <- snps$POS 
snp <- snps$ID

更容易打開他們在這裏設置變量，但你得到的載體，所以SNP或snp_pos是矢量。如果您想稍後使用它，您必須知道您需要哪種數據。

for (s in 1:nrow(snps)){ 
    for(g in 1:nrow(genes)){

獲得所需的數據幀後，您可以通過snps行數和基因行數迭代。爲什麼你不使用snp_pos和snp變量？

if (snp_pos >window_left & snp_pos < window_right){

在這裏，您正在比較所有你想要的，你不需要兩個先前的循環。

  corr.matrix2 <- (cor(t(s),t(g),use="pairwise.complete.obs", method="pearson"))

您不使用選定的變量來創建成對關聯。你應該使用你的變量。我還建議繪製視覺比較的相關性。（您可能需要它們太）

 new_snps <- cbind(snp, snps[,-3]) 
    } 
    } 
}

這不會創造它加入一個數據幀這不是一個表矢量表。

我還沒有測試，但我會做這樣的事情：

snps <- as.matrix(read.table("snps.txt", header=T, sep="\t")) 
genes <- as.matrix(read.table("genes.txt", header=T, sep="\t")) 

#Set upper and lower bounds 
size = 1000000 
window_left = genes$cnvStart - size 
window_right = genes$cnvEnd + size 

in_window = snps[snps$POS >window_left & snps$POS < window_right] 
corr.matrix2 <- (cor(in_window$, in_window$ ,use="pairwise.complete.obs", method="pearson"))

我真的不知道你想做哪個相關，所以你應該改變COR的前兩個參數函數（不完整的in_window $）。我想你想比較哪些樣本有哪些SNP。但這是另一個問題;）

來源

2015-08-20 18:53:48 Llopis

好吧，我正在創建兩個循環，然後爲我的情況做同樣的事情。難怪我沒有得到任何東西。我會嘗試這兩種方法，然後驗證。謝謝。 – Nev

喲沒有得到任何東西，都沒有錯誤？ – Llopis

我沒有在我的版本中得到任何東西。甚至沒有錯誤。我現在仍在運行這兩種方法。完成後會更新。謝謝！！！！ – Nev

如何查找符合條件並返回R中匹配行的所有行？

回答

相關問題