2013-11-01 126 views
1

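Improving loop speed in R

I am working with a matrix, set_onco, of 206 rows x 196 columns, and I have a vector, genes_100 (it is actually a matrix, but I only use the first column), with 101 names. Here is a snippet of how they look: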

> set_onco[1:10,1:10] 
                        V2      V3       V4        V5        V6      V7     V8     V9      V10     V11
GLI1_UP.V1_DN           COPZ1   C10orf46 C20orf118 TMEM181   CCNL2   YIPF1  GTDC1  OPN3    RSAD2   SLC22A1
GLI1_UP.V1_UP           IGFBP6  HLA-DQB1 CCND2     PTH1R     TXNDC12 M6PR   PPT2   STAU1   IGJ     TMOD3
E2F1_UP.V1_DN           TGFB1I1 CXCL5    POU5F1    SAMD10    KLF2    STAT6  ENTPD6 VCAN    HMGCS1  ANXA8
E2F1_UP.V1_UP           RRP1B   HES1     ADCY6     CHAF1B    VPS37B  GRSF1  TLX2   SSX2IP  DNA2    CMA1
EGFR_UP.V1_DN           NPY1R   PDZK1    GFRA1     GREB1     MSMB    DLC1   MYB    SLC6A14 IFI44   IFI44L
EGFR_UP.V1_UP           FGG     GBP1     TNFRSF11B FGB       GJA1    DUSP6  S100A9 ADM     ITGB6   DUSP4
ERB2_UP.V1_DN           NPY1R   PDZK1    ANXA3     GREB1     HSPB8   DLC1   NRIP1  FHL2    EGR3    IFI44
FAM18B1
ERB2_UP.V1_UP           CYP1A1  CEACAM5  FAM129A   TNFRSF11B DUSP4   CYP1B1 UPK2   DAB2    CEACAM6 KIAA1199
GCNP_SHH_UP_EARLY.V1_DN SRRM2   KIAA1217 DEFA1     DLK1      PITX2   CCL2   UPK3B  SEZ6    TAF15   EMP1

genes_100[1:10,1] 
[1] AL591845.1 B3GALT6  RAP1GAP  HSPG2  BX293535.1 RP1-159A19.1 IFI6   FAM76A  FAM176B  CSF3R  
101 Levels: 5_8S_rRNA AC018470.1 AC091179.2 AC103702.3 AC138972.1 ACVR1B AL049829.5 AL137797.2 AL139260.2 AL450326.2 AL591845.1 AL607122.2 B3GALT6 BX293535.1 ... ZNF678 

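What I want to do is parse through the matrix and count, for each row, how many of the names in genes_100 it contains.

To do this, I wrote three nested for loops: the first moves down one row at a time, the second moves across the columns of that row, and the third iterates over genes_100 checking for a match. At the end I save, in a matrix, how many terms of genes_100 matched in each row, together with the row name from the matrix (so I know which is which).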

The code works and gives me the correct output... but it is really slow!

A snippet of the output is:

head(result_matrix_100)

    freq_100 
[1,] "GLI1_UP.V1_DN" "0"  
[2,] "GLI1_UP.V1_UP" "0"  
[3,] "E2F1_UP.V1_DN" "0"  
[4,] "E2F1_UP.V1_UP" "0"  
[5,] "EGFR_UP.V1_DN" "0"  
[6,] "EGFR_UP.V1_UP" "0" 

I used system.time() and I get:

user system elapsed 
525.38 0.06 530.34 

This is far too slow, because I have bigger matrices to parse, and in some cases I have to repeat this 10k times!

The code:

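result_matrix_100 <- matrix(nrow=0, ncol=2)

for (q in seq(1,nrow(set_onco),1)) {                  # each row (gene set)
    freq_100 <- 0                                      # reset the match counter for this row
    for (j in seq(1, length(set_onco[q,]),1)) {        # each column of the row
        for (x in seq(1,101,1)) {                      # each name in genes_100
            if (as.character(genes_100[x,1]) == as.character(set_onco[q,j])) {
                freq_100 <- freq_100+1
            }
        }
    }
    # append the row name and its match count to the result
    result_matrix_100 <- rbind(result_matrix_100, cbind(row.names(set_onco)[q], freq_100))
}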

What do you suggest?

Thanks in advance :)

+1

Could you post sample data and the expected result? – David

+0

Sounds like a case where plyr might be useful? – colcarroll

+0

Surely the first step in a simple approach would be to just call 'table' on your matrix...? And '++ freq_100' does not look like R code at all. – joran

Answers

1

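@joran's approach will probably be faster, although it may not be "factor-safe". Your set_onco values may be encoded as factor variables (as your genes_100 object apparently is). This would be safer: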

set_onco[] <- lapply(set_onco, as.character) 
# that converts a data.frame with factor columns to character valued 
# at that point @joran's solution could be used safely 
freq100 <- apply(set_onco, 1, function(x) sum(x %in% genes_100)) 
# that does a row-by-row count of the number of matches to genes_100 
freq100 
          GLI1_UP.V1_DN           GLI1_UP.V1_UP           E2F1_UP.V1_DN 
                      0                       0                       0 
          E2F1_UP.V1_UP           EGFR_UP.V1_DN           EGFR_UP.V1_UP 
                      0                       0                       0 
          ERB2_UP.V1_DN           ERB2_UP.V1_UP GCNP_SHH_UP_EARLY.V1_DN 
                      0                       0                       0 
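
To illustrate why converting to character matters, here is a small sketch with made-up factors (the objects `a` and `b` are hypothetical and not taken from the question). Comparing two factors with `==` is an error when their level sets differ, and the integer codes behind a factor depend on its own levels, whereas `%in%` matches on the labels:

```r
# Hypothetical toy factors with different level sets
a <- factor(c("COPZ1", "IFI6"))
b <- factor(c("IFI6", "HSPG2"))

# `==` refuses to compare factors whose level sets differ;
# capture the error message instead of stopping
cmp <- tryCatch(a == b, error = function(e) conditionMessage(e))
print(cmp)

# The integer codes depend on each factor's own levels, not on the labels
print(as.integer(a))
print(as.integer(b))

# %in% matches on the labels, so it is safe to use with factors
label_match <- a %in% b
print(label_match)
```

This is presumably why the loop in the question wraps both sides of the comparison in `as.character()`.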

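Your dataset (206 rows × 196 cols) is quite small, so this will be nearly instantaneous. These dput statements and their output can be used to construct what I think your objects look like internally: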

dput(set_onco) 
structure(list(V2 = structure(c(1L, 4L, 8L, 6L, 5L, 3L, 5L, 2L, 
7L), .Label = c("COPZ1", "CYP1A1", "FGG", "IGFBP6", "NPY1R", 
"RRP1B", "SRRM2", "TGFB1I1"), class = "factor"), V3 = structure(c(1L, 
6L, 3L, 5L, 8L, 4L, 8L, 2L, 7L), .Label = c("C10orf46", "CEACAM5", 
"CXCL5", "GBP1", "HES1", "HLA-DQB1", "KIAA1217", "PDZK1"), class = "factor"), 
    V4 = structure(c(3L, 4L, 8L, 1L, 7L, 9L, 2L, 6L, 5L), .Label = c("ADCY6", 
    "ANXA3", "C20orf118", "CCND2", "DEFA1", "FAM129A", "GFRA1", 
    "POU5F1", "TNFRSF11B"), class = "factor"), V5 = structure(c(7L, 
    5L, 6L, 1L, 4L, 3L, 4L, 8L, 2L), .Label = c("CHAF1B", "DLK1", 
    "FGB", "GREB1", "PTH1R", "SAMD10", "TMEM181", "TNFRSF11B" 
    ), class = "factor"), V6 = structure(c(1L, 8L, 5L, 9L, 6L, 
    3L, 4L, 2L, 7L), .Label = c("CCNL2", "DUSP4", "GJA1", "HSPB8", 
    "KLF2", "MSMB", "PITX2", "TXNDC12", "VPS37B"), class = "factor"), 
    V7 = structure(c(8L, 6L, 7L, 5L, 3L, 4L, 3L, 2L, 1L), .Label = c("CCL2", 
    "CYP1B1", "DLC1", "DUSP6", "GRSF1", "M6PR", "STAT6", "YIPF1" 
    ), class = "factor"), V8 = structure(c(2L, 5L, 1L, 7L, 3L, 
    6L, 4L, 8L, 9L), .Label = c("ENTPD6", "GTDC1", "MYB", "NRIP1", 
    "PPT2", "S100A9", "TLX2", "UPK2", "UPK3B"), class = "factor"), 
    V9 = structure(c(4L, 8L, 9L, 7L, 6L, 1L, 3L, 2L, 5L), .Label = c("ADM", 
    "DAB2", "FHL2", "OPN3", "SEZ6", "SLC6A14", "SSX2IP", "STAU1", 
    "VCAN"), class = "factor"), V10 = structure(c(8L, 6L, 4L, 
    2L, 5L, 7L, 3L, 1L, 9L), .Label = c("CEACAM6", "DNA2", "EGR3", 
    "HMGCS1", "IFI44", "IGJ", "ITGB6", "RSAD2", "TAF15"), class = "factor"), 
    V11 = structure(c(8L, 9L, 1L, 2L, 6L, 3L, 5L, 7L, 4L), .Label = c("ANXA8", 
    "CMA1", "DUSP4", "EMP1", "IFI44", "IFI44L", "KIAA1199", "SLC22A1", 
    "TMOD3"), class = "factor")), .Names = c("V2", "V3", "V4", 
"V5", "V6", "V7", "V8", "V9", "V10", "V11"), class = "data.frame", row.names = c("GLI1_UP.V1_DN", 
"GLI1_UP.V1_UP", "E2F1_UP.V1_DN", "E2F1_UP.V1_UP", "EGFR_UP.V1_DN", 
"EGFR_UP.V1_UP", "ERB2_UP.V1_DN", "ERB2_UP.V1_UP", "GCNP_SHH_UP_EARLY.V1_DN" 
)) 

dput(factor(genes_100)) 
structure(c(1L, 2L, 9L, 7L, 3L, 10L, 8L, 6L, 5L, 4L), .Label = c("AL591845.1", 
"B3GALT6", "BX293535.1", "CSF3R", "FAM176B", "FAM76A", "HSPG2", 
"IFI6", "RAP1GAP", "RP1-159A19.1"), class = "factor") 
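
If the two-column layout from the question (set name plus count) is wanted, the named vector can be turned into a data frame. A minimal sketch on a three-row stand-in; the objects `toy_onco`, `toy_genes`, and `result` are assumptions made up for illustration, not the real data:

```r
# Hypothetical stand-in data (not the real set_onco / genes_100)
toy_onco <- data.frame(
  V2 = c("COPZ1", "IGFBP6", "FGG"),
  V3 = c("IFI6", "HSPG2", "GBP1"),
  V4 = c("CCND2", "IFI6", "CSF3R"),
  row.names = c("GLI1_UP.V1_DN", "GLI1_UP.V1_UP", "EGFR_UP.V1_UP"),
  stringsAsFactors = TRUE   # mimic the factor columns seen in the dput output
)
toy_genes <- factor(c("IFI6", "HSPG2", "CSF3R"))

# Convert factor columns to character, then count matches row by row
toy_onco[] <- lapply(toy_onco, as.character)
freq <- apply(toy_onco, 1, function(x) sum(x %in% toy_genes))

# Two-column result: set name and number of matching genes
result <- data.frame(set = names(freq), freq_100 = unname(freq))
print(result)
```

Unlike the `rbind()`/`cbind()` matrix in the question, this keeps the counts numeric instead of turning them into strings.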

1

Something like this will probably be pretty fast:

#Sample data 
m <- matrix(sample(letters,206*196,replace = TRUE),206,196) 
genes_100 <- letters[1:5] 

m1 <- matrix(m %in% genes_100,206,196) 
rowSums(m1)
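
Note that `%in%` returns a plain logical vector and drops the matrix dimensions, which is why the result has to be reshaped before `rowSums()`. If set_onco is a data frame rather than a matrix, it would also need to go through `as.matrix()` first. A hedged sketch that wraps this in a helper (the name `count_matches` is hypothetical) and checks it against a row-by-row `apply()` on random data:

```r
# Hypothetical helper: count, per row, how many cells appear in `genes`
count_matches <- function(x, genes) {
  m <- as.matrix(x)                    # data frame (even with factors) -> character matrix
  hits <- m %in% as.character(genes)   # logical vector, dimensions dropped
  dim(hits) <- dim(m)                  # restore the matrix shape
  counts <- rowSums(hits)
  names(counts) <- rownames(m)         # keep the set names, if any
  counts
}

set.seed(1)                            # reproducible sample data
m <- matrix(sample(letters, 206 * 196, replace = TRUE), 206, 196)
genes <- letters[1:5]

fast <- count_matches(m, genes)
slow <- apply(m, 1, function(x) sum(x %in% genes))

# Both approaches should give the same counts
all_equal <- all(fast == slow)
print(all_equal)

# Rough timing of the vectorised version
print(system.time(count_matches(m, genes)))
```

For the 10k repetitions mentioned in the question, the same helper could be called inside `sapply()` over a list of gene sets, so the matrix conversion logic lives in one place.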