Delete subsequent elements with the same substring

I have an n-by-2 object with variable names in column 1 and numeric values (scores) in column 2:

data  <- data.frame(matrix(nrow = 20, ncol = 2)) 
data[, 2] <- 1:20 
data[, 1] <- c("example_a_1", "example_a_2", "example_a_3", 
       "example_b_1", "example_c_1", "example_d_1", 
       "example_d_2", "example_d_3", "example_f_1", 
       "example_g_1", "example_g_2", "example_h_1", 
       "example_i_1", "example_l_1", "example_o_1", 
       "example_j_1", "example_m_1", "example_p_1", 
       "example_k_1", "example_n_1") 
data 
      X1 X2 
1 example_a_1 1 
2 example_a_2 2 
3 example_a_3 3 
4 example_b_1 4 
5 example_c_1 5 
6 example_d_1 6 
7 example_d_2 7 
8 example_d_3 8 
9 example_f_1 9 
10 example_g_1 10 
11 example_g_2 11 
12 example_h_1 12 
13 example_i_1 13 
14 example_l_1 14 
15 example_o_1 15 
16 example_j_1 16 
17 example_m_1 17 
18 example_p_1 18 
19 example_k_1 19 
20 example_n_1 20

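I don't want this object to contain similar variables: if the first 9 characters of a variable name (in this example) are the same as those of another variable name, it is a duplicate. In those cases I only want to keep the first of the similarly named variables.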

I can get a list of which variable names are duplicated like this:

rep <- as.data.frame(table(substr(data[,1], 1, 9))) 
rep <- rep[rep[, 2] > 1, ] 
rep 
     Var1 Freq 
1 example_a 3 
4 example_d 3 
6 example_g 2

and so identify them in a for loop or some other conditional:

for(i in 1:nrow(data)){ 
    if(substr(data[i, 1], 1, 9) %in% rep[,1]){ 
    # What goes here? 
    # or what's another approach? 
    } 
}

However, I'm not sure what logic I can use to delete the rows with duplicated names.

The end result should look like this:

data 
      X1 X2 
1 example_a_1 1 
2 example_b_1 4 
3 example_c_1 5 
4 example_d_1 6 
5 example_f_1 9 
6 example_g_1 10 
7 example_h_1 12 
8 example_i_1 13 
9 example_l_1 14 
10 example_o_1 15 
11 example_j_1 16 
12 example_m_1 17 
13 example_p_1 18 
14 example_k_1 19 
15 example_n_1 20

Source

2015-06-22 Hack-R

'data[!duplicated(substr(data$X1, 1, 9)), ]'? –

@Frank Done, thanks –

I prefer @RobertH's solution using duplicated; feel free to change your accepted answer. – zx8754

You can use duplicated:

short <- substr(data[,1], 1, 9) 
i <- duplicated(short) 
data <- data[!i , ]
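
For reference, here is a minimal self-contained sketch of this approach on the question's sample data. The names `keys` and `deduped` are illustrative only, and resetting the row names is an assumption based on the desired output shown in the question:

```r
# Rebuild the sample data from the question
data <- data.frame(
  X1 = c("example_a_1", "example_a_2", "example_a_3",
         "example_b_1", "example_c_1", "example_d_1",
         "example_d_2", "example_d_3", "example_f_1",
         "example_g_1", "example_g_2", "example_h_1",
         "example_i_1", "example_l_1", "example_o_1",
         "example_j_1", "example_m_1", "example_p_1",
         "example_k_1", "example_n_1"),
  X2 = 1:20,
  stringsAsFactors = FALSE
)

# The first 9 characters serve as the de-duplication key
keys <- substr(data$X1, 1, 9)

# duplicated() is FALSE for the first occurrence of each key and TRUE
# for every later one, so negating it keeps the first row per key
deduped <- data[!duplicated(keys), ]

# Renumber the rows 1..n (assumed from the desired output)
rownames(deduped) <- NULL
deduped
```

Because `duplicated()` is vectorised, no explicit loop or frequency table is needed.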

Source

2015-06-22 18:05:01 RobertH

Using dplyr:

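library(dplyr) 
data <- data %>% 
      group_by(my9 = substr(X1, 1, 9)) %>%  # group rows by the 9-character prefix 
      filter(row_number() == 1) %>%         # keep the first row in each group 
      ungroup() %>%                         # ungroup so my9 can actually be dropped 
      select(-my9)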
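
A possible alternative within dplyr is `distinct()` with `.keep_all = TRUE`, which keeps the first row for each distinct key. The sketch below assumes a dplyr version that supports `.keep_all` (0.5.0 or later) and uses a small illustrative subset of the question's data; `my9` and `deduped` are just placeholder names:

```r
library(dplyr)

# Small illustrative subset of the question's data
data <- data.frame(
  X1 = c("example_a_1", "example_a_2", "example_b_1",
         "example_d_1", "example_d_2"),
  X2 = c(1, 2, 4, 6, 7),
  stringsAsFactors = FALSE
)

deduped <- data %>%
  mutate(my9 = substr(X1, 1, 9)) %>%    # helper column holding the prefix
  distinct(my9, .keep_all = TRUE) %>%   # first row per prefix, all columns kept
  select(-my9)                          # drop the helper column again

deduped
```

This avoids grouping altogether, so no `ungroup()` step is needed before dropping the helper column.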

Source

2015-06-22 18:01:38 zx8754

Awesome! This works for me. –

I would create a column with the shortened name and aggregate on it:

data$short <- substr(data[,1], 1, 9) 
# One row per shortened name, holding the minimum score in that group 
agg <- aggregate(X2 ~ short, data = data, FUN = min)

I used min because you seem to be interested in the lowest score for each duplicated name.
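
Note that the aggregated result holds only the shortened name and the score. If the full original row is wanted, one option is `ave()`, sketched below on made-up scores (the values and the names `group_min` and `best` are illustrative, not from the question):

```r
# Illustrative data: scores chosen so the first row is not always the minimum
data <- data.frame(
  X1 = c("example_a_1", "example_a_2", "example_b_1",
         "example_d_1", "example_d_2"),
  X2 = c(3, 1, 4, 6, 7),
  stringsAsFactors = FALSE
)
data$short <- substr(data$X1, 1, 9)

# ave() returns, for every row, the minimum score within that row's group
group_min <- ave(data$X2, data$short, FUN = min)

# Keep the rows whose score equals their group's minimum
best <- data[data$X2 == group_min, ]
best
```

Swapping `min` for `max` keeps the highest-scoring row instead. If two rows in a group tie for the extreme value, both are kept.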

Source

2015-06-22 18:02:02 Michal

This is also a good approach. +1 Although the scores are completely arbitrary in this example, in my actual use case I could do this and substitute 'max' for 'min'. –
