2016-10-25 93 views
-1

R - gsub function

Edit: I have an input data frame like this:

[image: input data frame]

The output I want looks like this:

[image: desired output]

Please find my explanation below. I really don't know how to give a more detailed explanation than this :(

[image: explanation of the desired output]

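Let me explain. In the input data set, for the rows with COL1 value "10", I want to scan the COL2 values and replace any text that repeats another COL2 value with "*". The same logic applies to all COL2 values that share a COL1 value. I want to use the gsub function for this.

I tried gsub together with paste several times to replace any repeated text pattern, but I did not get the desired output because I don't know how to match all of the repeated patterns.

I have asked this question before, but since I did not receive an answer, I am reposting it.

The dput of the input data frame is attached below: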

structure(list(COL1 = c(10L, 10L, 10L, 20L, 20L, 30L, 30L, 40L, 
40L, 40L, 50L, 50L, 50L), COL2 = c("mary has life", "Don mary has life", 
"Britto mary has life", "push them fur", "push them ", "yell at this", 
"this is yell at this", "Year", "Doggy", "Horse", "This is great job", 
"great job", "Donkey")), .Names = c("COL1", "COL2"), row.names = c(NA, 
-13L), class = "data.frame") 
+1

What have you tried? You have already gotten [an answer](http://stackoverflow.com/questions/40125508/r-eliminating-duplicate-values) to this question. What is wrong with it? – Jaap

+0

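I tried that same answer and tried to modify it to fit my requirement. Please note that the two questions are different; anyone who reads them will see the difference. I also noted that I want to use the gsub function for this, and I never got a relevant answer. – Rambo

Answers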

4

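You can write a function that runs gsub across each item in a group and chooses the shortest replacement (aside from the item's replacement of itself, of course):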

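fun <- function(col){
    # matches[i, j]: col[i] with every occurrence of col[j] replaced by "*"
    matches <- sapply(col, function(x){gsub(x, '\\*', col)})
    # ignore each item's replacement of itself
    diag(matches) <- NA
    # per row, keep the shortest result (which.min skips the NA entries)
    apply(matches, 1, function(x){x[which.min(nchar(x))]})
}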
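
To make the matrix step concrete, here is a minimal sketch that walks through the same three operations on the COL2 values of the COL1 == 10 group. The variable names `col`, `matches`, and `result` are only for illustration, and the sketch assumes R 3.3.0 or later, where `nchar()` returns NA for a missing string so that `which.min()` skips it.

```r
# Illustrative input: the three COL2 values from the COL1 == 10 group
col <- c("mary has life", "Don mary has life", "Britto mary has life")

# matches[i, j] is col[i] with every occurrence of col[j] replaced by "*"
matches <- sapply(col, function(x) gsub(x, "\\*", col))

# Replacing a string with itself always gives "*", so blank out the diagonal
diag(matches) <- NA

# For each row keep the shortest candidate; NA entries are skipped by which.min()
result <- unname(apply(matches, 1, function(x) x[which.min(nchar(x))]))
print(result)
```

The result should correspond to rows 1 to 3 of the COL3 column in the output printed further down. Note that each value is used as a regular expression here, so values containing characters such as "." or "(" may match more than their literal text.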

Now implement it in your favorite grammar:

library(dplyr) 

df %>% group_by(COL1) %>% mutate(COL3 = fun(COL2)) 

## Source: local data frame [13 x 3] 
## Groups: COL1 [5] 
## 
##  COL1     COL2   COL3 
## <int>    <chr>   <chr> 
## 1  10  mary has life mary has life 
## 2  10 Don mary has life   Don * 
## 3  10 Britto mary has life  Britto * 
## 4  20  push them fur   *fur 
## 5  20   push them  push them 
## 6  30   yell at this yell at this 
## 7  30 this is yell at this  this is * 
## 8  40     Year   Year 
## 9  40    Doggy   Doggy 
## 10 40    Horse   Horse 
## 11 50 This is great job  This is * 
## 12 50   great job  great job 
## 13 50    Donkey  Donkey 

Or keep it all in base R:

df$COL3 <- ave(df$COL2, df$COL1, FUN = fun) 

df 

## COL1     COL2   COL3 
## 1 10  mary has life mary has life 
## 2 10 Don mary has life   Don * 
## 3 10 Britto mary has life  Britto * 
## 4 20  push them fur   *fur 
## 5 20   push them  push them 
## 6 30   yell at this yell at this 
## 7 30 this is yell at this  this is * 
## 8 40     Year   Year 
## 9 40    Doggy   Doggy 
## 10 40    Horse   Horse 
## 11 50 This is great job  This is * 
## 12 50   great job  great job 
## 13 50    Donkey  Donkey 
+0

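The code you provided works fine for the input above. But if, for example, I have two COL2 values that are both "mouse mouse", then both values are replaced with "*", which is not desirable. Only one value should be replaced with "*", and the other should remain "mouse mouse". – Rambo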

+0

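@alistaire ... The code you provided works for the input above. But if, for example, I have two COL2 values that are both "mouse mouse", then both values are replaced with "*", which is not desirable. Only one value should be replaced with "*", and the other should remain "mouse mouse". – Rambo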

+0

Add a line to account for duplicates, e.g. `fun <- function(col){col[duplicated(col)] <- '*'; matches <- sapply(col, function(x){gsub(x, '\\*', col)}); diag(matches) <- col; apply(matches, 1, function(x){x[which.min(nchar(x))]})}` – alistaire
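
As a hedged alternative to the one-liner in the comment above, the sketch below spells out one possible reading of the duplicate requirement: the first copy of a repeated value is kept, later copies become "*", and the substring scan runs only over the distinct values. The name `fun_dedup` and the switch to `fixed = TRUE` (literal matching instead of regular expressions) are assumptions of this sketch, not part of the original answer.

```r
# Hypothetical variant of fun(); fun_dedup and fixed = TRUE are assumptions
fun_dedup <- function(col) {
    out <- col
    dup <- duplicated(col)      # TRUE for the 2nd, 3rd, ... copy of a value
    out[dup] <- "*"             # later copies collapse to "*" outright
    idx <- which(!dup)          # first occurrences still get the substring scan
    for (i in idx) {
        # col[i] with each *other* distinct value replaced literally by "*"
        others <- col[setdiff(idx, i)]
        cands <- vapply(others,
                        function(p) gsub(p, "*", col[i], fixed = TRUE),
                        character(1))
        # keep the shortest candidate; leave col[i] alone if there is none
        if (length(cands) > 0) out[i] <- cands[which.min(nchar(cands))]
    }
    out
}

print(fun_dedup(c("mouse mouse", "mouse mouse")))
print(fun_dedup(c("mary has life", "Don mary has life", "Britto mary has life")))
```

It can be dropped into either call shown earlier, for example `ave(df$COL2, df$COL1, FUN = fun_dedup)`. Because it does not build a matrix, it also accepts a group with a single row.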