2016-11-01 69 views
-1

Splitting differently formatted names into first and last names using dplyr, tidyr, and regex

Sample data frame:

name <- c("Smith John Michael","Smith, John Michael","Smith John, Michael","Smith-John Michael","Smith-John, Michael") 
df <- data.frame(name, stringsAsFactors = FALSE)  # keep name as character so strsplit() accepts it

df 
                 name
1  Smith John Michael
2 Smith, John Michael
3 Smith John, Michael
4  Smith-John Michael
5 Smith-John, Michael

I need to produce the following output:

                 name first.name  last.name
1  Smith John Michael       John      Smith
2 Smith, John Michael       John      Smith
3 Smith John, Michael    Michael Smith John
4  Smith-John Michael    Michael Smith-John
5 Smith-John, Michael    Michael Smith-John

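The rules are as follows: if the string contains a comma, everything before the comma is the last name, and the first word after the comma is the first name. If the string contains no comma, the first word is the last name and the second word is the first name. A hyphenated word counts as a single word. I would prefer to do this with dplyr and regex, but I will take any solution. Thanks for your help.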
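The rules above can be sketched with base R regex functions alone. This is a minimal sketch rather than code from the thread: split_name is a hypothetical helper name, and names with fewer than two words are not handled specially.

# Hypothetical helper (not from the thread): apply the comma / no-comma rules
split_name <- function(name) {
    name <- as.character(name)                      # tolerate factor input
    has.comma <- grepl(",", name, fixed = TRUE)
    # Last name: everything before the comma, otherwise the first word
    last <- ifelse(has.comma, sub(",.*$", "", name), sub("\\s.*$", "", name))
    # Remainder: everything after the comma, otherwise everything after the first word
    rest <- ifelse(has.comma, sub("^[^,]*,", "", name), sub("^\\S+", "", name))
    # First name: first word of the remainder
    first <- sub("\\s.*$", "", trimws(rest))
    data.frame(name = name, first.name = first, last.name = last,
               stringsAsFactors = FALSE)
}

split_name(c("Smith John Michael", "Smith John, Michael", "Smith-John Michael"))

Because the splits use whitespace and commas only, a hyphenated word such as Smith-John stays in one piece. With dplyr, the same sub() expressions could presumably be placed inside mutate().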

+0

See http://stackoverflow.com/questions/7069076/split-column-at-delimiter-in-data-frame

Answers

1

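You can get the result you want with strsplit, switching the split delimiter between "," and " " depending on whether name contains a comma. Here we define two functions to make the presentation clearer. You can also inline the code from these functions.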

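get.last.name <- function(name) { 
    # Split on "," if the name contains a comma, otherwise on " ",
    # then keep the first piece of each split (the last name)
    lapply(ifelse(grepl(",",name),strsplit(name,","),strsplit(name," ")),`[[`,1) 
} 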

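The result of strsplit is a list. lapply(..., `[[`, 1) loops over this list and extracts the first element of each list element, which is the last name.

get.first.name <- function(name) { 
    # Take the second piece of each split: the first name, possibly followed by more words
    d <- lapply(ifelse(grepl(",",name),strsplit(name,","),strsplit(name," ")),`[[`,2) 
    # Drop a leading space, split on spaces, and keep the first word
    lapply(strsplit(gsub("^ ","",d), " "),`[[`,1) 
} 

This function is similar, except that it extracts the second element of each list element returned by strsplit, which contains the first name. We then use gsub to remove any leading space and split again on " ", taking the first element of each list element returned by that second strsplit as the first name.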
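Note that both helpers return lists, because lapply always returns a list, so mutate stores the new columns as list columns. If plain character columns are preferred, a minimal sketch using vapply could look like the following. The names split.on.delim, get.last.name.chr and get.first.name.chr are hypothetical, and like the original helpers this assumes every name splits into at least two pieces.

# Hypothetical variants of the helpers that return character vectors
split.on.delim <- function(name) {
    name <- as.character(name)   # tolerate factor input
    # Split on "," when a comma is present, otherwise on " "
    ifelse(grepl(",", name), strsplit(name, ","), strsplit(name, " "))
}

get.last.name.chr <- function(name) {
    # First piece of each split, checked to be a single string
    vapply(split.on.delim(name), `[[`, character(1), 1)
}

get.first.name.chr <- function(name) {
    # Second piece of each split, then its first word
    d <- vapply(split.on.delim(name), `[[`, character(1), 2)
    vapply(strsplit(trimws(d), " "), `[[`, character(1), 1)
}

vapply also fails with an error if any element is not a single string, which makes unexpected input easier to spot than with lapply.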

Putting it all together with dplyr:

library(dplyr) 
res <- df %>% mutate(first.name=get.first.name(name), 
        last.name=get.last.name(name)) 

The result is as expected:

print(res) 
##                  name first.name  last.name
## 1  Smith John Michael       John      Smith
## 2 Smith, John Michael       John      Smith
## 3 Smith John, Michael    Michael Smith John
## 4  Smith-John Michael    Michael Smith-John
## 5 Smith-John, Michael    Michael Smith-John

Data:

df <- structure(list(name = c("Smith John Michael", "Smith, John Michael", 
"Smith John, Michael", "Smith-John Michael", "Smith-John, Michael" 
)), .Names = "name", row.names = c(NA, -5L), class = "data.frame") 
##                  name
## 1  Smith John Michael
## 2 Smith, John Michael
## 3 Smith John, Michael
## 4  Smith-John Michael
## 5 Smith-John, Michael
+0

Thanks. That works great. – Eric

0

I don't know whether this is any better than 艾超's answer, but I did it anyway. It gives the correct output.

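library(dplyr) 
library(tidyr) 

# Rows with a comma: the last name is before the comma, the first name is the first word after it
df1 <- df %>% 
    filter(grepl(",",name)) %>% 
    separate(name, c("last.name","first.middle.name"), sep = ",", remove=FALSE) %>% 
    mutate(first.middle.name = trimws(first.middle.name)) %>% 
    separate(first.middle.name, c("first.name","middle.name"), sep=" ", remove=TRUE, extra="drop", fill="right") %>% 
    select(-middle.name) 

# Rows without a comma: the first word is the last name, the second word is the first name
df2 <- df %>% 
    filter(!grepl(",",name)) %>% 
    separate(name, c("last.name","first.name"), sep = " ", remove=FALSE, extra="drop") 

# Comma rows come first in the combined result
df<-rbind(df1,df2) 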
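One thing to be aware of: rbind(df1, df2) places the rows that contain a comma before those that do not, so the result is no longer in the original row order. If the order matters, one option is to reorder by matching on name. This is a minimal sketch that assumes the names are unique; original and combined are hypothetical stand-ins for the input data frame and for the rbind result.

# Stand-in for the original input (assumption: names are unique)
original <- data.frame(
    name = c("Smith John Michael", "Smith, John Michael", "Smith John, Michael",
             "Smith-John Michael", "Smith-John, Michael"),
    stringsAsFactors = FALSE
)

# Stand-in for rbind(df1, df2): comma rows (2, 3, 5) first, then rows 1 and 4
combined <- data.frame(
    name = original$name[c(2, 3, 5, 1, 4)],
    last.name = c("Smith", "Smith John", "Smith-John", "Smith", "Smith-John"),
    first.name = c("John", "Michael", "Michael", "John", "Michael"),
    stringsAsFactors = FALSE
)

# Reorder the combined rows so they follow the original data frame
restored <- combined[match(original$name, combined$name), ]
rownames(restored) <- NULL
restored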