2017-09-18 71 views
1

dplyr/dt: summarise and paste only if a column is not empty/NA?

My data is:

Name          House  Street      Apt  City    Postal  Phone
DUMAS PAUL    2030   GREEN ROAD       DESERT  Z0K2K1  999-577-3789
DUNN S               GREEN ROAD       DESERT  Z0K2K1  999-577-3256
FERGUSON BOB         GREEN ROAD       DESERT  Z0K2K1  999-577-3772
FITSCHEN A    3989   GREEN ROAD       DESERT  Z0K2K1  999-577-3556
BLACK CARY    2079   GREEN ROAD       DESERT  Z0K2K1  999-577-3779
BLACK RUTH    2079   GREEN ROAD       DESERT  Z0K2K1  999-577-3779

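I want to compare the rows (dynamically; the data is sorted by House) and, where the House # is equal, concatenate the two respective phone numbers with " OR ", concatenate the names with " AND ", and delete the row that is left over after concatenation.

I am using: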

data <- data %>% 
    group_by(House, Street, Apt, City, Postal) %>% 
    summarise(Name = first(paste(Name, collapse = ", AND ")),
              Phone = paste(unique(Phone), collapse = " OR ")) %>% 
    ungroup() %>% 
    arrange(Street, desc(House)) %>% 
    select(colnames(dataset)) %>% 
    filter(!Phone %in% dnc$`Home Phone`) 

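Problem: with the dplyr code above, rows are also concatenated when House is NA (or blank; I write my NAs out as blanks) and Apt is NA (or ""), which I don't want. So with the code above I get: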

Name                        House  Street      Apt  City    Postal  Phone
DUNN S, AND FERGUSON BOB           GREEN ROAD       DESERT  Z0K2K1  9995773256 OR 9995773772
DUMAS PAUL                  2030   GREEN ROAD       DESERT  Z0K2K1  9995773789
BLACK CARY, AND BLACK RUTH  2079   GREEN ROAD       DESERT  Z0K2K1  9995773779
FITSCHEN A                  3989   GREEN ROAD       DESERT  Z0K2K1  9995773556

Note that in the output above DUNN S and FERGUSON BOB are now together. I don't want that.

dput (apologies if it doesn't help):
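A minimal sketch of why this happens, using a made-up toy tibble rather than the real data: as far as I know, `group_by()` (and `count()`, which groups internally) treats `NA` as an ordinary key value, so every row with a missing House lands in the same group.

```r
library(dplyr)

# Hypothetical toy data: two rows with a missing House, two with the same House.
toy <- tibble(
  House = c(NA, NA, "2079", "2079"),
  Name  = c("DUNN S", "FERGUSON BOB", "BLACK CARY", "BLACK RUTH")
)

# One output row per group: the two NA rows are counted as a single group.
counts <- toy %>% count(House)
print(counts)
```

So any summarise() after such a grouping will paste the NA rows together unless they are kept apart explicitly.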

list(structure(list(X__1 = c(NA, NA, NA, NA, NA, NA), Name = c("DUMAS PAUL", 
    "DUNN S", "FERGUSON BOB", "FITSCHEN A", "BLACK CARY", "BLACK RUTH" 
    ), House = c("2030", NA, NA, "3989", "2079", "2079"), Street = c("GREEN ROAD", 
    "GREEN ROAD", "GREEN ROAD", "GREEN ROAD", "GREEN ROAD", "GREEN ROAD" 
    ), Apt = c(NA, NA, NA, NA, NA, NA), City = c("DESERT", "DESERT", 
    "DESERT", "DESERT", "DESERT", "DESERT"), Prov = c("ZK", "ZK", 
    "ZK", "ZK", "ZK", "ZK"), Postal = c("Z0K2K1", "Z0K2K1", "Z0K2K1", 
    "Z0K2K1", "Z0K2K1", "Z0K2K1"), Phone = c("999-577-3789", "999-577-3256", 
    "999-577-3772", "999-577-3556", "999-577-3779", "999-577-3779" 
    ), `Last Appear Date` = c(NA, NA, NA, NA, NA, NA)), .Names = c("X__1", 
    "Name", "House", "Street", "Apt", "City", "Prov", "Postal", "Phone", 
    "Last Appear Date"), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
    -6L))) 

Thanks

Answers

2

Inside DT[, {...}, by=], you can write almost anything. In this case, if ... else works (something similar can probably be done with dplyr::do):

library(data.table) 
library(magrittr) 
DT = as.data.table(data) 

DT[, 
    if (!(is.na(House) & is.na(Apt))) 
    .(
     Name = Name %>% paste(collapse = ", AND "), 
     Phone = Phone %>% unique %>% paste(collapse = " OR ") 
    ) 
    else 
    .(Name, Phone) 
, by=.(House, Street, Apt, City, Postal)] 

   House     Street Apt   City Postal                       Name        Phone
1:  2030 GREEN ROAD  NA DESERT Z0K2K1                 DUMAS PAUL 999-577-3789
2:    NA GREEN ROAD  NA DESERT Z0K2K1                     DUNN S 999-577-3256
3:    NA GREEN ROAD  NA DESERT Z0K2K1               FERGUSON BOB 999-577-3772
4:  3989 GREEN ROAD  NA DESERT Z0K2K1                 FITSCHEN A 999-577-3556
5:  2079 GREEN ROAD  NA DESERT Z0K2K1 BLACK CARY, AND BLACK RUTH 999-577-3779


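You don't have to use magrittr here; it's just my preference for the paste parts. You may also want to add a %>% sort step to those pipes (so that the lists of phones and names are always in increasing order).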
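The two remarks above (no magrittr, plus a sort step) could look roughly like the sketch below. It rebuilds only the columns it needs from the question's dput, with the two BLACK rows swapped so the sort is visible. The helper `is_blank()` is a hypothetical addition, not part of the original answer; it is there on the assumption that missing values may arrive as "" as well as NA.

```r
library(data.table)

# Toy copy of the question's data (only the columns used below).
DT <- data.table(
  Name   = c("DUMAS PAUL", "DUNN S", "FERGUSON BOB", "FITSCHEN A",
             "BLACK RUTH", "BLACK CARY"),
  House  = c("2030", NA, NA, "3989", "2079", "2079"),
  Street = "GREEN ROAD",
  Apt    = NA_character_,
  City   = "DESERT",
  Postal = "Z0K2K1",
  Phone  = c("999-577-3789", "999-577-3256", "999-577-3772",
             "999-577-3556", "999-577-3779", "999-577-3779")
)

# Hypothetical helper: treat both NA and "" as "no value".
is_blank <- function(x) is.na(x) | x == ""

res <- DT[,
  if (!(is_blank(House) & is_blank(Apt))) {
    # At least one of House/Apt is known: collapse the group into one row.
    list(Name  = paste(sort(Name), collapse = ", AND "),
         Phone = paste(sort(unique(Phone)), collapse = " OR "))
  } else {
    # Both blank: return the rows of this group unchanged.
    list(Name = Name, Phone = Phone)
  },
  by = .(House, Street, Apt, City, Postal)]

print(res)
```

Plain `list()` is used inside `j` instead of `.()` only to keep the sketch conservative; the grouping and the if ... else are the same as in the code above.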

0

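I don't think there is a "pretty" solution to this problem; this kind of processing doesn't fit the dplyr workflow well. One workaround is to uniquely identify the houses with empty data in some way, so that they are not grouped together. One way to do that is to put "#row_number" in House when it is empty. Those rows will then not be grouped together, because each empty row gets a different number. Once the processing is done, you can simply replace the values starting with # with an empty string or NA: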

data %>% 
    mutate(House = if_else(House == "" | is.na(House), paste0("#", row_number()), House)) %>% 
    # does the processing... %>% 
    mutate(House = if_else(startsWith(House, "#"), "", House))
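Putting the pieces together, a self-contained sketch of this workaround might look like the following. The data is a cut-down copy of the question's dput; `.groups = "drop"` assumes dplyr 1.0 or later; and restoring NA rather than "" at the end is just one of the two options mentioned above. Note that, like the snippet above, it only looks at House and does not check Apt.

```r
library(dplyr)

# Toy copy of the question's data (only the columns used below).
data <- tibble(
  Name   = c("DUMAS PAUL", "DUNN S", "FERGUSON BOB", "FITSCHEN A",
             "BLACK CARY", "BLACK RUTH"),
  House  = c("2030", NA, NA, "3989", "2079", "2079"),
  Street = "GREEN ROAD",
  Apt    = NA_character_,
  City   = "DESERT",
  Postal = "Z0K2K1",
  Phone  = c("999-577-3789", "999-577-3256", "999-577-3772",
             "999-577-3556", "999-577-3779", "999-577-3779")
)

result <- data %>%
  # Give every empty House a unique placeholder so it forms its own group.
  mutate(House = if_else(is.na(House) | House == "",
                         paste0("#", row_number()), House)) %>%
  group_by(House, Street, Apt, City, Postal) %>%
  summarise(
    Name  = paste(Name, collapse = ", AND "),
    Phone = paste(unique(Phone), collapse = " OR "),
    .groups = "drop"
  ) %>%
  # Turn the placeholders back into missing values.
  mutate(House = if_else(startsWith(House, "#"), NA_character_, House))

print(result)
```

The placeholder trick relies on no real House value starting with "#"; if that could happen, a different marker would be needed.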