2016-12-10

Merging data

I have the following data:

Data <- data.frame(Project=c(123,123,123,123,123,123,124,124,124,124,124,125,125,125,126,126), 
        Value=c(1,4,7,3,8,9,8,3,2,5,6,2,2,1,8,3), 
        OldValue=c("","Open","In Progress","Complete","Open","In Progress","Complete","Open","In Progress","System Declined","In Progress","","Open","In Progress","In Progress",""), 
        NewValue=c("Open","In Progress","Complete","Open","In Progress","Complete","Open","In Progress","System Declined","In Progress","Complete","Open","In Progress","Complete","","In Progress")) 

Data$First <- ifelse(((Data$OldValue==""|Data$OldValue=="Complete"|Data$OldValue=="System Declined")&Data$NewValue=="Open"),Data$Value,NA) 
Data$Second <- ifelse(((Data$OldValue=="Open"|Data$OldValue=="Complete"|Data$OldValue=="System Declined")&Data$NewValue=="In Progress"),Data$Value,NA) 
Data$Third <- ifelse(((Data$NewValue=="Complete"|Data$NewValue=="System Declined")&Data$OldValue=="In Progress"),Data$Value,NA) 


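For each unique Project ID, I want to combine the First, Second and Third values into a single row. I only want to do this if the values in the NewValue column follow one of the two sequences below:

Open, In Progress, Complete or Open, In Progress, System Declined

So project 123 will have two rows of data and project 125 will have one. Rows 10 and 11 will be excluded because they do not follow either sequence above.

What is the simplest way to code this?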

Answers


A solution using dplyr:

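library(dplyr) 

# Explicit status codes: Open = 1, In Progress = 2, Complete = 3, System Declined = 4.
# match() avoids relying on factor level order (strings are not factors by default in R >= 4.0).
status_levels <- c("Open", "In Progress", "Complete", "System Declined")

Data %>% group_by(Project) %>% 
    mutate(
     fl = match(as.character(NewValue), status_levels), 
     # Codes of the previous two rows and the current row, e.g. "123"
     flag = paste(lag(fl, 2, default = 0L), 
        lag(fl, 1, default = 0L), 
        fl, sep = ''), 
     # Values of the same three rows, comma-separated
     merge = paste(lag(Value, 2, default = 0), 
         lag(Value, 1, default = 0), 
         Value, sep = ',') 
    ) %>% 
    # Keep rows that end an Open, In Progress, Complete / System Declined run
    filter(flag == '123' | flag == '124') %>% 
    select(Project, merge) 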

# Project merge 
#  <dbl> <chr> 
# 1  123 1,4,7 
# 2  123 3,8,9 
# 3  124 8,3,2 
# 4  125 2,2,1 
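The lag-and-paste step above amounts to a sliding window of width three over each project's status codes. Below is a minimal base R sketch of that idea for a single hypothetical project, assuming the coding Open = 1, In Progress = 2, Complete = 3. The names `status`, `codes`, `lag1`, `lag2` and `window` are illustrative only and do not come from the answer.

```r
# Hypothetical status sequence for one project (illustrative data only)
status <- c("Open", "In Progress", "Complete", "Open", "In Progress")

# Assumed explicit coding, independent of factor level order
codes <- match(status, c("Open", "In Progress", "Complete"))
n <- length(codes)

# Shift the codes by one and two positions, padding the start with 0
lag1 <- c(0, codes)[seq_len(n)]
lag2 <- c(0, 0, codes)[seq_len(n)]

# One three-character window per row: two previous codes plus the current one
window <- paste0(lag2, lag1, codes)
window
# "001" "012" "123" "231" "312"

# Rows where an Open, In Progress, Complete run has just finished
which(window == "123")
# 3
```

Only the row that closes a full run produces the target string, which is why the filter in the answer returns one row per completed sequence.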

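This would be one way to achieve your goal. I wanted to create a numeric sequence that can represent the two patterns you specified (i.e., Open, In Progress, Complete and Open, In Progress, System Declined). For this reason, I collapsed the factor levels into three using fct_collapse(), then converted the new factor levels to numbers. Next, I wanted to create a subgroup within each Project, which I did in the second mutate(), producing group. The following task is to change the order of the elements in First, Second and Third, because you want the numbers in one row. So I used sort(). One condition applies to this operation: identical(check[1:3], as.numeric(1:3)). If you have either of the two patterns, you should expect a sequence of 1, 2, 3 in check. This logical test is run for each group, and only when it holds is sort() applied to the three columns within each group defined by Project and group. Finally, I removed check and group, which I used for this whole operation (group is a grouping variable, so it still appears in the printed output).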

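library(dplyr) 
library(forcats) 

Data %>% 
# Recode NewValue: Open = 1, In Progress = 2, Complete / System Declined = 3
mutate(check = as.numeric(
        as.character(fct_collapse(NewValue, 
              `1` = "Open", 
              `2` = "In Progress", 
              `3` = c("Complete", "System Declined"))))) %>% 
group_by(Project) %>% 
# Start a new subgroup whenever check does not increase by exactly 1
mutate(group = cumsum(c(TRUE, diff(check) != 1))) %>% 
group_by(Project, group) %>% 
# across() replaces mutate_at() with the deprecated funs()
mutate(across(First:Third, 
      ~ if (identical(check[1:3], as.numeric(1:3))) { 
       sort(.x, na.last = TRUE)} else {.x} 
     )) %>% 
# group is a grouping variable, so select() keeps it in the output
select(-check, -group) 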

#    group Project Value        OldValue        NewValue First Second Third
#    <int>   <dbl> <dbl>          <fctr>          <fctr> <dbl>  <dbl> <dbl>
#  1     1     123     1                            Open     1      4     7
#  2     1     123     4            Open     In Progress    NA     NA    NA
#  3     1     123     7     In Progress        Complete    NA     NA    NA
#  4     2     123     3        Complete            Open     3      8     9
#  5     2     123     8            Open     In Progress    NA     NA    NA
#  6     2     123     9     In Progress        Complete    NA     NA    NA
#  7     1     124     8        Complete            Open     8      3     2
#  8     1     124     3            Open     In Progress    NA     NA    NA
#  9     1     124     2     In Progress System Declined    NA     NA    NA
# 10     2     124     5 System Declined     In Progress    NA      5    NA
# 11     2     124     6     In Progress        Complete    NA     NA     6
# 12     1     125     2                            Open     2      2     1
# 13     1     125     2            Open     In Progress    NA     NA    NA
# 14     1     125     1     In Progress        Complete    NA     NA    NA
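The subgroup step relies on `cumsum(c(TRUE, diff(check) != 1))`, which starts a new group every time the code fails to rise by exactly one. A small base R sketch of that trick on a hypothetical `check` vector follows; the vector and the names `runs` and `ok` are made up for illustration and are not part of the answer.

```r
# Hypothetical recoded statuses: two full 1-2-3 runs, then a partial 2-3 run
check <- c(1, 2, 3, 1, 2, 3, 2, 3)

# A new group begins wherever the step from the previous value is not +1
group <- cumsum(c(TRUE, diff(check) != 1))
group
# 1 1 1 2 2 2 3 3

# Apply the same test as in the answer to each run
runs <- split(check, group)
ok <- sapply(runs, function(x) identical(x[1:3], as.numeric(1:3)))
ok
#     1     2     3
#  TRUE  TRUE FALSE
```

The third run fails the test because indexing `x[1:3]` on a two-element run pads it with `NA`, so incomplete sequences are left unsorted.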

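Thanks for your help, but I ran into a problem when applying it to my larger data set. I realized that several instances of OldValue/NewValue relationships were not covered by my original logic (Data$First, Data$Second, Data$Third). I updated the original code in the question to reflect the exact instances I am talking about. When I rerun the code with the new data, I get the error "R Session Aborted. R encountered a fatal error. The session was terminated." – Dfeld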


@Dfeld How large is your data? I have to get ready for work now, so I will take a look at your issue later. I hope you don't mind. – jazzurro


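About 10,000 rows, imported from a csv file. I have successfully tested the code on a few thousand rows, but instances like the ones I explained above and added to the code cause R to crash. Thanks again, jazzurro. No rush. – Dfeld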