2016-07-29 55 views
2

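R: delete rows based on values in previous rows

I'm new to R and am trying to delete rows based on the values in previous rows. Sample data: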

Cust_ID | Date     | Value 
500219 | 2016-04-11 12:00:00 | 0 
500219 | 2016-04-12 16:00:00 | 0 
500219 | 2016-04-14 11:00:00 | 1 
500219 | 2016-04-15 12:00:00 | 1 
500219 | 2016-05-23 09:00:00 | 0 
500219 | 2016-05-02 19:00:00 | 0 
500220 | 2016-04-11 12:00:00 | 0 
500220 | 2016-04-14 11:00:00 | 1 
500220 | 2016-04-15 12:00:00 | 1 
500220 | 2016-05-23 09:00:00 | 0 
500220 | 2016-05-02 19:00:00 | 0 
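
For anyone who wants to run the answers on this data, the table above can be recreated roughly as follows. This is only a sketch: the column types (integer IDs and values, character dates) are an assumption, and the answers refer to the same data as df or df1.

```r
# Sample data from the question; column types are assumed, not stated by the asker
df <- data.frame(
  Cust_ID = c(rep(500219L, 6), rep(500220L, 5)),
  Date = c("2016-04-11 12:00:00", "2016-04-12 16:00:00", "2016-04-14 11:00:00",
           "2016-04-15 12:00:00", "2016-05-23 09:00:00", "2016-05-02 19:00:00",
           "2016-04-11 12:00:00", "2016-04-14 11:00:00", "2016-04-15 12:00:00",
           "2016-05-23 09:00:00", "2016-05-02 19:00:00"),
  Value = c(0L, 0L, 1L, 1L, 0L, 0L, 0L, 1L, 1L, 0L, 0L),
  stringsAsFactors = FALSE
)

# Some answers call the same object df1
df1 <- df
```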

For each Cust_ID, I want to keep only the rows up to and including the rows where Value = 1, giving this result:

Cust_ID | Date     | Value 
500219 | 2016-04-11 12:00:00 | 0 
500219 | 2016-04-12 16:00:00 | 0 
500219 | 2016-04-14 11:00:00 | 1 
500219 | 2016-04-15 12:00:00 | 1 
500220 | 2016-04-11 12:00:00 | 0 
500220 | 2016-04-14 11:00:00 | 1 
500220 | 2016-04-15 12:00:00 | 1 

Any help would be greatly appreciated!

Answers

2

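Here is a split-apply-combine approach that keeps, for each customer, the rows where Value is 1 together with the rows before the first 1.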

# split data by customer ID 
myList <- split(df, df$Cust_ID) 
# loop through ID list, drop desired rows, rbind resulting list 
dfNew <- do.call(rbind, lapply(myList, function(i) { 
           drop <- which(i$Value==1) 
           i[c(1:drop[1], drop[-1]),]})) 

which returns

dfNew 
     Cust_ID     Date Value 
500219.1 500219 2016-04-11 12:00:00  0 
500219.2 500219 2016-04-12 16:00:00  0 
500219.3 500219 2016-04-14 11:00:00  1 
500219.4 500219 2016-04-15 12:00:00  1 
500220.7 500220 2016-04-11 12:00:00  0 
500220.8 500220 2016-04-14 11:00:00  1 
500220.9 500220 2016-04-15 12:00:00  1 
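
To see what the index expression c(1:drop[1], drop[-1]) selects, it may help to run it on a bare vector. The vectors below are made up for illustration. The second one suggests that a 0 sitting between two 1s would not be selected, which does not occur in the sample data but is worth knowing about.

```r
# Value column for one hypothetical customer
value <- c(0, 0, 1, 1, 0, 0)

# positions of the rows where Value equals 1
drop <- which(value == 1)          # 3 4

# rows 1 up to the first 1, then any later rows that are themselves 1
idx <- c(1:drop[1], drop[-1])      # 1 2 3 4

# hypothetical case with a 0 between two 1s: position 3 is skipped
value2 <- c(0, 1, 0, 1, 0)
drop2 <- which(value2 == 1)        # 2 4
idx2 <- c(1:drop2[1], drop2[-1])   # 1 2 4
```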

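Note that this solution will not work if there are customer IDs whose value never equals 1. If you want to keep the observations for customer IDs that never reach the 1 threshold, use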

dfNew <- do.call(rbind, lapply(myList, function(i) { 
           drop <- which(i$Value==1) 
           if(length(drop) != 0) i[c(1:drop[1], drop[-1]),] 
           else i})) 
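
The reason the length check matters can be reproduced on a single vector. In this sketch (the helper name keep_idx is hypothetical), which() returns an empty vector when no value equals 1, so drop[1] is NA and 1:NA fails.

```r
# A hypothetical customer whose Value never reaches 1
value <- c(0, 0, 0)
drop <- which(value == 1)   # integer(0)
first <- drop[1]            # NA

# 1:first fails because first is NA; capture the message instead of stopping
msg <- tryCatch(
  { 1:first; "no error" },
  error = function(e) conditionMessage(e)
)

# Guarded version of the index computation (keep_idx is an illustrative name)
keep_idx <- function(value) {
  drop <- which(value == 1)
  if (length(drop) != 0) c(1:drop[1], drop[-1]) else seq_along(value)
}

all_rows  <- keep_idx(c(0, 0, 0))      # every row is kept
some_rows <- keep_idx(c(0, 1, 1, 0))   # rows 1 to 3
```

Returning i (or here, every index) in the else branch keeps such customers. Omitting the else branch makes the function return NULL for them, and rbind then leaves them out of the result.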
+0

Thank you for the solution. Unfortunately, I get the following error: Error in 1:drop[1] : NA/NaN argument. Any help would be greatly appreciated! –

+0

My guess is that there are IDs with no values equal to 1. Is that the case? If so, what do you want to do with them? – lmo

+1

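Thanks for your answer. It looks like drop throws an error when it is empty. The following works! dfNew <- do.call(rbind, lapply(myList, function(i) { drop <- which(i$Value == 1); if (length(drop) != 0) { i[c(1:drop[1], drop[-1]), ] } })) –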

2

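We can use data.table. Convert the 'data.frame' to a 'data.table' (setDT(df1)). Grouped by 'Cust_ID', we take the sequence up to the max of the indices where 'Value' is 1, get the corresponding row indices (.I), and use them to subset the rows of the data.table.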

library(data.table) 
setDT(df1)[df1[, if(any(Value == 1)) .I[seq(max(which(Value == 1)))] 
           else .I[1:.N] , by = Cust_ID]$V1] 
#  Cust_ID    Date Value 
#1: 500219 2016-04-11 12:00:00  0 
#2: 500219 2016-04-12 16:00:00  0 
#3: 500219 2016-04-14 11:00:00  1 
#4: 500219 2016-04-15 12:00:00  1 
#5: 500220 2016-04-11 12:00:00  0 
#6: 500220 2016-04-14 11:00:00  1 
#7: 500220 2016-04-15 12:00:00  1 
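
The per-group index logic can be checked without data.table by applying it to a plain vector. This is a sketch with made-up values, and rows_to_keep is a hypothetical helper name. Without the any() guard, max() of an empty which() result would be -Inf and seq() would fail.

```r
# Per-group rule from the answer, applied to a bare vector
rows_to_keep <- function(value) {
  if (any(value == 1)) {
    # positions 1 up to the last position where value equals 1
    seq(max(which(value == 1)))
  } else {
    # no 1 in the group: keep every position
    seq_along(value)
  }
}

with_ones    <- rows_to_keep(c(0, 1, 1, 0, 0))   # 1 2 3
without_ones <- rows_to_keep(c(0, 0))            # 1 2
```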

Or use a similar approach with dplyr

library(dplyr) 
df1 %>% 
    group_by(Cust_ID) %>% 
    slice(if(any(Value==1)) seq(max(which(Value==1))) else row_number()) 
# Cust_ID    Date Value 
#  <int>    <chr> <int> 
#1 500219 2016-04-11 12:00:00  0 
#2 500219 2016-04-12 16:00:00  0 
#3 500219 2016-04-14 11:00:00  1 
#4 500219 2016-04-15 12:00:00  1 
#5 500220 2016-04-11 12:00:00  0 
#6 500220 2016-04-14 11:00:00  1 
#7 500220 2016-04-15 12:00:00  1 
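
If neither package is available, the same rule (keep everything up to the last 1 in each group, and keep groups without a 1 untouched) can probably be expressed in base R with ave(). This is a swapped-in technique, not part of the answer above, and the small data frame is invented for illustration.

```r
# Made-up data: customer 3 never reaches Value == 1
df <- data.frame(
  Cust_ID = c(1, 1, 1, 1, 2, 2, 2, 3, 3),
  Value   = c(0, 1, 0, 0, 0, 0, 1, 0, 0)
)

# For each customer, flag the rows up to and including the last Value == 1
keep <- ave(df$Value == 1, df$Cust_ID, FUN = function(is_one) {
  if (any(is_one)) {
    seq_along(is_one) <= max(which(is_one))
  } else {
    rep(TRUE, length(is_one))
  }
})

result <- df[keep, ]
```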
+0

An alternative: setDT(mydf)[, .SD[seq(max(which(Value == 1)))], by = Cust_ID] (more readable, but probably slower on large datasets) – Jaap

+0

@Frank It is corrected. Thanks for that edge case. – akrun

+1

@Frank Thanks for the comments. – akrun

0

A loop approach:

cust <- 0 
keep <- FALSE 
keepers <- vector(mode = "logical", length = nrow(df)) 

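## walk through the data frame backwards (rev(seq_len()) is safe for zero rows)
for (rec in rev(seq_len(nrow(df)))) {
    ## have we been working with this customer?
    if (df$Cust_ID[rec] == cust) {
        ## keep a 1, and every row above a row that was already kept
        if (df$Value[rec] == 1 || keep) {
            keepers[rec] <- TRUE
            keep <- TRUE
        }
    } else {
        ## last row of a new customer: start keeping only if it is a 1
        cust <- df$Cust_ID[rec]
        if (df$Value[rec] == 1) {
            keepers[rec] <- TRUE
            keep <- TRUE
        } else {
            keep <- FALSE
        }
    }
}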

df <- df[keepers,] 
df
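
The backward walk amounts to asking, for every row, whether a 1 occurs at that row or later within the same customer. A vectorised sketch of that idea, using a reversed cumulative maximum per group, is shown below. The data frame is made up, and like the loop this version drops customers that never reach 1.

```r
# Made-up data: customer 3 never reaches Value == 1
df <- data.frame(
  Cust_ID = c(1, 1, 1, 1, 2, 2, 2, 3, 3),
  Value   = c(0, 1, 0, 0, 0, 0, 1, 0, 0)
)

# Within each customer: 1 if a Value == 1 occurs at this row or any later row
seen_later <- ave(df$Value, df$Cust_ID,
                  FUN = function(v) rev(cummax(rev(v == 1))))

keepers <- seen_later == 1
result <- df[keepers, ]
```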