2015-12-16 40 views

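I have a slightly complicated problem to solve: deleting consecutive rows within groups of rows in a large dataset in R.

Suppose I have this dataset: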

Id Name Price sales Profit Month Category Mode Supplier 
1 A  0  0  0  1  X K  John 
1 A  0  0  0  2  X K  John 
1 A  0  0  0  3  X K  John 
1 A  2  5  0  4  X L  Sam 
1 A  2  3  4  5  X L  Sam 
1 A  0  0  0  6  X L  Sam 
2 C  2  4  9  1  X M  John 
2 C  0  0  0  2  X L  John 
2 C  0  0  0  3  X K  John 
2 C  2  8  0  4  Y M  John 
2 C  2  8  10  5  Y K  John 
2 C  0  0  0  6  Y K  John 
3 E  0  0  0  1  Y M  Sam 
3 E  0  0  0  2  Y L  Sam 
3 E  2  5  9  3  Y M  Sam 
3 E  0  0  0  4  Z M  Kyle 
3 E  0  0  0  5  Z L  Kyle 
3 E  0  0  0  6  Z M  Kyle 
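
For anyone who wants to run the code in this thread, the table above can be loaded as a data frame roughly like this. It is only a sketch: the name mydf is chosen to match the attempt shown later in the question, and the column types are whatever read.table infers.

```r
# Sample data from the question, read from inline text.
# The object name mydf is an assumption that matches the asker's later code.
mydf <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
Id Name Price sales Profit Month Category Mode Supplier
1 A 0 0 0 1 X K John
1 A 0 0 0 2 X K John
1 A 0 0 0 3 X K John
1 A 2 5 0 4 X L Sam
1 A 2 3 4 5 X L Sam
1 A 0 0 0 6 X L Sam
2 C 2 4 9 1 X M John
2 C 0 0 0 2 X L John
2 C 0 0 0 3 X K John
2 C 2 8 0 4 Y M John
2 C 2 8 10 5 Y K John
2 C 0 0 0 6 Y K John
3 E 0 0 0 1 Y M Sam
3 E 0 0 0 2 Y L Sam
3 E 2 5 9 3 Y M Sam
3 E 0 0 0 4 Z M Kyle
3 E 0 0 0 5 Z L Kyle
3 E 0 0 0 6 Z M Kyle
")

# Quick look at the structure
str(mydf)
```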

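Now I want to remove rows from this data frame for each product Id where Price, sales and Profit are all zero for three consecutive months. The rows should be deleted only within the groups where that happens, by Id.

Expected output in this case: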

Id Name Price sales Profit Month Category Mode Supplier 
1 A  2  5  0  4  X L  Sam 
1 A  2  3  4  5  X L  Sam 
1 A  0  0  0  6  X L  Sam 
2 C  2  4  9  1  X M  John 
2 C  0  0  0  2  X L  John 
2 C  0  0  0  3  X K  John 
2 C  2  8  0  4  Y M  John 
2 C  2  8  10  5  Y K  John 
2 C  0  0  0  6  Y K  John 
3 E  0  0  0  1  Y M  Sam 
3 E  0  0  0  2  Y L  Sam 
3 E  2  5  9  3  Y M  Sam 

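This is just a reproducible sample; my original data has more than 800,000 rows, so I am looking for something that also works on large datasets.

I have already tried the approach that was suggested to me before: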

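library(data.table) 
# N = length of each run of all-zero (or not all-zero) rows within an Id
as.data.table(mydf)[, N := .N, by = .(Id, rleid(Price == 0 & sales == 0 & Profit == 0))][ 
    # drop all-zero rows that belong to a run of three or more
    !(Price == 0 & sales == 0 & Profit == 0 & N >= 3)] 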

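This time, when I tried it, I received the error 'could not find rleid function', even though I have the data.table package installed and loaded.

PS: I have asked this question before, but the solutions I got only worked on small data, and I did not receive an answer that solves this kind of problem on a large dataset. That is why I am asking again.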

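To remove the rows where a product Id has zeros for three consecutive months, do the rows only need to share the same Id, or do other columns such as Category, Mode or Supplier have to match as well? – Sam

Maybe you need to update your installed 'data.table' package. –

Do you have 'packageVersion("data.table") >= "1.9.6"'? Take a look at the [version history](https://github.com/Rdatatable/data.table). – lukeA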
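
The version check suggested in the comments can be sketched as below. The minimum version "1.9.6" is taken from the comment above; the variable names are made up for illustration, and the code only reports what it finds rather than changing anything.

```r
# Minimum version quoted in the comment above (an assumption taken from the thread)
min_version <- "1.9.6"

if (requireNamespace("data.table", quietly = TRUE)) {
  installed <- packageVersion("data.table")
  # Ask the package itself whether it exports rleid()
  has_rleid <- "rleid" %in% getNamespaceExports("data.table")
  message("data.table ", as.character(installed),
          if (installed >= min_version) " meets" else " is older than",
          " version ", min_version,
          "; rleid() exported: ", has_rleid)
} else {
  has_rleid <- FALSE
  message("data.table is not installed")
}
```

If has_rleid comes back FALSE, reinstalling the package with install.packages("data.table") and restarting the R session is the usual next step.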

Answers

This is fairly "homemade", but maybe it will help (my example is a bit simpler, but the idea is the same):

library("dplyr") 

# just an example: 

month <- rep(1:7, 3) 
id <- rep(c("A", "C", "E"), each=7) 
price <- c(0,0,0,2,2,0,2,0,0,2,2,0,0,0,2,0,0,0, 1, 1, 1) 
sales <- c(0,0,0,4,3,0,2,0,0,1,3,0,0,0,3,0,0,0, 1, 1, 1) 
supplier <- rep(c("john", "anna", "ben"), 7) 

dane <- data.frame(id, price, sales, month, supplier) 

# zero is TRUE when both sales and price are 0 in that row. 
# lag() shifts a vector down (the first elements would become NA) and lead() 
# shifts it up; default = FALSE fills those gaps, and group_by(id) keeps the 
# shifts from crossing into another id. 
# marker is TRUE on the third row of three consecutive zero rows; 
# the two lead() calls then also mark the two rows above it. 

new_data <- dane %>% 
    group_by(id) %>% 
    mutate(zero = sales == 0 & price == 0, 
           marker = zero & lag(zero, default = FALSE) & lag(zero, 2, default = FALSE), 
           remove_row = marker | lead(marker, default = FALSE) | lead(marker, 2, default = FALSE)) %>% 
    ungroup() %>% 
    # I take only rows which are not marked: 
    filter(!remove_row) %>% 
    select(-zero, -marker, -remove_row) 

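Here is my answer. This code works on consecutive rows, so it removes the rows even when the three months are not adjacent, for example months: 2, 5, 6.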

#Generate data 
month <- rep(1:7, 3) 
id <- rep(c("1", "2", "3"), each=7) 
price <- c(0,0,0,2,2,0,2,0,0,2,2,0,0,0,2,0,0,0, 1, 1, 1) 
sales <- c(0,0,0,4,3,0,2,0,0,1,3,0,0,0,3,0,0,0, 1, 1, 1) 
test <- data.frame(id, price, sales, month) 

#Calculate how many consecutive times a combination of id, 
#price & sales is encountered (sep="_" keeps the pasted keys unambiguous) 
sequence <- rle(paste(test$id,test$price,test$sales,sep="_")) 

#flag the runs to drop: zero price and zero sales, repeated 3 or more times 
#(without the zero check, the three rows with price 1 and sales 1 would go too) 
drop_run <- grepl("_0_0$", sequence$values) & sequence$lengths >= 3 

#expand the per-run flags to one logical value per row to keep 
index2 <- !rep(drop_run, times=sequence$lengths) 

#store results: 
test2 <- test[index2,]
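

For the data in the question itself, the same run-length idea can be sketched in base R without rleid, which may help when data.table cannot be updated. This is only a sketch under a few assumptions: the helper name drop_zero_runs is made up for illustration, the rows are assumed to be sorted by Id and Month already, and only the columns the rule depends on are rebuilt here.

```r
# Sample from the question, reduced to the columns the rule depends on
mydf <- data.frame(
  Id     = rep(1:3, each = 6),
  Month  = rep(1:6, times = 3),
  Price  = c(0, 0, 0, 2, 2, 0,  2, 0, 0, 2, 2, 0,  0, 0, 2, 0, 0, 0),
  sales  = c(0, 0, 0, 5, 3, 0,  4, 0, 0, 8, 8, 0,  0, 0, 5, 0, 0, 0),
  Profit = c(0, 0, 0, 0, 4, 0,  9, 0, 0, 0, 10, 0, 0, 0, 9, 0, 0, 0)
)

# Hypothetical helper (not from any package): drop rows that belong to a run
# of at least min_run all-zero rows within one Id.
# Assumes df is already sorted by Id and Month.
drop_zero_runs <- function(df, min_run = 3) {
  zero <- df$Price == 0 & df$sales == 0 & df$Profit == 0
  # A new run starts whenever the Id or the zero flag changes
  runs <- rle(paste(df$Id, zero))
  # Length of the run each row belongs to, one value per row
  run_len <- rep(runs$lengths, times = runs$lengths)
  df[!(zero & run_len >= min_run), , drop = FALSE]
}

result <- drop_zero_runs(mydf)
print(result)
```

Everything here is vectorised (rle, rep and logical indexing), so it should scale to a few hundred thousand rows, although that has not been measured on the asker's data.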