Loop / if-else on an R data frame


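I'm really stuck writing a loop in R. I have also tried using ifelse, but can't seem to get a result.

I have a data frame, shown below, which gives the customer ID, the date they travelled, the mode, and the journey start time: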

ID  | Date  | Mode | Time 
------ | --------- | ------- | ----- 
1234 | 12/10/16 | Bus  | 120 
1234 | 12/10/16 | Bus  | 130 
1234 | 12/10/16 | Bus  | 290 
1234 | 12/10/16 | Train | 310 
1234 | 12/10/16 | Bus  | 330 
4567 | 12/10/16 | Bus  | 220 
4567 | 12/10/16 | Bus  | 230 
4567 | 13/10/16 | Bus  | 290 
4567 | 13/10/16 | Bus  | 450 
4567 | 14/10/16 | Train | 1000

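So on 12/10, customer 1234 made 4 bus journeys and 1 train journey.

I want to create a fifth column that identifies whether journey stages are linked, i.e. whether the second journey is linked to the first, whether the third is linked to the second, and so on (where 1 = linked, 0 = not linked).

The following conditions must apply:

The journeys are by the same person and take place on the same day
2 bus journeys are within 60 minutes of one another (so a bus journey and a train journey within 60 minutes of one another would not be linked)
If the (i+1)th and the ith journey are linked, then the (i+1)th journey cannot be linked to the (i+2)th journey

I would like the output to look like this: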

ID  | Date  | Mode | Time | Linked 
------ | --------- | ------- | ----- | ----- 
1234 | 12/10/16 | Bus  | 120 | 0 
1234 | 12/10/16 | Bus  | 130 | 1 
1234 | 12/10/16 | Bus  | 290 | 0 
1234 | 12/10/16 | Train | 310 | 0 
1234 | 12/10/16 | Bus  | 330 | 0 
4567 | 12/10/16 | Bus  | 220 | 0 
4567 | 12/10/16 | Bus  | 230 | 1 
4567 | 13/10/16 | Bus  | 290 | 0 
4567 | 13/10/16 | Bus  | 450 | 0 
4567 | 14/10/16 | Train | 1000 | 0

Any help would be greatly appreciated!

Source

2016-10-18 JassiL

Showing your own effort would be greatly appreciated! – 989

I've literally gotten nowhere – JassiL

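I like Grothendieck's answer, but it may not be that easy for someone new to R. So let's do it in a less efficient way that shows you the steps to take. I will use the same data frame naming convention as Grothendieck.

Determine whether the time between journeys is within 60 minutes. Loop over all rows of the data frame: if two consecutive rows belong to the same account and use the same mode, check whether they are less than 60 minutes apart. If all three conditions check out, set linked to 1. Otherwise, set linked to 0.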

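df <- DF           # DF is the input data frame from G. Grothendieck's answer
df$linked <- 0     # initialise the column; row 1 has no previous journey
for (i in 2:nrow(df)) {
    if (df$ID[i] == df$ID[i-1]) {                   # same customer
        if (df$Mode[i] == df$Mode[i-1]) {           # same mode of transport
            if ((df$Time[i] - df$Time[i-1]) < 60) { # less than 60 minutes apart
                df$linked[i] <- 1
            } else {
                df$linked[i] <- 0
            }
        } else {
            df$linked[i] <- 0
        }
    } else {
        df$linked[i] <- 0
    }
}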
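The loop above compares ID, Mode and Time only. As a minimal sketch, assuming the question's rules are read literally, the same idea can also compare Date and refuse to chain links (a journey already linked to its predecessor does not let the next journey link to it). The names `link_journeys`, `window` and the `toy` data frame are hypothetical and not from the thread.

```r
# Hypothetical helper (not from the thread): flag journeys linked to the previous row.
link_journeys <- function(d, window = 60) {
  linked <- integer(nrow(d))          # every journey starts unlinked
  for (i in seq_len(nrow(d))[-1]) {   # rows 2..n; empty when there are fewer than 2 rows
    same_trip <- d$ID[i] == d$ID[i - 1] &&
      d$Date[i] == d$Date[i - 1] &&
      d$Mode[i] == d$Mode[i - 1]
    close_in_time <- (d$Time[i] - d$Time[i - 1]) < window
    # No chaining: if row i-1 is itself linked, row i cannot link to it
    if (same_trip && close_in_time && linked[i - 1] == 0L) {
      linked[i] <- 1L
    }
  }
  linked
}

# Toy data, invented for illustration: three bus journeys 10 minutes apart
toy <- data.frame(ID = 1, Date = "12/10/16", Mode = "Bus",
                  Time = c(100, 110, 120), stringsAsFactors = FALSE)
toy$linked <- link_journeys(toy)
toy$linked  # 0 1 0
```

Returning a vector instead of modifying the data frame inside the function keeps the helper easy to test and lets the caller choose the column name.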

Source

2016-10-18 12:57:17 JJFord3

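Don't you want 'df$linked[i]' instead of 'df$linked'? Also: +1 for the logic, but note this may be many times slower than G. Grothendieck's answer ... –

Sure, G. Grothendieck's answer could be less terse, but R actually encourages you to use vectorized operations such as 'diff()' and 'cumsum()'. Vectorized operations can also be explained easily to people new to R. –

This is what I'm trying, but I can't get it to work because it gives an error saying: – JassiL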

1) ave. Try this:

transform(DF, linked = ave(Time, ID, Date, cumsum(c(FALSE, Mode[-1] != Mode[-nrow(DF)])), 
     FUN = function(x) c(0, diff(x) < 60)))

giving:

     ID     Date  Mode Time linked
1  1234 12/10/16   Bus  120      0
2  1234 12/10/16   Bus  130      1
3  1234 12/10/16   Bus  290      0
4  1234 12/10/16 Train  310      0
5  1234 12/10/16   Bus  330      0
6  4567 12/10/16   Bus  220      0
7  4567 12/10/16   Bus  230      1
8  4567 13/10/16   Bus  290      0
9  4567 13/10/16   Bus  450      0
10 4567 14/10/16 Train 1000      0
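The last grouping argument passed to ave in the call above is a run id: cumsum over a "mode changed" flag gives every stretch of consecutive rows with the same mode its own label, and diff then measures the gaps within a stretch. A small sketch of that mechanism on toy vectors (the values and the names `changed`, `run_id`, `gaps` and `linked_run0` are invented for illustration):

```r
# Toy vectors, invented for illustration
Mode <- c("Bus", "Bus", "Train", "Bus")
Time <- c(120, 130, 310, 330)

# TRUE wherever the mode differs from the previous row; the first row gets FALSE
changed <- c(FALSE, Mode[-1] != Mode[-length(Mode)])

# The running count of changes labels each run of identical modes
run_id <- cumsum(changed)          # 0 0 1 2

# Within one run, diff() gives the gaps between consecutive start times
gaps <- diff(Time[run_id == 0])    # 10
linked_run0 <- c(0, gaps < 60)     # 0 1
```

Because the final Bus row gets a different run id from the first two, it is never compared with them, which is why a train journey in between breaks the link.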

2) sqldf. Here is a solution using sqldf.

library(sqldf) 
sqldf("select a.*, coalesce(a.ID = b.ID and 
          a.Date = b.Date and 
          a.Mode = b.Mode and 
          a.Time < b.Time + 60, 0) linked 
     from DF a left join DF b on a.rowid = b.rowid + 1")

3) data.table. Note that data.table tends to be fast and memory-efficient, and may be able to handle data sizes in memory that the other approaches cannot.

library(data.table) 

dt <- as.data.table(DF) 
dt[, linked := (Time < shift(Time, fill = -60) + 60) * 
       (Mode == shift(Mode, fill = Mode[1])), by = "ID,Date"]

4) dplyr

library(dplyr) 
DF %>% 
    group_by(ID, Date) %>% 
    mutate(linked = (Time < lag(Time, default = -Inf) + 60) * 
        (Mode == lag(Mode, default = Mode[1]))) %>% 
    ungroup()

giving a similar answer.

Note: The input DF in reproducible form is:

Lines <- 
"ID  | Date  | Mode | Time 
------ | --------- | ------- | ----- 
1234 | 12/10/16 | Bus  | 120 
1234 | 12/10/16 | Bus  | 130 
1234 | 12/10/16 | Bus  | 290 
1234 | 12/10/16 | Train | 310 
1234 | 12/10/16 | Bus  | 330 
4567 | 12/10/16 | Bus  | 220 
4567 | 12/10/16 | Bus  | 230 
4567 | 13/10/16 | Bus  | 290 
4567 | 13/10/16 | Bus  | 450 
4567 | 14/10/16 | Train | 1000" 
DF <- read.table(text = Lines, header = TRUE, sep = "|", strip.white = TRUE, 
comment = "-", as.is = TRUE)

Update: Fixed.

Source

2016-10-18 12:37:49

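Appreciate the response - is there a way to avoid reading the table in? The table is very large, so this doesn't feel like the right approach – JassiL

R only works on data frames held in memory, so you have to load the data. If your data is in a database, you could try using e.g. the 'dplyr' package. It translates R statements into SQL that runs over the database connection and then returns only the subset you are interested in. –

Have added an sqldf solution. As it stands, it assumes DF is in memory, but wherever the data lives, the SQL can be adapted to a database setup. –

Using the dplyr package: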

library(dplyr) 
DF %>% 
    # The journeys are for the same person, take place on the same day 
    # and on the same mode of transport 
    group_by(ID, Date, Mode) %>% 
    # 2 bus journeys are within 60 mins of one another 
    mutate(linked0 = c(Inf, diff(Time))<60, 
      # if the i+1th and the ith journey are linked, 
      # then the i+1th journey cannot be linked to the i+2th journey 
      linkedsum = cumsum(linked0), 
      linked = ifelse(linkedsum==1, linked0, 0)) 

      ID     Date  Mode  Time linked0 linkedsum linked
   <int>    <chr> <chr> <int>   <lgl>     <int>  <dbl>
1   1234 12/10/16   Bus   120   FALSE         0      0
2   1234 12/10/16   Bus   130    TRUE         1      1
3   1234 12/10/16   Bus   290   FALSE         1      0
4   1234 12/10/16 Train   310   FALSE         0      0
5   1234 12/10/16   Bus   330    TRUE         2      0
6   4567 12/10/16   Bus   220   FALSE         0      0
7   4567 12/10/16   Bus   230    TRUE         1      1
8   4567 13/10/16   Bus   290   FALSE         0      0
9   4567 13/10/16   Bus   450   FALSE         0      0
10  4567 14/10/16 Train  1000   FALSE         0      0

To do this in a database, see the dplyr database vignette.

Source

2016-10-18 13:26:53
