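How can I optimize and speed up a loop in R when working with a large dataset?

I am currently doing a data transformation. The data is not very large, about 190k rows.

I wrote a loop like this: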

for (i in 1:nrow(df2)){ 
#a 
record.a <- df[which(df$first_lat==df2[i,"third_lat"] 
      & df$first_lon==df2[i,"third_lon"] 
      & df$sixth_lat==df2[i,"fourth_lat"] 
      & df$sixth_lon==df2[i,"fourth_lon"] 
      & df[,4]==df2[i,4] 
      & df[,3]==df2[i,5]),] 
df2[i,18] <- ifelse(nrow(record.a) != 0,record.a$order_cnt,NA) 

#b 
record.b <- df[which(df$fifth_lat==df2[i,"third_lat"] 
      & df$fifth_lon==df2[i,"third_lon"] 
      & df$sixth_lat==df2[i,"second_lat"] 
      & df$sixth_lon==df2[i,"second_lon"] 
      & df[,4]==df2[i,4] 
      & df[,3]==df2[i,5]),] 
df2[i,19] <- ifelse(nrow(record.b) != 0,record.b$order_cnt,NA) 

#c 
record.c <- df[which(df$fifth_lat==df2[i,"first_lat"] 
      & df$fifth_lon==df2[i,"first_lon"] 
      & df$fourth_lat==df2[i,"second_lat"] 
      & df$fourth_lon==df2[i,"second_lon"] 
      & df[,4]==df2[i,4] 
      & df[,3]==df2[i,5]),] 
df2[i,20] <- ifelse(nrow(record.c) != 0,record.c$order_cnt,NA) 

#d 
record.d <- df[which(df$third_lat==df2[i,"first_lat"] 
      & df$third_lon==df2[i,"first_lon"] 
      & df$fourth_lat==df2[i,"sixth_lat"] 
      & df$fourth_lon==df2[i,"sixth_lon"] 
      & df[,4]==df2[i,4] 
      & df[,3]==df2[i,5]),] 
df2[i,21] <- ifelse(nrow(record.d) != 0,record.d$order_cnt,NA) 

#e 
record.e <- df[which(df$third_lat==df2[i,"fifth_lat"] 
      & df$third_lon==df2[i,"fifth_lon"] 
      & df$second_lat==df2[i,"sixth_lat"] 
      & df$second_lon==df2[i,"sixth_lon"] 
      & df[,4]==df2[i,4] 
      & df[,3]==df2[i,5]),] 
df2[i,22] <- ifelse(nrow(record.e) != 0,record.e$order_cnt,NA) 

#f 
record.f <- df[which(df$first_lat==df2[i,"fifth_lat"] 
      & df$first_lon==df2[i,"fifth_lon"] 
      & df$second_lat==df2[i,"fourth_lat"] 
      & df$second_lon==df2[i,"fourth_lon"] 
      & df[,4]==df2[i,4] 
      & df[,3]==df2[i,5]),] 
df2[i,23] <- ifelse(nrow(record.f) != 0,record.f$order_cnt,NA) 
}

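So basically, I need to fill 6 columns of df2 from df, each according to its own set of matching criteria. In the for loop, nrow(df2) is about 190k, and it runs extremely slowly. I checked the result with View(df2) and the values are correct. Is there any way to make it faster? I may need to apply the same transformation to larger datasets in the future.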

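df: df

df2: df2

The data is a grid on a map. df2 is basically a subset of df, but with 6 additional columns. Both df and df2 carry the same lon and lat information.

Each grid_id represents a hexagonal area on the map. Each hexagon is connected to six other hexagons through two pairs of lon and lat. What I want to do is find a specific value from the six surrounding hexagons (in df) and fill it into the columns (a, b, c, d, e, f) of df2. In addition, two other conditions must hold, namely hour and ten_mins_interval (df[,4]==df2[i,4] & df[,3]==df2[i,5]).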

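So I think the logic is:

For each grid_id with a given hour and ten_mins_interval in df2 (1 row),
find the corresponding 6 grid_ids (6 rows) with the same hour and ten_mins_interval in df;
from these 6 rows, fill order_cnt into columns a, b, c, d, e, f of df2.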
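The logic above can be sketched on toy data for a single direction (column a). This is only an illustration under assumptions: the toy values and the new column name `a` are made up here, `hour` and `ten_mins_interval` are assumed to be the actual column names, and each key combination is assumed to occur at most once in df.

```r
# Toy lookup table: one row per (edge, hour, ten_mins_interval) combination.
df <- data.frame(
  first_lat = c(1, 2), first_lon = c(10, 20),
  sixth_lat = c(3, 4), sixth_lon = c(30, 40),
  hour = c(8, 8), ten_mins_interval = c(1, 1),
  order_cnt = c(5, 7)
)
# Toy target: row 1 has a matching neighbour in df, row 2 does not.
df2 <- data.frame(
  third_lat = c(1, 9), third_lon = c(10, 90),
  fourth_lat = c(3, 9), fourth_lon = c(30, 90),
  hour = c(8, 8), ten_mins_interval = c(1, 1)
)

# Build one string key per row from the columns that must agree.
key_df  <- paste(df$first_lat, df$first_lon, df$sixth_lat, df$sixth_lon,
                 df$hour, df$ten_mins_interval, sep = "|")
key_df2 <- paste(df2$third_lat, df2$third_lon, df2$fourth_lat, df2$fourth_lon,
                 df2$hour, df2$ten_mins_interval, sep = "|")

# match() gives, for each row of df2, the first matching row of df (or NA),
# so the whole column is filled in one vectorized step instead of 190k lookups.
df2$a <- df$order_cnt[match(key_df2, key_df)]
```

One caveat with this sketch: `paste()` converts numbers to text with limited precision, so coordinates that differ only in the last few digits could end up with the same key.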

Source

2017-06-07 hide on bush

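Can you provide a small reproducible example? Paste the output of 'dput(head(df))' and 'dput(head(df2[,18:23]))' into the question. – Jimbou

A for loop is almost always unnecessary, but you need to share some sample data and the expected result so that it is easier to understand what you need. Perhaps also simplify the problem by using fewer columns. –

You are unlikely to get a complete answer at the moment, because the question is not reproducible, i.e. there is no sample data showing the structure of df and df2. The most likely speed-up seems to be using the 'merge' function for each of the 6 blocks to avoid the 'for' loop. – Miff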
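For readers unfamiliar with it, `dput()` prints an R expression that recreates an object, which is why it is the usual way to share sample data. A minimal sketch with a made-up data frame (the values in `toy` are hypothetical):

```r
toy <- data.frame(grid_id = c("g1", "g2"), order_cnt = c(5L, 7L))

# Print a copy-pasteable representation of the first rows.
dput(head(toy))

# Round trip: capture the printed text and evaluate it to rebuild the object,
# which is what a reader does when pasting the output into their own session.
txt <- paste(capture.output(dput(head(toy))), collapse = "\n")
rebuilt <- eval(parse(text = txt))
```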

Starting from your current df2, you can add df2[,18] with a merge command:

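# x = df (lookup table), y = df2. "col3name" and "col4name" are placeholders for the
# 3rd and 4th columns of df; "col5name" is a placeholder for the 5th column of df2.
df2 <- merge(df[,c("first_lat","first_lon","sixth_lat","sixth_lon","col4name","col3name","order_cnt")], 
     df2, 
     by.x=c("first_lat","first_lon","sixth_lat","sixth_lon","col4name","col3name"), 
     by.y=c("third_lat","third_lon","fourth_lat","fourth_lon","col4name","col5name"), 
     all.y=TRUE)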

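You will need to replace col4name etc. with the actual names of the fourth column and so on; I can't make out from the screenshot what they might be. Five more versions of this command can easily be produced to add the other five columns. Because the operation works on whole vectors at once, it is likely to be faster than the loop. This is untested, since the data was not provided in a suitable format.

Source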
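One hypothetical way to produce all six versions without copy-pasting is to list the column pairings from the question's loop and apply the same merge to each. The sketch below is an illustration under assumptions, not the answerer's tested code: the helper name `add_neighbour_cnt`, the `.row` bookkeeping column and the output names `a` and `b` are made up, `hour` and `ten_mins_interval` are assumed to be the time column names, and the toy data only exercises two of the six directions.

```r
# Hypothetical helper: look up order_cnt in df for one direction and append it
# to df2 as column `out`. df_cols / df2_cols are the four lat/lon columns that
# must agree; time_cols are the time columns shared by both tables.
add_neighbour_cnt <- function(df2, df, df_cols, df2_cols, out,
                              time_cols = c("hour", "ten_mins_interval")) {
  lookup <- df[, c(df_cols, time_cols, "order_cnt")]
  # Rename so the lookup's keys carry df2's column names and the value is `out`.
  names(lookup) <- c(df2_cols, time_cols, out)
  keys <- df2[, c(df2_cols, time_cols)]
  keys$.row <- seq_len(nrow(df2))            # remember df2's row order
  m <- merge(keys, lookup, by = c(df2_cols, time_cols), all.x = TRUE)
  stopifnot(nrow(m) == nrow(df2))            # fails if df has duplicate keys
  df2[[out]] <- m[[out]][order(m$.row)]      # merge may reorder rows; undo it
  df2
}

# Toy data (made up) covering directions a and b from the question's loop.
df <- data.frame(
  first_lat = c(1, 2), first_lon = c(10, 20),
  fifth_lat = c(5, 1), fifth_lon = c(50, 10),
  sixth_lat = c(3, 6), sixth_lon = c(30, 60),
  hour = c(8, 8), ten_mins_interval = c(1, 1),
  order_cnt = c(5, 7)
)
df2 <- data.frame(
  third_lat = c(1, 9), third_lon = c(10, 90),
  fourth_lat = c(3, 9), fourth_lon = c(30, 90),
  second_lat = c(6, 9), second_lon = c(60, 90),
  hour = c(8, 8), ten_mins_interval = c(1, 1)
)

# Column pairings copied from blocks #a and #b of the loop; c to f follow suit.
specs <- list(
  a = list(df_cols  = c("first_lat", "first_lon", "sixth_lat", "sixth_lon"),
           df2_cols = c("third_lat", "third_lon", "fourth_lat", "fourth_lon")),
  b = list(df_cols  = c("fifth_lat", "fifth_lon", "sixth_lat", "sixth_lon"),
           df2_cols = c("third_lat", "third_lon", "second_lat", "second_lon"))
)
for (nm in names(specs)) {
  df2 <- add_neighbour_cnt(df2, df, specs[[nm]]$df_cols, specs[[nm]]$df2_cols,
                           out = nm)
}
```

Renaming the lookup columns before merging keeps df2's own column names and row order intact, which a plain merge with by.x/by.y does not guarantee.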

2017-06-07 15:32:37 Miff
