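Aggregate large data.frame
I have a large data frame (> 1,000,000 entries) with one column containing date/time values and one column containing numeric values. The problem is that some date/time values occur two or three times, and the corresponding numeric values need to be averaged so that each date/time value ends up with a single number.
So far, I have done the following: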
## audio_together is the data frame with two columns, $timestamp and $amplitude
## (i.e. the numeric value)
timestamp_unique <- unique(audio_together$timestamp) ## find all timestamps
audio_together3 <- rep(NA, length(timestamp_unique)) ## audio_together3 is the new vector containing the value for each timestamp
count <- 0
for (k in seq_along(timestamp_unique)) {
  temp_time <- timestamp_unique[k]
  if (k == 1) {
    ## look for timestamps only in a subset, which definitely contains the timestamp we are looking for
    temp_subset <- audio_together[1:10, ]
    temp_data_which <- which(temp_subset$timestamp == temp_time)
  } else {
    temp_subset <- audio_together[count:(count + 9), ]
    temp_data_which <- which(temp_subset$timestamp == temp_time)
  }
  if (length(temp_data_which) > 1) {
    audio_together3[k] <- mean(temp_subset$amplitude[temp_data_which], na.rm = TRUE)
  } else {
    audio_together3[k] <- temp_subset$amplitude[temp_data_which]
  }
  count <- count + length(temp_data_which)
}
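However, this process is still quite slow. Any ideas on how to speed it up significantly (i.e. to a run time in the range of a few minutes)?
UPDATE: Example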
timestamp <- c("2015-09-03 18:54:13", "2015-09-03 18:54:14", "2015-09-03 18:54:14", "2015-09-03 18:54:15", "2015-09-03 18:54:15", "2015-09-03 18:54:16", "2015-09-03 18:54:16", "2015-09-03 18:54:17", "2015-09-03 18:54:17")
amplitude <- c(200, 313, 321, 432, 111, 423, 431, 112, 421)
audio_together <- data.frame(timestamp, amplitude)
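Can you provide a small sample of your data and the expected output? Grouping like this can be done in many ways: in base R with 'tapply', 'ave' and 'aggregate'. The 'data.table' and 'dplyr' packages will most likely give you the speed you need. – nicola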
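As a rough sketch of the base R route mentioned in this comment, the following applies 'aggregate' and 'tapply' to the sample data from the update. The result names `agg` and `means` are made up for illustration, and whether this is fast enough on more than a million rows has not been measured here.

```r
# Sample data from the question's update
timestamp <- c("2015-09-03 18:54:13", "2015-09-03 18:54:14", "2015-09-03 18:54:14",
               "2015-09-03 18:54:15", "2015-09-03 18:54:15", "2015-09-03 18:54:16",
               "2015-09-03 18:54:16", "2015-09-03 18:54:17", "2015-09-03 18:54:17")
amplitude <- c(200, 313, 321, 432, 111, 423, 431, 112, 421)
audio_together <- data.frame(timestamp, amplitude)

# aggregate(): one row per distinct timestamp, amplitude averaged within each group
# (the formula interface drops rows with NA by default)
agg <- aggregate(amplitude ~ timestamp, data = audio_together, FUN = mean)

# tapply(): the same group means, returned as an array named by timestamp
means <- tapply(audio_together$amplitude, audio_together$timestamp, mean)

print(agg)
```

Both calls return one mean per distinct timestamp, sorted by timestamp, so no explicit loop over the rows is needed.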
'library(data.table); setDT(audio_together); audio_together[, .(amplitude = mean(amplitude, na.rm = TRUE)), by = timestamp]' – Roland
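Spelled out on the sample data from the update, the one-liner in this comment might look like the sketch below. It assumes the data.table package is installed; the result name `audio_mean` is made up for illustration.

```r
library(data.table)

# Sample data from the question's update
audio_together <- data.frame(
  timestamp = c("2015-09-03 18:54:13", "2015-09-03 18:54:14", "2015-09-03 18:54:14",
                "2015-09-03 18:54:15", "2015-09-03 18:54:15", "2015-09-03 18:54:16",
                "2015-09-03 18:54:16", "2015-09-03 18:54:17", "2015-09-03 18:54:17"),
  amplitude = c(200, 313, 321, 432, 111, 423, 431, 112, 421)
)

# Convert the data frame to a data.table by reference (no copy is made)
setDT(audio_together)

# Group by timestamp and average amplitude within each group;
# na.rm = TRUE ignores missing amplitude values
audio_mean <- audio_together[, .(amplitude = mean(amplitude, na.rm = TRUE)), by = timestamp]

print(audio_mean)
```

The result has one row per timestamp, in order of first appearance. The grouping is done inside data.table rather than in an R-level loop, which is where the speed-up over the original approach would be expected to come from.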
Did you check [this](http://stackoverflow.com/questions/21982987/mean-per-group-in-a-data-frame)? –