Aggregate large data.frame

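I have a large data frame (> 1,000,000 entries) with one column containing a date/time variable and one column containing a numeric value. The problem is that some of the date/time values occur twice or three times, and the corresponding numeric values need to be averaged, so that I end up with a single numeric value per date/time value.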

So far I have done the following:

## audio_together is the data frame with two columns, $timestamp and $amplitude
## (i.e. the numeric value)

timestamp_unique <- unique(audio_together$timestamp) ## find all timestamps
audio_together3 <- c(rep(NA, length(timestamp_unique))) ## audio_together3 is the new vector containing the values for each timestamp
count <- 0
for (k in 1:length(timestamp_unique)){
    temp_time <- timestamp_unique[k]
    if (k==1){
        temp_subset <- audio_together[(1:10),] ## look for timestamps only in a subset, which definitely contains the timestamp we are looking for
        temp_data_which <- which(temp_subset$timestamp == temp_time)
    } else {
        temp_subset <- audio_together[((count):(count+9)),]
        temp_data_which <- which(temp_subset$timestamp == temp_time)
    }
    if (length(temp_data_which) > 1){
        audio_together3[k] <- mean(temp_subset$amplitude[temp_data_which], na.rm = TRUE)
    } else {
        audio_together3[k] <- temp_subset$amplitude[temp_data_which]
    }
    count <- count + length(temp_data_which)
}

However, this process is still rather slow. Any ideas on how to speed it up significantly (i.e. to a run time of a few minutes)?

UPDATE: Example

timestamp <- c("2015-09-03 18:54:13", "2015-09-03 18:54:14", "2015-09-03 18:54:14", "2015-09-03 18:54:15", "2015-09-03 18:54:15", "2015-09-03 18:54:16", "2015-09-03 18:54:16", "2015-09-03 18:54:17", "2015-09-03 18:54:17") 
amplitude <- c(200, 313, 321, 432, 111, 423, 431, 112, 421) 

audio_together <- data.frame(timestamp, amplitude)

Source

2016-04-05 Christine Blume

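Could you provide a small sample of your data and the expected output? Grouping like this can be handled in many ways: 'tapply', 'ave' and 'aggregate' in base R. The 'data.table' and 'dplyr' packages will most likely provide the speed you need. – nicola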
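
For illustration, here is a minimal sketch of the three base R functions named in this comment, applied to the sample data from the question's update. The variable names `agg`, `means` and `per_row` are made up for this example, and none of this has been timed on a data frame with a million rows.

```r
# Sample data, copied from the question's update
timestamp <- c("2015-09-03 18:54:13", "2015-09-03 18:54:14", "2015-09-03 18:54:14",
               "2015-09-03 18:54:15", "2015-09-03 18:54:15", "2015-09-03 18:54:16",
               "2015-09-03 18:54:16", "2015-09-03 18:54:17", "2015-09-03 18:54:17")
amplitude <- c(200, 313, 321, 432, 111, 423, 431, 112, 421)
audio_together <- data.frame(timestamp, amplitude)

# aggregate(): returns a data frame with one row per distinct timestamp
agg <- aggregate(amplitude ~ timestamp, data = audio_together, FUN = mean)

# tapply(): returns the same means as a vector named by timestamp
means <- tapply(audio_together$amplitude, audio_together$timestamp,
                mean, na.rm = TRUE)

# ave(): keeps the original length, repeating the group mean on every row
per_row <- ave(audio_together$amplitude, audio_together$timestamp,
               FUN = function(v) mean(v, na.rm = TRUE))

print(agg)
```

Of the three, `aggregate()` is the one that directly produces the shape asked for in the question (one row per timestamp); `ave()` is more useful when the group mean has to sit next to the original rows.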

'library(data.table); setDT(audio_together); audio_together[, .(amplitude = mean(amplitude, na.rm = TRUE)), by = timestamp]' – Roland
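
Spelled out on the sample data from the question's update, the data.table approach might look like the sketch below. It assumes the data.table package is installed; `dt` and `result` are names chosen for this example, and `as.data.table()` is used instead of `setDT()` only so that the original data frame is left unmodified.

```r
library(data.table)  # assumption: data.table is installed

# Sample data, copied from the question's update
timestamp <- c("2015-09-03 18:54:13", "2015-09-03 18:54:14", "2015-09-03 18:54:14",
               "2015-09-03 18:54:15", "2015-09-03 18:54:15", "2015-09-03 18:54:16",
               "2015-09-03 18:54:16", "2015-09-03 18:54:17", "2015-09-03 18:54:17")
amplitude <- c(200, 313, 321, 432, 111, 423, 431, 112, 421)
audio_together <- data.frame(timestamp, amplitude)

# Convert to a data.table copy (setDT() would convert in place instead)
dt <- as.data.table(audio_together)

# One mean per timestamp; groups appear in order of first occurrence
result <- dt[, .(amplitude = mean(amplitude, na.rm = TRUE)), by = timestamp]

print(result)
```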

Have you checked [this](http://stackoverflow.com/questions/21982987/mean-per-group-in-a-data-frame)?

It is hard to test without a minimal reproducible example, but if your aim is to average all amplitude values that share the same timestamp, then this dplyr solution may help:

library(dplyr) 
audio_together %>% 
    group_by(timestamp) %>% 
    summarize(av_amplitude = mean(amplitude, na.rm = TRUE)) %>%
    ungroup()

Source

2016-04-05 10:26:45

Thank you for your ideas.

The following works perfectly:

library(dplyr)
audio_together <- audio_together %>% group_by(timestamp)
audio_together <- ungroup(audio_together %>% summarise(mean(amplitude, na.rm = TRUE)))

Source

2016-04-05 13:31:44
