R: Reducing the frequency of time series data by aggregating values into an OHLC series

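I have a high-frequency dataset for an exchange rate, timestamped down to the millisecond, and I would like to convert it in R into lower-frequency, regularly spaced time series data, e.g. a 1-minute or 5-minute OHLC series (open, high, low, close). The original dataset has four columns: one for the currency pair, one for the timestamp (date and time), and one each for the bid and ask prices. The data were imported from a .csv file.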

head(GBPUSD) and tail(GBPUSD) return the following:

# A tibble: 6 x 4 
     X1     X2  X3  X4 
    <chr>    <dttm> <dbl> <dbl> 
1 GBP/USD 2017-06-01 00:00:00 1.28756 1.28763 
2 GBP/USD 2017-06-01 00:00:00 1.28754 1.28760 
3 GBP/USD 2017-06-01 00:00:00 1.28754 1.28759 
4 GBP/USD 2017-06-01 00:00:00 1.28753 1.28759 
5 GBP/USD 2017-06-01 00:00:00 1.28753 1.28759 
6 GBP/USD 2017-06-01 00:00:00 1.28753 1.28759 


# A tibble: 6 x 4 
     X1     X2  X3  X4 
    <chr>    <dttm> <dbl> <dbl> 
1 GBP/USD 2017-06-30 20:59:56 1.30093 1.30300 
2 GBP/USD 2017-06-30 20:59:56 1.30121 1.30300 
3 GBP/USD 2017-06-30 20:59:56 1.30100 1.30390 
4 GBP/USD 2017-06-30 20:59:56 1.30146 1.30452 
5 GBP/USD 2017-06-30 20:59:56 1.30145 1.30447 
6 GBP/USD 2017-06-30 20:59:56 1.30145 1.30447


2017-12-18 hx1

It would be helpful if you included 'head(yourdata)' and 'tail(yourdata)'. Also, imgur.com is not working; you can use any other host. –

Thanks, please find the head below (there was not enough room for the tail). The data were imported directly from a .csv file. # A tibble: 6 x 4 X1 X2 X3 X4 1 GBP/USD 2017-06-01 00:00:00 1.28756 1.28763 2 GBP/USD 2017-06-01 00:00:00 1.28754 1.28760 3 GBP/USD 2017-06-01 00:00:00 1.28754 1.28759 4 GBP/USD 2017-06-01 00:00:00 1.28753 1.28759 5 GBP/USD 2017-06-01 00:00:00 1.28753 1.28759 6 GBP/USD 2017-06-01 00:00:00 1.28753 1.28759 – hx1

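Please edit your question accordingly: not in the comments section, but in the original question. Also, please use the "{}" code button to display your data clearly. Possible duplicate of [Rounding time to the nearest quarter hour] –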

I changed the OP's original dataset slightly, as shown below, for teaching purposes:

df <- data.frame(
X1=c("GBP/USD"), 
X2=c("2017-06-01 00:00:00", "2017-06-01 00:00:00", "2017-06-01 00:00:01", "2017-06-01 00:00:01", "2017-06-01 00:00:01", "2017-06-01 00:00:02", "2017-06-30 20:59:52", "2017-06-30 20:59:54", "2017-06-30 20:59:54", "2017-06-30 20:59:56", "2017-06-30 20:59:56", "2017-06-30 20:59:56"), 
X3=c(1.28756, 1.28754, 1.28754, 1.28753, 1.28752, 1.28757, 1.30093, 1.30121, 1.30100, 1.30146, 1.30145,1.30145), 
X4=c(1.28763, 1.28760, 1.28759, 1.28758, 1.28755, 1.28760,1.30300, 1.30300, 1.30390, 1.30452, 1.30447, 1.30447), 
stringsAsFactors=FALSE) 

df 

     X1     X2  X3  X4 
1 GBP/USD 2017-06-01 00:00:00 1.28756 1.28763 
2 GBP/USD 2017-06-01 00:00:00 1.28754 1.28760 
3 GBP/USD 2017-06-01 00:00:01 1.28754 1.28759 
4 GBP/USD 2017-06-01 00:00:01 1.28753 1.28758 
5 GBP/USD 2017-06-01 00:00:01 1.28752 1.28755 
6 GBP/USD 2017-06-01 00:00:02 1.28757 1.28760 
7 GBP/USD 2017-06-30 20:59:52 1.30093 1.30300 
8 GBP/USD 2017-06-30 20:59:54 1.30121 1.30300 
9 GBP/USD 2017-06-30 20:59:54 1.30100 1.30390 
10 GBP/USD 2017-06-30 20:59:56 1.30146 1.30452 
11 GBP/USD 2017-06-30 20:59:56 1.30145 1.30447 
12 GBP/USD 2017-06-30 20:59:56 1.30145 1.30447

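Now, in the lower-frequency data, observations that share the same timestamp are grouped together. So we have to find the index at which each unique group begins, and the index at which each group ends: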

indices <- seq_along(df[,2])[!(duplicated(df[,2]))] # 1 3 6 7 8 10; the beginnings of groups (observations) 
indices - 1 # 0 2 5 6 7 9; for finding the endings of groups 
numberoflowfreq <- length(indices) # 6: number of groupings (obs.) for Low Freq data

Writing the pattern out explicitly:

mean(df[1:((indices -1)[2]),3]) # from 1 to 2 
mean(df[indices[2]:((indices -1)[3]),3]) # from 3 to 5 
mean(df[indices[3]:((indices -1)[4]),3]) # from 6 to 6 
mean(df[indices[4]:((indices -1)[5]),3]) # from 7 to 7 
mean(df[indices[5]:((indices -1)[6]),3]) # from 8 to 9 
mean(df[indices[6]:nrow(df),3]) # from 10 to 12

Simplifying the pattern:

mean3rdColumn_1st <- mean(df[1:((indices -1)[2]),3]) # from 1 to 2 
mean3rdColumn_Between <- sapply(2:(numberoflowfreq-1), function(i) mean(df[indices[i]:((indices -1)[i+1]),3])) 
mean3rdColumn_Last <- mean(df[indices[6]:nrow(df),3]) # from 10 to 12 
# 3rd column in low frequency data:  
c(mean3rdColumn_1st, mean3rdColumn_Between, mean3rdColumn_Last)

Similarly for the 4th column:

mean4thColumn_1st <- mean(df[1:((indices -1)[2]),4]) # from 1 to 2 
mean4thColumn_Between <- sapply(2:(numberoflowfreq-1), function(i) mean(df[indices[i]:((indices -1)[i+1]),4])) 
mean4thColumn_Last <- mean(df[indices[6]:nrow(df),4]) # from 10 to 12 
# 4th column in low frequency data: 
c(mean4thColumn_1st, mean4thColumn_Between, mean4thColumn_Last)

Now gathering all the pieces together:

LowFrqData <- data.frame(X1=c("GBP/USD"), X2=df[indices,2], X3=c(mean3rdColumn_1st, mean3rdColumn_Between, mean3rdColumn_Last), x4=c(mean4thColumn_1st, mean4thColumn_Between, mean4thColumn_Last), stringsAsFactors=FALSE) 
LowFrqData 

     X1     X2  X3  x4 
1 GBP/USD 2017-06-01 00:00:00 1.287550 1.287615 
2 GBP/USD 2017-06-01 00:00:01 1.287530 1.287573 
3 GBP/USD 2017-06-01 00:00:02 1.287570 1.287600 
4 GBP/USD 2017-06-30 20:59:52 1.300930 1.303000 
5 GBP/USD 2017-06-30 20:59:54 1.301105 1.303450 
6 GBP/USD 2017-06-30 20:59:56 1.301453 1.304487

Now column X2 holds unique timestamps (one per second in this example), and X3 and X4 are the means of the corresponding rows.
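The index bookkeeping above can also be expressed with a single grouped aggregation. The following is only a minimal sketch on a shortened copy of the toy data; the name `per_second` is introduced here for illustration and is not part of the original answer.

```r
# Shortened toy data mirroring the modified dataset above (illustrative values).
df <- data.frame(
  X1 = "GBP/USD",
  X2 = c("2017-06-01 00:00:00", "2017-06-01 00:00:00",
         "2017-06-01 00:00:01", "2017-06-01 00:00:02"),
  X3 = c(1.28756, 1.28754, 1.28754, 1.28757),
  X4 = c(1.28763, 1.28760, 1.28759, 1.28760),
  stringsAsFactors = FALSE
)

# One row per distinct (X1, X2) pair; X3 and X4 become the group means.
per_second <- aggregate(cbind(X3, X4) ~ X1 + X2, data = df, FUN = mean)
per_second
```

Because `aggregate()` groups by value rather than by position, this version does not rely on rows with equal timestamps being adjacent.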

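Also note: not every minute within a given range will necessarily have a value. For such cases you can insert NA. On the other hand, one might choose to ignore the effect of the irregularity here, because the spacing between observations may be the same for many of them, so the series is not highly irregular. Also bear in mind that converting the data to equally spaced observations by linear interpolation can introduce a number of significant and hard-to-quantify biases (see Scholes and Williams).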

M. Scholes and J. Williams, "Estimating betas from nonsynchronous data", Journal of Financial Economics 5: 309–327, 1977.
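As a rough illustration of the "insert NA" option, one possibility is to build the full regular grid and left-join the observations onto it. This is a sketch with made-up prices; the names `obs`, `grid` and `regular`, and the choice of UTC, are assumptions made for the example.

```r
# Observed per-minute values with a gap (00:02 is missing); illustrative data.
obs <- data.frame(
  time  = as.POSIXct(c("2017-06-01 00:00:00", "2017-06-01 00:01:00",
                       "2017-06-01 00:03:00"), tz = "UTC"),
  price = c(1.2875, 1.2876, 1.2878)
)

# Complete one-minute grid spanning the observed range.
grid <- data.frame(time = seq(min(obs$time), max(obs$time), by = "1 min"))

# Left join: minutes without an observation get NA in `price`.
regular <- merge(grid, obs, by = "time", all.x = TRUE)
regular
```

Leaving the gaps as NA keeps the choice of how to treat them (drop, carry forward, interpolate) explicit, which matters given the interpolation bias mentioned above.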

Now, the regular 5-minute series part:

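as.numeric(as.POSIXct("1970-01-01 03:00:00")) # 0 in the author's time zone (UTC+3); "1970-01-01 03:01:00" gives 60 
as.numeric(as.POSIXct("2017-06-01 00:00:00")) # 1496264400 in UTC+3; the value depends on the session time zone 
# Seconds elapsed since the first observation in the dataset 
# (taken from the data rather than hard-coded, so it works in any time zone) 
PassedSecs <- as.numeric(as.POSIXct(LowFrqData$X2)) - as.numeric(as.POSIXct(LowFrqData$X2[1])) 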

LowFrq5minuteRaw <- cbind(LowFrqData, PassedSecs, stringsAsFactors=FALSE) 
LowFrq5minuteRaw 

     X1     X2  X3  x4 PassedSecs 
1 GBP/USD 2017-06-01 00:00:00 1.287550 1.287615   0 
2 GBP/USD 2017-06-01 00:00:01 1.287530 1.287573   1 
3 GBP/USD 2017-06-01 00:00:02 1.287570 1.287600   2 
4 GBP/USD 2017-06-30 20:59:52 1.300930 1.303000 2581192 
5 GBP/USD 2017-06-30 20:59:54 1.301105 1.303450 2581194 
6 GBP/USD 2017-06-30 20:59:56 1.301453 1.304487 2581196

5 minutes means 5 * 60 = 300 seconds. So "having the same quotient when divided by 300" groups the observations into 5-minute intervals.

LowFrq5minuteRaw2 <- cbind(LowFrqData, PassedSecs, QbyDto300 = PassedSecs%/%300, stringsAsFactors=FALSE) 
LowFrq5minuteRaw2 

     X1     X2  X3  x4 PassedSecs QbyDto300 
1 GBP/USD 2017-06-01 00:00:00 1.287550 1.287615   0   0 
2 GBP/USD 2017-06-01 00:00:01 1.287530 1.287573   1   0 
3 GBP/USD 2017-06-01 00:00:02 1.287570 1.287600   2   0 
4 GBP/USD 2017-06-30 20:59:52 1.300930 1.303000 2581192  8603 
5 GBP/USD 2017-06-30 20:59:54 1.301105 1.303450 2581194  8603 
6 GBP/USD 2017-06-30 20:59:56 1.301453 1.304487 2581196  8603 

indices2 <- seq_along(LowFrq5minuteRaw2[,6])[!(duplicated(LowFrq5minuteRaw2[,6]))] # 1 4; the beginnings of groups 

LowFrq5minute <- data.frame(X1=c("GBP/USD"), X2=LowFrq5minuteRaw2[indices2,2], X3=aggregate(LowFrqData[,3] ~ QbyDto300, LowFrq5minuteRaw2, mean)[,2], X4=aggregate(LowFrqData[,4] ~ QbyDto300, LowFrq5minuteRaw2, mean)[,2]) 
LowFrq5minute 

     X1     X2  X3  X4 
1 GBP/USD 2017-06-01 00:00:00 1.287550 1.287596 
2 GBP/USD 2017-06-30 20:59:52 1.301163 1.303646

X2 holds the timestamp of the first observation that falls within each 5-minute interval.
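The answer above aggregates with means, while the question asks for open, high, low and close. The same quotient-based bucket key can feed `tapply()` to obtain those four values. The following is a sketch on invented bid prices; `ticks`, `bucket` and `ohlc` are names introduced here, and UTC is assumed so the result is reproducible.

```r
# Illustrative bid prices (UTC assumed).
ticks <- data.frame(
  time = as.POSIXct(c("2017-06-01 00:00:00", "2017-06-01 00:00:01",
                      "2017-06-01 00:04:59", "2017-06-01 00:05:00",
                      "2017-06-01 00:09:30"), tz = "UTC"),
  bid  = c(1.2875, 1.2879, 1.2871, 1.2880, 1.2877)
)
ticks <- ticks[order(ticks$time), ]   # open/close depend on row order

# Bucket key: number of whole 300-second blocks since the first observation.
bucket <- (as.numeric(ticks$time) - as.numeric(ticks$time[1])) %/% 300

ohlc <- data.frame(
  start = ticks$time[!duplicated(bucket)],  # first timestamp seen in each bucket
  open  = as.numeric(tapply(ticks$bid, bucket, function(x) x[1])),
  high  = as.numeric(tapply(ticks$bid, bucket, max)),
  low   = as.numeric(tapply(ticks$bid, bucket, min)),
  close = as.numeric(tapply(ticks$bid, bucket, function(x) x[length(x)]))
)
ohlc
```

The same four `tapply()` calls can be repeated for the ask column if both sides of the quote are needed.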


2017-12-18 21:01:25

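I think all of this would be easier with the aggregate function. Depending on the data, though, you may need to convert the datetime column to character (in case the original data retain millisecond values). If needed, I suggest using lubridate to convert them back to datetime.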

GBPUSD$X2 <- as.character(GBPUSD$X2) # optional; use if the calls below give bad results 
GBPUSD$X2 <- substr(GBPUSD$X2, 1, 19) # optional; keeps "YYYY-MM-DD HH:MM:SS", i.e. drops fractional seconds 
# get High values for both bid and ask prices: 
GBPUSD_H <- aggregate(cbind(X3, X4)~X1+X2, data=GBPUSD, FUN=max) 
# get Low values for both bid and ask prices: 
GBPUSD_L <- aggregate(cbind(X3, X4)~X1+X2, data=GBPUSD, FUN=min) 
# merge the High and Low values together (base merge() works on data frames) 
GBPUSD_NEW <- merge(GBPUSD_H, GBPUSD_L, by=c("X1", "X2"), suffixes=c(".HIGH", ".LOW"))
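Converting the character key back to a date-time, as suggested above, might look like the following. This is a sketch only: the `key` vector is invented, and the UTC time zone is an assumption that should be replaced by the zone of the actual data feed.

```r
library(lubridate)

# Character keys of the form produced by substr(..., 1, 19); illustrative values.
key <- c("2017-06-01 00:00:00", "2017-06-30 20:59:56")

# Parse back to POSIXct; tz = "UTC" is an assumption for this example.
key_dt <- ymd_hms(key, tz = "UTC")
class(key_dt)
```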

To get all the high, low, open and close values in one go:

library(data.table) 
GBPUSD <- data.table(GBPUSD, key=c("X1", "X2")) # setting the key also sorts the rows by X1, X2 
GBPUSD_NEW <- GBPUSD[, list(X3.HIGH=max(X3), X3.LOW=min(X3), X3.OPEN=X3[1], 
          X3.CLOSE=X3[length(X3)], X4.HIGH=max(X4), X4.LOW=min(X4), 
          X4.OPEN=X4[1], X4.CLOSE=X4[length(X4)]), by=c("X1", "X2")]

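For this to work, however, the data must first be sorted so that, within each second, the first value is the open and the last value is the close.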
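To illustrate why the ordering matters, here is a small sketch with two ticks stored out of order within the same second. The column `tick_id` is hypothetical: it stands for whatever preserves the true arrival order (the original row number, or the millisecond part of the timestamp before it is truncated).

```r
# Two ticks in the same second, deliberately stored out of order.
x <- data.frame(
  X2      = c("2017-06-01 00:00:00", "2017-06-01 00:00:00", "2017-06-01 00:00:01"),
  tick_id = c(2, 1, 1),   # hypothetical arrival order within each second
  X3      = c(1.28754, 1.28756, 1.28753)
)

x <- x[order(x$X2, x$tick_id), ]   # sort by time, then by arrival order

open_px  <- tapply(x$X3, x$X2, function(v) v[1])
close_px <- tapply(x$X3, x$X2, function(v) v[length(v)])
```

If the timestamps are truncated to whole seconds before sorting, the within-second order can no longer be recovered, so any sorting on the full-resolution timestamp has to happen first.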

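Now, if you need minutes (or hours) instead of seconds, just adjust the substr call accordingly. If you want something more customized, such as 15-minute intervals, I would suggest adding a helper column. Example code: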

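# start minute of each 15-minute interval as two digits: "00", "15", "30" or "45" (so 00:00-00:14 maps to 00:00) 
GBPUSD$MIN <- sprintf("%02d", as.integer(as.numeric(substr(GBPUSD$X2, 15, 16)) %/% 15 * 15)) 
GBPUSD$X2 <- paste0(substr(GBPUSD$X2, 1, 14), GBPUSD$MIN, ":00")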
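An alternative to string slicing is to floor the timestamps numerically, which generalizes to any interval length. The helper `floor_minutes()` below is hypothetical (not from any package), and it assumes the timestamps are interpreted in UTC.

```r
# Hypothetical helper: floor timestamps to the start of their n-minute interval.
floor_minutes <- function(x, n = 15, tz = "UTC") {
  secs <- as.numeric(as.POSIXct(x, tz = tz))          # seconds since the epoch
  as.POSIXct(floor(secs / (n * 60)) * n * 60,         # round down to a multiple of n minutes
             origin = "1970-01-01", tz = tz)
}

stamps  <- c("2017-06-01 00:07:12", "2017-06-01 00:15:00", "2017-06-01 00:44:59")
floored <- format(floor_minutes(stamps), "%Y-%m-%d %H:%M:%S")
floored
```

The formatted result can be used directly as the grouping key in `aggregate()` in place of the truncated X2 column.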

Please do not hesitate to ask if this does not meet your requirements.

P.S.: NAs cause problems in aggregate if the key columns contain them. Handle them first.

GBPUSD$X2[is.na(GBPUSD$X2)] <- "2017-05-05 00:00:00" # example; be careful to use the same class and format for the replacement
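If a placeholder timestamp is not appropriate, another option is simply to drop the rows whose key is missing before aggregating. This is a sketch on invented data; whether dropping is acceptable depends on why the timestamps are missing.

```r
# Illustrative data with a missing timestamp in the key column.
x <- data.frame(
  X1 = "GBP/USD",
  X2 = c("2017-06-01 00:00:00", NA, "2017-06-01 00:00:00"),
  X3 = c(1.28756, 1.28754, 1.28750),
  stringsAsFactors = FALSE
)

clean <- x[!is.na(x$X2), ]   # keep only rows whose key is present
highs <- aggregate(X3 ~ X1 + X2, data = clean, FUN = max)
highs
```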


2017-12-19 08:50:08 Arani

This is a perfect example of when you might want to try the awesome tibbletime package. I will generate my own data to make the point.

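library(tibbletime) 
library(dplyr) # provides %>% and mutate() 
# unquoted date ~ date formula, as accepted by the tibbletime release current when this was written 
df <- tibbletime::create_series(2017-12-20 + 01:06:00 ~ 2017-12-20 + 01:20:00, "sec") %>% 
     mutate(open=runif(nrow(.)), 
       close=runif(nrow(.))) 
df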

This gives roughly 15 minutes of data at one-second resolution:

# A time tibble: 841 x 3 
# Index: date 
        date  open  close 
*    <dttm>  <dbl>  <dbl> 
1 2017-12-20 01:06:00 0.63328803 0.357378011 
2 2017-12-20 01:06:01 0.09597444 0.150583962 
3 2017-12-20 01:06:02 0.23601820 0.974341599 
4 2017-12-20 01:06:03 0.71832656 0.092265867 
5 2017-12-20 01:06:04 0.32471587 0.391190310 
6 2017-12-20 01:06:05 0.76378711 0.534765217 
7 2017-12-20 01:06:06 0.92463265 0.694693458 
8 2017-12-20 01:06:07 0.74026638 0.006054806 
9 2017-12-20 01:06:08 0.77064030 0.911641146 
10 2017-12-20 01:06:09 0.87130949 0.740816479 
# ... with 831 more rows

Changing the periodicity of the second-resolution data is as easy as a single command:

as_period(df, 5~M)

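This aggregates the data into 5-minute intervals (by default tibbletime picks the first observation of each period; it does not average or sum):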

# A time tibble: 3 x 3 
# Index: date 
       date  open  close 
*    <dttm>  <dbl>  <dbl> 
1 2017-12-20 01:06:00 0.6332880 0.3573780 
2 2017-12-20 01:11:00 0.9235639 0.7043025 
3 2017-12-20 01:16:00 0.6955685 0.1641798
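To make the difference between "first observation per period" and an average concrete, here is a base R sketch that does not use tibbletime at all. The series `s` is deterministic, made-up data, the bucket index is counted from the first observation, and UTC is assumed.

```r
# Deterministic second-resolution series covering 10 minutes (UTC assumed).
t0 <- as.POSIXct("2017-12-20 01:06:00", tz = "UTC")
s  <- data.frame(date = t0 + 0:599, open = (1:600) / 1000)

# 5-minute bucket index counted from the first observation.
s$bucket <- (as.numeric(s$date) - as.numeric(t0)) %/% 300

first_obs <- s[!duplicated(s$bucket), ]                    # first row of each period
mean_obs  <- aggregate(open ~ bucket, data = s, FUN = mean) # average over each period
```

Here `first_obs` keeps one actually observed value per period, while `mean_obs` yields values that may never have been quoted.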

Check out this awesome vignette for more details.


2017-12-20 00:16:17 dmi3kno

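If, as you say, "'tibbletime' picks the first observation of each period by default, without averaging or summing", does that mean 'tibbletime' loses the information in all observations other than the first? In my view, all the information in the dataset should be used. –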

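Check the package documentation. There are 'time_summarize' and 'time_collapse'. Aggregating a time series does not always make sense. Imagine you are simply reducing the number of measurements. An average would never match a real-life value and could be skewed by outliers. – dmi3kno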

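It seems you want to turn each column (bid, ask) into 4 columns (open, high, low, close), grouped by a time interval such as 5 minutes. I appreciate @dmi3kno showing off a few tibbletime features, but I think this may do more of what you want.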

~~Note that this will change a bit in the next version of tibbletime, but it currently works under 0.0.2.~~

For each 5-minute period, the open/high/low/close of both the bid and ask columns are taken.

library(tibbletime) 
library(dplyr) 

df <- create_series("2017-12-20 00:00:00" ~ "2017-12-20 01:00:00", "sec") %>% 
    mutate(bid = runif(nrow(.)), 
     ask = bid + .0001) 
df 
#> # A time tibble: 3,601 x 3 
#> # Index: date 
#> date     bid ask 
#> * <dttm>    <dbl> <dbl> 
#> 1 2017-12-20 00:00:00 0.208 0.208 
#> 2 2017-12-20 00:00:01 0.0629 0.0630 
#> 3 2017-12-20 00:00:02 0.505 0.505 
#> 4 2017-12-20 00:00:03 0.0841 0.0842 
#> 5 2017-12-20 00:00:04 0.986 0.987 
#> 6 2017-12-20 00:00:05 0.225 0.225 
#> 7 2017-12-20 00:00:06 0.536 0.536 
#> 8 2017-12-20 00:00:07 0.767 0.767 
#> 9 2017-12-20 00:00:08 0.994 0.994 
#> 10 2017-12-20 00:00:09 0.807 0.808 
#> # ... with 3,591 more rows 

df %>% 
    mutate(date = collapse_index(date, "5 min")) %>% 
    group_by(date) %>% 
    summarise_all(
    .funs = funs(
     open = dplyr::first(.), 
     high = max(.), 
     low = min(.), 
     close = dplyr::last(.) 
    ) 
) 
#> # A time tibble: 13 x 9 
#> # Index: date 
#> date    bid_o… ask_o… bid_h… ask_h… bid_low ask_low bid_c… 
#> * <dttm>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 
#> 1 2017-12-20 00:04:59 0.208 0.208 1.000 1.000 0.00293 3.03e⁻³ 0.389 
#> 2 2017-12-20 00:09:59 0.772 0.772 0.997 0.997 0.000115 2.15e⁻⁴ 0.676 
#> 3 2017-12-20 00:14:59 0.457 0.457 0.995 0.996 0.00522 5.32e⁻³ 0.363 
#> 4 2017-12-20 00:19:59 0.586 0.586 0.997 0.997 0.00912 9.22e⁻³ 0.0339 
#> 5 2017-12-20 00:24:59 0.385 0.385 0.998 0.998 0.0131 1.32e⁻² 0.0907 
#> 6 2017-12-20 00:29:59 0.548 0.548 0.996 0.996 0.00126 1.36e⁻³ 0.320 
#> 7 2017-12-20 00:34:59 0.240 0.240 0.995 0.995 0.00466 4.76e⁻³ 0.153 
#> 8 2017-12-20 00:39:59 0.404 0.405 0.999 0.999 0.000481 5.81e⁻⁴ 0.709 
#> 9 2017-12-20 00:44:59 0.468 0.468 0.999 0.999 0.00101 1.11e⁻³ 0.0716 
#> 10 2017-12-20 00:49:59 0.580 0.580 0.996 0.996 0.000336 4.36e⁻⁴ 0.395 
#> 11 2017-12-20 00:54:59 0.242 0.242 0.999 0.999 0.00111 1.21e⁻³ 0.762 
#> 12 2017-12-20 00:59:59 0.474 0.474 0.987 0.987 0.000858 9.58e⁻⁴ 0.335 
#> 13 2017-12-20 01:00:00 0.974 0.974 0.974 0.974 0.974 9.74e⁻¹ 0.974 
#> # ... with 1 more variable: ask_close <dbl>

Update: this post has been updated to reflect the changes in tibbletime 0.1.0.
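For readers without tibbletime, a similar open/high/low/close summary can be sketched with dplyr alone. This is not the answer's method: it assumes dplyr 1.0 or later (for `across()`), uses deterministic made-up quotes in UTC, and labels each interval by its start time, whereas `collapse_index()` in the output above labels it by its end.

```r
library(dplyr)

# Deterministic illustrative data: one quote per second for 10 minutes (UTC assumed).
t0 <- as.POSIXct("2017-12-20 00:00:00", tz = "UTC")
quotes <- tibble(date = t0 + 0:599,
                 bid  = (1:600) / 1000,
                 ask  = bid + 0.0001)

ohlc <- quotes %>%
  # Label each row with the start of its 5-minute interval.
  mutate(date = as.POSIXct(floor(as.numeric(date) / 300) * 300,
                           origin = "1970-01-01", tz = "UTC")) %>%
  group_by(date) %>%
  summarise(across(c(bid, ask),
                   list(open = first, high = max, low = min, close = last)),
            .groups = "drop")
ohlc
```

The resulting columns follow the same `<column>_<statistic>` naming pattern as the output above, e.g. `bid_open` and `ask_close`.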


2017-12-20 20:12:53

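Thanks, Davis. I did not see predicate functions in tibbletime, so I assumed the class would be dropped. Agreed, this goes an extra step toward the desired result. – dmi3kno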

@dmi3kno, predicate functions like 'summarise_all()' use 'summarise()' under the hood, so the class is not dropped! –
