The tidyverse has gained quite a lot of popularity lately. It provides an opinionated ecosystem, mainly for exploratory data analysis. This coherent and unified way to work with data in R has spawned a few projects that base their philosophy on the same principles. I am thinking here of tidyquant, corrr, tidytext, broom, etc. There are many others, but these are the ones I have used most recently.

While dealing with financial data and/or functions in R, one usually turns to the xts and zoo libraries. These are phenomenal at what they do, and one can see that they have been developed by financial professionals. I am not one of them yet. They have also generated a whole ecosystem of other libraries, such as quantstrat, quantmod, TTR, PerformanceAnalytics, etc. That said, coming from the tibble environment, I did not find the benefit of dealing with xts instead of tibble worth its learning curve (and yes, maybe I have not done enough work in finance to fully appreciate it) and/or its limitations and/or its error messages (thinking here about failed backtests in quantstrat) and/or its smaller user base (try posting a quantstrat question on SO).

As I started building Shiny apps and doing some basic financial analysis, I wondered whether the tidyverse was actually efficient at these types of tasks, or whether the xts format really was the panacea.

So here are the results of some microbenchmarks of basic financial functions.

Let's first load our libraries and a data frame of prices.

library(microbenchmark)
library(dplyr)

# Read the SPY price series and keep only the date and adjusted close
df <- readr::read_csv("../R Trading Report/ETF/ETF_Data/SPY.csv")
df2 <- df %>% select(Index, Adjusted)
df2$Index <- lubridate::ymd(df2$Index)
df2 <- tibble::as_tibble(df2[, 1:2])

# An xts version of the same series, for comparison
df3 <- df2 %>% select(Adjusted) %>% xts::xts(order.by = df2$Index)

Calculate Rate of Change, aka ROC

The first operation is probably the most common one: the percentage return, or rate of change. There are quite a few ways in R to calculate a rate of change.
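For reference, the n-period discrete ROC is (P_t - P_{t-n}) / P_{t-n}, which is algebraically the same as P_t / P_{t-n} - 1. A minimal sketch on a made-up price vector (using dplyr's lag()) showing the two forms agree:

# Toy prices, purely illustrative
prices <- c(100, 102, 101, 105, 107, 110)

# Two algebraically equivalent forms of the 1-period discrete ROC
roc_a <- (prices - lag(prices, 1)) / lag(prices, 1)
roc_b <- prices / lag(prices, 1) - 1
all.equal(roc_a, roc_b)  # TRUE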

Tidy ways ….

Let's first try the tidyverse version. We calculate a rolling weekly (5-day) ROC.

# 5-day discrete ROC using two calls to lag()
roc_func_tidy <- function(df) {
  df$ROC <- (df$Adjusted - lag(df$Adjusted, 5)) / lag(df$Adjusted, 5)
}

res5days_tidy <- microbenchmark(roc_func_tidy(df2), times = 1000)

res5days_tidy
## Unit: microseconds
##                expr     min       lq     mean   median      uq      max
##  roc_func_tidy(df2) 213.065 216.4985 371.4284 222.0365 266.298 48318.45
##  neval
##   1000

Let's try a slightly different way to calculate the ROC.

# Same 5-day ROC, with a single call to lag()
roc_func_tidy_b <- function(df) {
  df$ROC <- (df$Adjusted / lag(df$Adjusted, 5)) - 1
}

res5days_tidy_b <- microbenchmark(roc_func_tidy_b(df2), times = 1000)

res5days_tidy_b
## Unit: microseconds
##                  expr    min       lq     mean   median      uq      max
##  roc_func_tidy_b(df2) 134.38 137.4175 195.6482 140.0305 167.563 2433.308
##  neval
##   1000

That’s much better already.

Let's now try with a 3-month (63 trading days) rolling period.

roc_func_tidy <- function(df) {
  df$ROC <- (df$Adjusted - lag(df$Adjusted, 63)) / lag(df$Adjusted, 63)
}

res3months_tidy <- microbenchmark(roc_func_tidy(df2), times = 1000)

res3months_tidy
## Unit: microseconds
##                expr     min       lq     mean median      uq     max neval
##  roc_func_tidy(df2) 212.428 216.0545 314.3081 225.66 263.523 2577.11  1000

And now with the other way to calculate the ROC.

roc_func_tidy_b <- function(df) {
  df$ROC <- (df$Adjusted / lag(df$Adjusted, 63)) - 1
}

res3months_tidy_b <- microbenchmark(roc_func_tidy_b(df2), times = 1000)

res3months_tidy_b
## Unit: microseconds
##                  expr     min       lq     mean  median       uq      max
##  roc_func_tidy_b(df2) 134.137 137.5195 198.9563 149.516 170.6625 2180.334
##  neval
##   1000

Two observations here:

  • making one call to the lag() function is faster than making two calls to it … duh!
  • there isn't much difference between a 5-day lookback and a 63-day lookback.
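Before moving on, a quick sanity check (a sketch; expected to return TRUE) that the single-lag formulation matches what TTR::ROC() produces with type = "discrete":

roc_tidy <- df2$Adjusted / lag(df2$Adjusted, 5) - 1
roc_ttr  <- TTR::ROC(df2$Adjusted, n = 5, type = "discrete")
all.equal(roc_tidy, roc_ttr, check.attributes = FALSE)  # expected TRUE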

The TTR way

Let's now try with the TTR package.

roc_func_TTR <- function(df){
  df$ROC <- TTR::ROC(df$Adjusted, n=5, type = "discrete")
}

res5days_TTR <- microbenchmark(roc_func_TTR(df2), times = 1000)

res5days_TTR
## Unit: microseconds
##               expr     min      lq     mean   median      uq      max
##  roc_func_TTR(df2) 540.879 551.751 756.7698 581.1285 688.155 49931.62
##  neval
##   1000
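A side note: TTR::ROC() defaults to type = "continuous", i.e. log returns; we pass type = "discrete" above so the output matches the tidy version. A minimal illustration of the difference:

x <- c(100, 110)
TTR::ROC(x, n = 1, type = "discrete")    # NA, then 0.1 (simple return)
TTR::ROC(x, n = 1, type = "continuous")  # NA, then log(110/100) ~ 0.0953 (log return)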

Does it matter that we used a tibble instead of an xts object?

roc_func_TTR <- function(df){
  df$ROC <- TTR::ROC(df$Adjusted, n=5, type = "discrete")
}

res5days_xts_TTR <- microbenchmark(roc_func_TTR(df3), times = 1000)

res5days_xts_TTR
## Unit: microseconds
##               expr     min       lq     mean  median       uq      max
##  roc_func_TTR(df3) 562.352 581.4035 1048.987 673.917 829.3075 50833.08
##  neval
##   1000

Oh yes, it does seem to matter actually.
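If your data already lives in an xts object, one possible workaround (a sketch, not benchmarked here) is to hand TTR the underlying numeric vector via zoo::coredata() and attach the result back afterwards:

# Run ROC on the plain numeric core of the xts object
adj_core <- as.numeric(zoo::coredata(df3$Adjusted))
df3$ROC  <- TTR::ROC(adj_core, n = 5, type = "discrete")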

And on a 3-month ROC.

roc_func_TTR <- function(df){
  df$ROC <- TTR::ROC(df$Adjusted, n=63, type = "discrete")
}

res3months_TTR <- microbenchmark(roc_func_TTR(df3), times = 1000)

res3months_TTR
## Unit: microseconds
##               expr     min      lq    mean   median       uq     max neval
##  roc_func_TTR(df3) 559.134 640.258 1338.63 808.0975 1162.133 54106.3  1000

Conclusion

  • TTR takes more time to process an xts object than a data frame / tibble. That was a surprise to me.
  • TTR takes more time than dplyr and its lag() function to calculate a rolling ROC.
  • It's faster to call lag() once instead of twice.

Calculate Simple Moving Average

Here we will calculate a 50-day simple moving average. Rolling window calculations are not yet a strength of the tidyverse … too bad ;-( I have seen an issue raised on the dplyr GitHub repo about this. So I'm hoping this post might help move the idea of rolling functions a few boxes up Hadley Wickham's long to-do list :-)
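In the meantime, if you want to stay in base R, a trailing moving average fits in a one-liner with stats::filter() (a sketch; note the explicit namespace, since dplyr masks filter()):

# 50-day trailing (right-aligned) moving average via a convolution filter
sma_base <- function(x, n = 50) {
  as.numeric(stats::filter(x, rep(1 / n, n), sides = 1))
}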

After trying some of the popular packages for the job, we'll bring up two libraries found on GitHub that do exactly this but were unknown to me until today.

The TTR package

Let's first start with the TTR package on a tibble, then on an xts object.

# 50-day simple moving average (SMA() takes no type argument)
sma_func_TTR <- function(df){
  df$SMA <- TTR::SMA(df$Adjusted, n = 50)
}

res50days_TTR <- microbenchmark(sma_func_TTR(df2), times = 1000)

res50days_TTR
## Unit: microseconds
##               expr     min       lq     mean  median      uq      max
##  sma_func_TTR(df2) 623.868 644.9085 910.6772 765.743 922.963 3561.169
##  neval
##   1000

What about using the xts object instead of the tibble?

sma_func_TTR <- function(df){
  df$SMA <- TTR::SMA(df$Adjusted, n = 50)
}

res50days_xts_TTR <- microbenchmark(sma_func_TTR(df3), times = 1000)

res50days_xts_TTR
## Unit: microseconds
##               expr     min       lq    mean   median       uq      max
##  sma_func_TTR(df3) 900.001 1032.233 2127.51 1283.178 2206.868 96932.09
##  neval
##   1000

Nope… xts is really not helping here.

What about the runMean function?

sma_func_TTR <- function(df){
  df$SMA <- TTR::runMean(df$Adjusted, n=50)
}

res50days_TTR <- microbenchmark(sma_func_TTR(df2), times = 1000)

res50days_TTR
## Unit: microseconds
##               expr     min       lq     mean  median      uq     max neval
##  sma_func_TTR(df2) 615.762 721.2705 1180.353 842.153 1091.29 16950.1  1000

Nope… almost the same as SMA. That makes sense: TTR's SMA() is essentially a thin wrapper around runMean().
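A quick check of that claim (a sketch, expected to return TRUE):

all.equal(TTR::SMA(df2$Adjusted, n = 50),
          TTR::runMean(df2$Adjusted, n = 50))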

The zoo package

sma_func_zoo <- function(df){
  df$SMA <- zoo::rollmean(df$Adjusted, k=50, align="right", fill = NA)
}

res50days_zoo <- microbenchmark(sma_func_zoo(df2), times = 1000)

res50days_zoo
## Unit: milliseconds
##               expr      min       lq    mean   median       uq      max
##  sma_func_zoo(df2) 1.443555 1.606914 2.94651 1.995526 3.265406 68.61317
##  neval
##   1000

Not better …

The RcppRoll package

Could we use the RcppRoll package to do this? Yes, we can! (Miss ya Barack!)

sma_func_rcpproll <- function(df){
  df$SMA <- RcppRoll::roll_mean(df$Adjusted, n = 50, align = "right", fill = NA)
}

res50days_df2_Rcpp <- microbenchmark(sma_func_rcpproll(df2), times = 1000)

res50days_df2_Rcpp
## Unit: microseconds
##                    expr    min       lq     mean  median      uq      max
##  sma_func_rcpproll(df2) 248.52 252.4025 308.4578 275.038 311.062 8350.996
##  neval
##   1000

Woohaaa … that is fast! Thank you, Kevin Ushey!

The RollingWindow package

We found this one on GitHub, a serendipitous encounter of sorts …

#devtools::install_github("andrewuhl/RollingWindow")
sma_func_RollingWin <- function(df){
  df$SMA <- RollingWindow::RollingMean(df$Adjusted, window = 50)
}

res50days_df_RollingWin <- microbenchmark(sma_func_RollingWin(df2), times = 1000)

res50days_df_RollingWin
## Unit: microseconds
##                      expr     min       lq     mean  median       uq
##  sma_func_RollingWin(df2) 166.768 172.3195 265.9027 181.334 206.9205
##       max neval
##  11653.92  1000

Oooh my!! That even beats the RcppRoll library. Whoever you are, andrewuhl, that is some awesome performance!!

The roll package

You've got to love the R community and its environment … such an abundance of packages and ways to do things! Here's to another package, also found by chance on GitHub. Let's roll!

#devtools::install_github("jjf234/roll")
sma_func_roll <- function(df){
  df$SMA <- roll::roll_mean(as.matrix(df$Adjusted), width = 50)
}

res50days_df_roll <- microbenchmark(sma_func_roll(df2), times = 1000)

res50days_df_roll
## Unit: microseconds
##                expr     min       lq   mean  median       uq      max
##  sma_func_roll(df2) 474.612 530.6125 662.91 558.677 658.9765 35487.09
##  neval
##   1000

Conclusion

Thanks to those of you who have leveraged the Rcpp library. It works! RollingWindow has my vote, as it is slightly faster and I can just go with the default arguments (no need to add the align and fill arguments).
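Before settling on one package, it doesn't hurt to verify that the implementations agree numerically. A sketch (ignoring attribute differences; RollingWindow's treatment of the first 49 observations is left for you to check):

sma_ttr  <- TTR::SMA(df2$Adjusted, n = 50)
sma_rcpp <- RcppRoll::roll_mean(df2$Adjusted, n = 50, align = "right", fill = NA)
sma_zoo  <- zoo::rollmean(df2$Adjusted, k = 50, align = "right", fill = NA)

all.equal(sma_ttr, sma_rcpp, check.attributes = FALSE)  # expected TRUE
all.equal(sma_ttr, sma_zoo, check.attributes = FALSE)   # expected TRUE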

Rollin’, rollin’, rollin’ on the river

All right, next time we'll find ways to benchmark some of the usual technical indicators from the TTR package. If there are other ways to calculate rolling window functions in R that I have not thought of, please let me know.

Should I mention that I am on a mid-2012 MacBook Pro, 4 GB RAM, 2.5 GHz Intel Core i5?