Purpose

This markdown provides the complete R code for the solution proposed by the Sisifo's page team to the MineThatData Forecasting Challenge. The challenge is to forecast the next 3 months of a target time series containing 80 months of sales data from an unknown company. The proposed solution uses Bayesian Structural Time Series (BSTS) models, through the excellent open-source bsts library by Steve Scott.

This solution was published in this MineThatData blog post on November 15th, 2017, and it obtained the best squared error of all the solutions presented! (see here).

If you are interested in the theoretical justification of the solution, some background on BSTS models, and the way they are used here (in particular, how to set up Bayesian priors and how to do cross-validation for time series, i.e. forward chaining), you can check the PDF whitepaper that accompanied the proposed solution.

Loading the data

The data can be read directly from the Excel file using the code below. Note that in the original file a few cells are duplicated (the ones corresponding to month 41); these duplicates are removed. A simple plot of the data is also provided.

library(readxl)
library(httr)
GET("http://minethatdata.com/MineThatData_ForecastChallenge_20170914.xlsx",
    write_disk(tf <- tempfile(fileext = ".xlsx")))
## Response [http://minethatdata.com/MineThatData_ForecastChallenge_20170914.xlsx]
##   Date: 2017-12-03 17:03
##   Status: 200
##   Content-Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
##   Size: 21.8 kB
## <ON DISK>  /var/folders/lm/6kvyzp5d3ds_20dxw939n5hw0000gp/T//RtmpTHH0nN/file4fb92f432430.xlsx
sales <-
  read_excel(tf, sheet=1,
             col_names=TRUE, col_types=rep("numeric", 17))
sales <- sales[! duplicated(sales$Month),]
plot(sales$Month, sales$Rolling_12Month_Sales, type="l")

Validation loop

How do you set up the Bayesian priors for the model when you have no clear prior knowledge? In the case of the Forecasting Challenge, there is no information at all about the business context. A possible strategy is to use cross-validation, in a flavor adapted to time series, usually called forward chaining (see this Cross Validated answer or this blog post).
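As an illustration of how forward chaining splits the series, the following minimal sketch prints the train/test index ranges for the same scheme used later in this document (80 observations, a 3-month horizon and 8 folds):

# sketch only: train/test ranges produced by forward chaining
n <- 80; horizon <- 3; number_of_folds <- 8
for (fold in 1:number_of_folds) {
  l <- n - fold*horizon
  cat("fold", fold, ": train 1 -", l, "/ test", l+1, "-", l+horizon, "\n")
}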

The following code adapts this strategy to a bsts model, where:

  • data is the full dataset
  • horizon refers to the prediction horizon
  • number_of_folds is the number of forward-chaining iterations
  • ss.function is a function that builds the state.specification of the bsts model; a function (rather than a fixed object) is needed because the state specification extracts some parameters from the data it is given, and the training data changes in every iteration of the loop
  • niter, seed and burn: see ?bsts
  • the rest are for debugging and visualization

The loop computes three measures: the root mean squared error (rmse), plus two other measures commonly used in time series forecasting, mape (mean absolute percentage error) and mase (mean absolute scaled error).
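For reference, the three measures can be written as standalone helpers (a minimal sketch with hypothetical function names; the loop below computes them inline):

# sketch: the three error measures used in the validation loop
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
mape <- function(actual, predicted) mean(abs(actual - predicted) / actual) * 100
mase <- function(actual, predicted, naive_errors)
  mean(abs(actual - predicted)) / mean(abs(naive_errors))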

library(bsts)
# cross-validation loop
bsts.cv.loop <- function(data,
                         horizon,
                         number_of_folds,
                         ss.function,
                         niter=1000,
                         seed=1232,
                         burn=250,
                         do.plot=TRUE,
                         verbose=TRUE,
                         debug=TRUE) {
  rmse_v <- c()
  mape_v <- c()
  mase_v <- c()
  for (fold in 1:number_of_folds) {
    # construct data_train/data_test
    l <- length(data) - fold*horizon
    if (debug) print(l)
    data_train <- data[1:l]
    data_test <- data[(l+1):(l+horizon)]
    # fit model & predict
    model <- bsts(data_train,
                  state.specification=ss.function(data_train),
                  niter=niter,
                  seed=seed,
                  ping=0)
    pred <- predict(model, horizon=horizon, burn=burn)
    if (do.plot) {
      # plot
      plot(pred, plot.original = 36)
      lines((l+1):(l+horizon), data_test, col="red", type="l")
    }
    # evaluation
    errors <- data_test-pred$mean
    rmse <- sqrt(mean(errors^2))
    rmse_v <- c(rmse_v, rmse)
    mape <- mean(abs(errors)/data_test)*100
    mape_v <- c(mape_v, mape)
    # mase scales the mean absolute error by the mean absolute error of a
    # naive one-step forecast (previous value) over the series
    naive_prediction_errors <- diff(data, 1)
    mase <- mean(abs(errors)) / mean(abs(naive_prediction_errors))
    mase_v <- c(mase_v, mase)
    if (verbose) print(paste0("fold ", fold, ": mape ", mape, " / mase ", mase))
  }
  return(data.frame(rmse=rmse_v,mape=mape_v, mase=mase_v))
}

Initialization

# 0. setup
# 0.1. target series to predict, ordered by month
target <- sales$Rolling_12Month_Sales[order(sales$Month)]
plot(target, type="l")

# 0.2. general parameters for the prediction
horizon=3 # as in Hillstrom's challenge
number_of_folds=8 # 8 folds x 3 months = 24 held-out months, so the first 56 months (around 70% of the data) are always used for training

Selecting the right trend component and its parameters

After implementing the cross-validation loop, selecting the best trend component, together with its parameters, is a matter of brute force.

In the following loop, the validation is run for the local level trend component, which is the simplest one and has a single parameter to set.

# 1.1. fit "local level" and set best prior values
res_df1 <- c()
for (sigma_guess in c(0.01, 0.1, 0.5)) {
  res <-
    bsts.cv.loop(target,
                 horizon=horizon,
                 number_of_folds=number_of_folds,
                 ss.function=function(data_train) {
                   sdy <- sd(data_train)
                   sd.prior <- SdPrior(sigma.guess=sigma_guess*sdy,
                                       upper.limit=sdy,
                                       sample.size=16)
                   return(AddLocalLevel(list(),
                                        data_train,
                                        sigma.prior=sd.prior))
                 },
                 debug=FALSE,
                 verbose=FALSE)
  print(paste0("sigma_guess ", sigma_guess,
               ": mean(rmse) ", mean(res$rmse),
               " / mean(mape) ", mean(res$mape),
               " / mean(mase) ", mean(res$mase)))
  res_row <- data.frame(sigma_guess=sigma_guess,
                        mean_rmse=mean(res$rmse),
                        mean_mape=mean(res$mape),
                        mean_mase=mean(res$mase))
  res_df1 <- rbind(res_df1, res_row)
}
res_df1
write.csv(res_df1, file="cv_res_local_level.csv",
          row.names=FALSE)

These are the results. The differences in fit across the tested parameter values are minimal.

res_df1
##   sigma_guess mean_rmse mean_mape mean_mase
## 1        0.01   1113079  1.296845 0.9144091
## 2        0.10   1113191  1.296731 0.9143498
## 3        0.50   1121630  1.302850 0.9187451

In a second loop, the local linear trend component is tested; its two parameters are explored in a nested loop.

# 1.2. fit "local linear" and set best prior values
res_df2 <- c()
for (level_sigma_guess in c(0.01, 0.1, 0.5)) {
  for (slope_sigma_guess in c(0.01, 0.1, 0.5)) {
    res <-
      bsts.cv.loop(target,
                   horizon=horizon,
                   number_of_folds=number_of_folds,
                   ss.function=function(data_train) {
                     sdy <- sd(data_train)
                     sigma.prior <- SdPrior(sigma.guess=level_sigma_guess*sdy,
                                            upper.limit=sdy,
                                            sample.size=16)
                     slope.sigma.prior <- SdPrior(sigma.guess=slope_sigma_guess*sdy,
                                                  upper.limit=sdy,
                                                  sample.size=16)
                     return(AddLocalLinearTrend(list(),
                                                data_train,
                                                level.sigma.prior=sigma.prior,
                                                slope.sigma.prior=slope.sigma.prior))
                   },
                   debug=FALSE,
                   verbose=FALSE)
    print(paste0("level_sigma_guess ", level_sigma_guess, " / slope_sigma_guess ", slope_sigma_guess,
                 ": mean(rmse) ", mean(res$rmse),
                 " / mean(mape) ", mean(res$mape),
                 " / mean(mase) ", mean(res$mase)))
    res_row <- data.frame(level_sigma_guess=level_sigma_guess,
                          slope_sigma_guess=slope_sigma_guess,
                          mean_rmse=mean(res$rmse),
                          mean_mape=mean(res$mape),
                          mean_mase=mean(res$mase))
    res_df2 <- rbind(res_df2, res_row)
  }
}
res_df2
write.csv(res_df2, file="cv_res_local_linear_level.csv",
          row.names=FALSE)

Note that this component performs consistently worse than the local level for the series at hand (see the quick comparison after the table below).

res_df2
##   level_sigma_guess slope_sigma_guess mean_rmse mean_mape mean_mase
## 1              0.01              0.01   1364297  1.481599 1.0443783
## 2              0.01              0.10   1703694  1.840721 1.3013977
## 3              0.01              0.50   1744992  1.929831 1.3665115
## 4              0.10              0.01   1211363  1.331118 0.9428945
## 5              0.10              0.10   1577616  1.690251 1.1950030
## 6              0.10              0.50   1737569  1.918044 1.3579408
## 7              0.50              0.01   1244229  1.361862 0.9660299
## 8              0.50              0.10   1230068  1.311644 0.9268897
## 9              0.50              0.50   1624604  1.758403 1.2439698
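A quick way to verify this is to compare the best row of each table by mean mape (a small sketch using the result data frames built above):

# best parameter combination of each trend component, by mean mape
res_df1[which.min(res_df1$mean_mape),]
res_df2[which.min(res_df2$mean_mape),]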

Finally, the semi-local linear trend is tried. This is the most complex trend component available in the library, and it has four parameters to set.

res_df3 <- c()
for (level_sigma_guess in c(0.01, 0.1, 0.5)) {
  for (slope_mean_guess in c(0.01, 0.1, 0.5)) {
    for (slope_sigma_guess in c(0.01, 0.1, 0.5)) {
      for (ar_sigma_guess in c(0.01, 0.1, 0.5, 1)) {
        res <-
          bsts.cv.loop(target,
                       horizon=horizon,
                       number_of_folds=number_of_folds,
                       ss.function=function(data_train) {
                         sdy <- sd(data_train)
                         sigma.prior <- SdPrior(sigma.guess=level_sigma_guess*sdy,
                                                upper.limit=sdy,
                                                sample.size=16)
                         slope.mean.prior <- NormalPrior(mu=0,
                                                         sigma=slope_mean_guess*sdy)
                         slope.ar1.prior <- Ar1CoefficientPrior(mu=0,
                                                                sigma=1*ar_sigma_guess)
                         slope.sigma.prior <- SdPrior(sigma.guess=slope_sigma_guess*sdy,
                                                      upper.limit=sdy,
                                                      sample.size=16)
                         return(AddSemilocalLinearTrend(list(),
                                                        data_train,
                                                        level.sigma.prior=sigma.prior,
                                                        slope.mean.prior=slope.mean.prior,
                                                        slope.ar1.prior=slope.ar1.prior,
                                                        slope.sigma.prior=slope.sigma.prior))
                       },
                       debug=FALSE,
                       verbose=FALSE)
        print(paste0("level_sigma_guess ", level_sigma_guess,
                     " / slope_mean_guess ", slope_mean_guess,
                     " / slope_sigma_guess ", slope_sigma_guess,
                     " / ar_sigma_guess ", ar_sigma_guess,
                     ": mean(rmse) ", mean(res$rmse),
                     " / mean(mape) ", mean(res$mape),
                     " / mean(mase) ", mean(res$mase)))
        res_row <- data.frame(level_sigma_guess=level_sigma_guess,
                              slope_mean_guess=slope_mean_guess,
                              slope_sigma_guess=slope_sigma_guess,
                              ar_sigma_guess=ar_sigma_guess,
                              mean_rmse=mean(res$rmse),
                              mean_mape=mean(res$mape),
                              mean_mase=mean(res$mase))
        res_df3 <- rbind(res_df3, res_row)
      }
    }
  }
}
res_df3
write.csv(res_df3, file="cv_res_semi_local_linear_level.csv",
          row.names=FALSE)

The following are the values around the minimum. Note that they are only slightly better than those obtained with the local level component; still, the prediction at horizon 3 should be more sensible for the semi-local linear trend, since the series is modeled as trend only and the semi-local trend's slope reverts to a long-run mean, which keeps multi-step forecasts better behaved.

res_df3[res_df3$mean_mape <= min(res_df3$mean_mape)*1.01 &
          res_df3$mean_mape >= min(res_df3$mean_mape)*0.99,]
##    level_sigma_guess slope_mean_guess slope_sigma_guess ar_sigma_guess
## 14              0.01             0.10              0.01            0.1
## 26              0.01             0.50              0.01            0.1
## 39              0.10             0.01              0.01            0.5
## 40              0.10             0.01              0.01            1.0
## 51              0.10             0.10              0.01            0.5
## 63              0.10             0.50              0.01            0.5
## 64              0.10             0.50              0.01            1.0
##    mean_rmse mean_mape mean_mase
## 14   1102451  1.204493 0.8541107
## 26   1105388  1.205426 0.8548579
## 39   1111522  1.204680 0.8514994
## 40   1110975  1.204386 0.8512959
## 51   1102954  1.211028 0.8554464
## 63   1104017  1.211963 0.8562478
## 64   1102445  1.210628 0.8552218

Chosen model

This is the final fit of the chosen model, with the semi-local linear trend and the parameters selected in the section above (level_sigma_guess=0.01, slope_mean_guess=0.1, slope_sigma_guess=0.01, ar_sigma_guess=0.1), used to obtain the predictions.

sdy <- sd(target)
level.sigma.prior <- SdPrior(sigma.guess=0.01*sdy,
                       upper.limit=sdy,
                       sample.size=16)
slope.ar1.prior <- Ar1CoefficientPrior(mu=0,
                                       sigma=0.1)
slope.mean.prior <- NormalPrior(mu=0,
                                sigma=0.1*sdy)
slope.sigma.prior <- SdPrior(sigma.guess=0.01*sdy,
                             upper.limit=sdy,
                             sample.size=16)
state.specification <- AddSemilocalLinearTrend(list(),
                                               target,
                                               level.sigma.prior=level.sigma.prior,
                                               slope.ar1.prior=slope.ar1.prior,
                                               slope.mean.prior=slope.mean.prior,
                                               slope.sigma.prior=slope.sigma.prior)
chosen_model <- bsts(target,
                     state.specification=state.specification,
                     niter=1000,
                     seed=1232)
pred <- predict(chosen_model, horizon = 3, burn = 250)

This is a simple analysis of the residuals; the results look reasonable. There is a very weak annual seasonality, already detected in the exploratory analysis, which is not captured by the trend-only model (a seasonal sketch follows the residual plots below).

one_step_errors <- colMeans(chosen_model$one.step.prediction.errors)
plot(one_step_errors, type="l")

acf(one_step_errors)
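If one wanted to capture that weak annual seasonality, a 12-month seasonal component could in principle be stacked on top of the chosen trend with AddSeasonal. This is a sketch only, not part of the submitted trend-only solution:

# sketch: chosen trend plus a 12-month seasonal component (not used in the submission)
ss_with_season <- AddSeasonal(state.specification, target, nseasons=12)
seasonal_model <- bsts(target,
                       state.specification=ss_with_season,
                       niter=1000,
                       seed=1232)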

And finally, this is a plot of the predictions; the uncertainty margins are quite wide, but the forecast still looks reasonable for a trend-only model.

plot(pred, plot.original = 80)

pred$mean
## [1] 81532403 81236545 81002410
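The width of those margins can be inspected from the prediction object itself; assuming the default quantiles of predict.bsts, pred$interval holds the 2.5% and 97.5% bounds for each forecast month (a small check, not part of the original write-up):

# credible-interval bounds for each of the 3 forecast months
pred$interval
# interval width relative to the point forecast
(pred$interval[2,] - pred$interval[1,]) / pred$mean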

If you found the above interesting, try reading this related article, which uses prophet instead of bsts to solve the same prediction problem.