forecast Package Crash Course

Author

Mario Annau

Published

November 21, 2016

Influenza (Google Trends)

# from https://www.google.com/trends/explore?cat=45&date=all&geo=AT&q=influenza
dat.raw <- read.csv("forecast_intro/Influenza.csv", header = TRUE, skip = 1, stringsAsFactors = FALSE)
dat.ts <- ts(dat.raw[, 2], frequency = 12, start = c(2004, 1))
plot(dat.ts)

dat.ts

     Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2004  47  12  30  18  16  11  11  11  10  29   9  27
2005  11 100  21  16   7   7  25  20  36  88  22   6
2006  38  26  41  25   5   5   8  10  17  16  13   8
2007  23  46  26   9   8  19   8  20  11  10  23  16
2008  60  27  12   5   8  10   6   8   8  16  13  18
2009  73  26  12  39  29  13  18  15  15  40  69  25
2010  18   8  11   5   7   6   4   3   9  12  12  12
2011  30  39  18   7   8   4   3   5   4  11  12   8
2012  10  32  29  11   5   9   3   5  10  12  11   8
2013  37  56  35  16   5   6   4   2   7  13  15  15
2014  14  26  26  12   6   8   3   5   6  12  11  13
2015  34  66  44  17   7   6   3   5   6  12  15  12
2016  26  80  41  16   6  10   2   6   7  16  21

A time series can be decomposed into three components: a seasonal component \[S_t\], a trend-cycle component \[T_t\], and a residual component \[E_t\].

\[y_t = S_t + T_t + E_t\]

If the seasonal component depends on the absolute level of the time series, we can use a multiplicative model instead:

\[y_t = S_t \times T_t \times E_t\]

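Taking the logarithm of the series turns the multiplicative model into an additive one, so the additive decomposition can then be applied to the log series.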
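Written out, and assuming all values are strictly positive so that the logarithm is defined, the multiplicative model becomes additive in the logged components:

\[\log y_t = \log S_t + \log T_t + \log E_t\]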

dat.ts.decomp <- decompose(dat.ts, type = "additive")
plot(dat.ts.decomp)

dat.ts.noseason <- dat.ts - dat.ts.decomp$seasonal
plot(dat.ts.noseason)
lines(dat.ts.decomp$trend, col = "red")
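As a quick sanity check of the additive model, the sketch below verifies that the three components returned by decompose() add back up to the series. It uses the built-in nottem dataset as a stand-in for dat.ts, an assumption made only so that the snippet runs without the CSV file.

```r
# Stand-in data: monthly temperatures shipped with base R (not the flu series)
x <- nottem
dec <- decompose(x, type = "additive")

# Additive model: y_t = S_t + T_t + E_t
recon <- dec$seasonal + dec$trend + dec$random

# The centred moving average leaves NA at both ends of trend and random
ok <- !is.na(recon)
all.equal(as.numeric(recon[ok]), as.numeric(x[ok]))

# Number of missing trend values (half a year at each end for monthly data)
sum(is.na(dec$trend))
```

The missing values at both ends are also why the red trend line in the plot above starts and ends half a year inside the series.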

\[ARIMA(p,d,q)\] models can be used to model and forecast the series; they are specified as \[\dots\]

(p, d, q) are the AR order, the degree of differencing, and the MA order.
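For reference, one common way to write an ARIMA(p,d,q) model uses the backshift operator \[B\], defined by \[B y_t = y_{t-1}\]. The symbols \[\phi_i\], \[\theta_j\], \[c\] (a constant) and \[\varepsilon_t\] (white noise) are notation assumed here, and the signs follow the convention used by R's arima():

\[(1 - \phi_1 B - \dots - \phi_p B^p)(1 - B)^d y_t = c + (1 + \theta_1 B + \dots + \theta_q B^q)\varepsilon_t\]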

First, we check whether the time series is stationary. If the plot leaves this unclear, we can use the Augmented Dickey–Fuller (ADF) test or the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test:

tseries::adf.test(dat.ts, alternative = "stationary")

Warning in tseries::adf.test(dat.ts, alternative = "stationary"): p-value
smaller than printed p-value


    Augmented Dickey-Fuller Test

data:  dat.ts
Dickey-Fuller = -6.2372, Lag order = 5, p-value = 0.01
alternative hypothesis: stationary

tseries::kpss.test(dat.ts)

Warning in tseries::kpss.test(dat.ts): p-value greater than printed p-value


    KPSS Test for Level Stationarity

data:  dat.ts
KPSS Level = 0.27746, Truncation lag parameter = 4, p-value = 0.1

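The ADF test rejects its null hypothesis of a unit root (p-value below 0.01), and the KPSS test does not reject its null hypothesis of level stationarity (p-value above 0.1). Since no differencing is required, we set \[d=0\] and check the ACF and PACF plots: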

acf(dat.ts, lag.max = 25)

pacf(dat.ts, lag.max = 25)
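As a rough guide to reading such plots, the sketch below simulates a known AR(2) process and lists the lags whose partial autocorrelation falls outside the approximate 95% band. The coefficients (0.5, 0.3), the sample size and the seed are arbitrary choices for illustration, not values taken from the flu series.

```r
set.seed(1)  # arbitrary seed, for reproducibility only
n <- 2000

# Simulate a stationary AR(2) process: y_t = 0.5 y_{t-1} + 0.3 y_{t-2} + e_t
y <- arima.sim(model = list(ar = c(0.5, 0.3)), n = n)

# Partial autocorrelations for lags 1 to 10, without drawing the plot
pv <- as.numeric(pacf(y, lag.max = 10, plot = FALSE)$acf)

# Approximate 95% band, the same kind of dashed line that pacf() draws
bound <- qnorm(0.975) / sqrt(n)

round(pv, 3)
which(abs(pv) > bound)  # lags with a clearly non-zero partial autocorrelation
```

For an AR(p) process the PACF is expected to drop inside the band after lag p while the ACF decays gradually; for an MA(q) process the roles are reversed.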

The plots suggest a second-order autoregressive process (the AR order is read from the lag after which the PACF cuts off), so we set \[p=2\]. Let’s check the residuals of this model:

pacf(arima(dat.ts, order= c(2, 0, 0))$residuals)

Oops, the PACF plot shows that a seasonal pattern is still left in the residuals. We add a seasonal AR term of order \[P=1\]:

pacf(arima(dat.ts, order= c(2, 0, 0), seasonal = c(1, 0, 0))$residuals)

We then increase the seasonal AR order to two, since a seasonal effect remains:

pacf(arima(dat.ts, order= c(2, 0, 0), seasonal = c(2, 0, 0))$residuals)

auto.arima selects the same seasonal part, (2,0,0)[12], but chooses an MA(1) term instead of our AR(2) for the non-seasonal part:

library(forecast)  # provides auto.arima() and forecast()
mod1 <- auto.arima(dat.ts)
mod1

Series: dat.ts 
ARIMA(0,0,1)(2,0,0)[12] with non-zero mean 

Coefficients:
         ma1    sar1    sar2     mean
      0.2590  0.1580  0.4073  19.0895
s.e.  0.0798  0.0771  0.0929   2.8897

sigma^2 = 211.4:  log likelihood = -635.47
AIC=1280.95   AICc=1281.35   BIC=1296.17
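To compare the hand-built ARIMA(2,0,0)(2,0,0)[12] with the ARIMA(0,0,1)(2,0,0)[12] reported above, one option is to fit both with base R's arima() and look at the AIC. The sketch below is self-contained: the values are transcribed from the printed dat.ts, and the names flu, fit_manual and fit_auto are introduced here for illustration only.

```r
# Monthly Google Trends values transcribed from the printed dat.ts above
flu <- ts(c(
  47,  12, 30, 18, 16, 11, 11, 11, 10, 29,  9, 27,
  11, 100, 21, 16,  7,  7, 25, 20, 36, 88, 22,  6,
  38,  26, 41, 25,  5,  5,  8, 10, 17, 16, 13,  8,
  23,  46, 26,  9,  8, 19,  8, 20, 11, 10, 23, 16,
  60,  27, 12,  5,  8, 10,  6,  8,  8, 16, 13, 18,
  73,  26, 12, 39, 29, 13, 18, 15, 15, 40, 69, 25,
  18,   8, 11,  5,  7,  6,  4,  3,  9, 12, 12, 12,
  30,  39, 18,  7,  8,  4,  3,  5,  4, 11, 12,  8,
  10,  32, 29, 11,  5,  9,  3,  5, 10, 12, 11,  8,
  37,  56, 35, 16,  5,  6,  4,  2,  7, 13, 15, 15,
  14,  26, 26, 12,  6,  8,  3,  5,  6, 12, 11, 13,
  34,  66, 44, 17,  7,  6,  3,  5,  6, 12, 15, 12,
  26,  80, 41, 16,  6, 10,  2,  6,  7, 16, 21
), frequency = 12, start = c(2004, 1))

# Hand-built candidate from the PACF inspection
fit_manual <- arima(flu, order = c(2, 0, 0), seasonal = c(2, 0, 0))

# Specification reported by auto.arima()
fit_auto <- arima(flu, order = c(0, 0, 1), seasonal = c(2, 0, 0))

# A lower AIC indicates a better trade-off between fit and complexity
c(manual = AIC(fit_manual), auto = AIC(fit_auto))
```

The seasonal period is not given explicitly; arima() takes it from frequency(flu), which is 12 here.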

fc <- forecast(mod1, 24)
plot(fc)

fc

         Point Forecast      Lo 80    Hi 80       Lo 95    Hi 95
Dec 2016      16.875426 -1.7572289 35.50808 -11.6207604 45.37161
Jan 2017      26.254943  7.0074396 45.50245  -3.1815731 55.69146
Feb 2017      47.822509 28.5750062 67.07001  18.3859935 77.25903
Mar 2017      32.698517 13.4510141 51.94602   3.2620014 62.13503
Apr 2017      17.750197 -1.4973058 36.99770 -11.6863185 47.18671
May 2017      12.096730 -7.1507734 31.34423 -17.3397861 41.53325
Jun 2017      12.321490 -6.9260133 31.56899 -17.1150260 41.75801
Jul 2017       9.835343 -9.4121604 29.08285 -19.6011731 39.27186
Aug 2017      11.282079 -7.9654241 30.52958 -18.1544369 40.71859
Sep 2017      11.847426 -7.4000774 31.09493 -17.5890901 41.28394
Oct 2017      15.713570 -3.5339327 34.96107 -13.7229454 45.15009
Nov 2017      17.725654 -1.5218497 36.97316 -11.7108624 47.16217
Dec 2017      15.851906 -3.6194988 35.32331 -13.9270380 45.63085
Jan 2018      23.036626  3.5502923 42.52296  -6.7651497 52.83840
Feb 2018      48.440334 28.9539998 67.92667  18.6385578 78.24211
Mar 2018      30.164729 10.6783956 49.65106   0.3629536 59.96651
Apr 2018      17.619441 -1.8668931 37.10577 -12.1823351 47.42122
May 2018      12.652818 -6.8335157 32.13915 -17.1489577 42.45459
Jun 2018      14.317636 -5.1686973 33.80397 -15.4841393 44.11941
Jul 2018      10.666169 -8.8201648 30.15250 -19.1356068 40.46794
Aug 2018      12.524086 -6.9622480 32.01042 -17.2776899 42.32586
Sep 2018      13.020748 -6.4655857 32.50708 -16.7810277 42.82252
Oct 2018      17.297610 -2.1887237 36.78394 -12.5041657 47.09939
Nov 2018      19.652189  0.1658554 39.13852 -10.1495866 49.45396
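The interval columns can be reproduced by hand for the first forecast. The sketch below assumes normally distributed errors, which is what forecast() uses by default, and copies the point forecast and sigma^2 from the output above. Because sigma^2 is printed in rounded form, the result should agree with the Dec 2016 row only to about two decimals.

```r
# Values copied from the mod1 and fc output above
point  <- 16.875426  # point forecast for Dec 2016
sigma2 <- 211.4      # estimated innovation variance (rounded in the printout)

# For a one-step-ahead forecast the standard error is sqrt(sigma^2)
se <- sqrt(sigma2)

# Normal quantiles for two-sided 80% and 95% intervals
z80 <- qnorm(0.90)
z95 <- qnorm(0.975)

iv <- c(lo80 = point - z80 * se, hi80 = point + z80 * se,
        lo95 = point - z95 * se, hi95 = point + z95 * se)
round(iv, 2)
```

The lower bounds are negative even though the search index itself cannot be, because the normal intervals are symmetric around the point forecast and know nothing about that constraint.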

Company Data (AMZN)

library(XML)  # provides htmlTreeParse(), getNodeSet() and xmlValue()

# One URL per quarter, e.g. .../Revenue/2000/Q1
urls <- paste("http://www.wikinvest.com/stock/Amazon.com_(AMZN)/Data/Revenue",
              rep(2000:2015, each = 4), sprintf("Q%d", 1:4), sep = "/")
# Scrape the revenue string from each page
vals <- sapply(urls, function(x) {
  tree <- htmlTreeParse(x, useInternalNodes = TRUE)
  xmlValue(getNodeSet(tree, "//div[@id='nv_value']")[[1]])
})
# Extract the first number in each string (NA if there is none)
m <- regexpr("[0-9.]+", vals)
numvals <- as.numeric(substr(vals, m, m + attr(m, "match.length") - 1))
# Put values quoted in billions on the same scale as the remaining values
numvals <- ifelse(grepl("billion", vals), numvals * 1000, numvals)
amzn.rev <- ts(numvals, start = c(2000, 1), frequency = 4)

plot(amzn.rev)

We can easily see that we should either (1) use the log scale or (2) difference the series.

amzn.rev.log <- log(amzn.rev)
plot(amzn.rev.log)

Now it’s easy to see the trend and the seasonal component. We are now lazy and use auto.arima again:

mod2 <- auto.arima(amzn.rev.log)
mod2

Series: amzn.rev.log 
ARIMA(1,0,0)(2,1,0)[4] with drift 

Coefficients:
         ar1     sar1     sar2   drift
      0.8721  -0.5769  -0.2558  0.0597
s.e.  0.0692   0.1473   0.1417  0.0058

sigma^2 = 0.002051:  log likelihood = 101.54
AIC=-193.09   AICc=-191.98   BIC=-182.61

plot(forecast(mod2, 12))
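Forecasts from a model fitted to the log series are on the log scale. The sketch below shows two ways to get back to the original scale: undoing the log by hand, or passing lambda = 0 (a Box-Cox transformation equal to the natural log) to auto.arima() so that forecast() back-transforms automatically. It uses the built-in AirPassengers series as a stand-in, since the scraped AMZN data is not reproduced here; the names fit_log, fit_bc and back are introduced for illustration.

```r
library(forecast)

x <- AirPassengers  # built-in stand-in series, not the AMZN revenue data

# Option 1: model log(x) and undo the log by hand
fit_log <- auto.arima(log(x))
fc_log <- forecast(fit_log, h = 12)
back <- exp(cbind(point = fc_log$mean,
                  lo95  = fc_log$lower[, 2],   # second column is the 95% level
                  hi95  = fc_log$upper[, 2]))

# Option 2: let forecast handle the transformation (lambda = 0 means log)
fit_bc <- auto.arima(x, lambda = 0)
fc_bc <- forecast(fit_bc, h = 12)

head(back, 4)
head(fc_bc$mean, 4)
```

Under the normality assumption, the exponentiated point forecast is the median rather than the mean on the original scale, and the back-transformed intervals are asymmetric and always positive.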

mod3 <- auto.arima(amzn.rev)
mod3

Series: amzn.rev 
ARIMA(0,1,0)(2,1,0)[4] 

Coefficients:
        sar1    sar2
      0.2688  0.5529
s.e.  0.1211  0.1226

sigma^2 = 149679:  log likelihood = -436.59
AIC=879.17   AICc=879.61   BIC=885.4

fc3 <- forecast(mod3, 12)
plot(fc3)

fc3

        Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
2016 Q1       28576.30 28080.49 29072.12 27818.03 29334.58
2016 Q2       29263.61 28562.42 29964.79 28191.24 30335.98
2016 Q3       31600.70 30741.93 32459.47 30287.32 32914.07
2016 Q4       42569.82 41578.19 43561.44 41053.26 44086.37
2017 Q1       34824.38 33324.80 36323.97 32530.96 37117.80
2017 Q2       36051.12 34176.50 37925.74 33184.14 38918.10
2017 Q3       38947.31 36761.07 41133.55 35603.75 42290.88
2017 Q4       50978.86 48520.18 53437.53 47218.64 54739.08
2018 Q1       42768.06 39557.90 45978.22 37858.54 47677.57
2018 Q2       44259.96 40443.53 48076.39 38423.23 50096.69
2018 Q3       47398.85 43060.04 51737.65 40763.22 54034.47
2018 Q4       60036.20 55231.49 64840.92 52688.03 67384.38

Reading Materials

Forecasting: Principles and Practice by Rob J Hyndman and George Athanasopoulos