Time Series Analysis Part 5 – Oxford Temperature

June 11, 2022 | Linear Regression, Regression, Statistics, Time Series

Over the past four articles the focus has been on applying time series techniques to generated time series data. The last article walked through the build-up of a multiple linear regression model that incorporated time lags and trend as features, resulting in a model with an MSE comparable to the SARIMA models we've been working with.

Here, for the first time in this series of articles, we'll be working with real-life data: the monthly maximum temperature from the Oxford temperature record since 1853, published by the Met Office.
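For concreteness, here is a minimal sketch of loading the series into pandas. The raw Met Office file layout isn't shown in this article, so the pre-cleaned CSV (columns year, month, tmax) and the filename below are assumptions.

    import pandas as pd

    # Hypothetical pre-cleaned CSV (columns: year, month, tmax) built from the
    # Met Office Oxford station data; the filename and layout are assumptions.
    df = pd.read_csv("oxford_tmax.csv")

    # Index by month and keep the monthly maximum temperature as a series
    df["date"] = pd.to_datetime(df[["year", "month"]].assign(day=1))
    tmax = df.set_index("date")["tmax"].asfreq("MS")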

This is what this data looks like:

It's quite difficult to see any seasonality here, or anything else we can relate to, because we are visualising too much information at once. Here's the same data for just the last 5 years:

There appears to be a 12-month seasonality here. Over the past 5 years the maximum temperature has ranged from about 7 to 25 degrees.
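Both views above can be reproduced from the tmax series in the loading sketch, for example:

    import matplotlib.pyplot as plt

    # Reproduce the two views: the full record and the last 5 years (60 months)
    fig, axes = plt.subplots(2, 1, figsize=(10, 6))
    tmax.plot(ax=axes[0], title="Monthly maximum temperature, Oxford (full record)")
    tmax.iloc[-60:].plot(ax=axes[1], title="Last 5 years")
    for ax in axes:
        ax.set_ylabel("Max temperature (°C)")
    plt.tight_layout()
    plt.show()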

The data has a total of 2032 data points, which we will split into a Train set (1828 points) and a Test set (204 points).
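Continuing the sketch above, the split itself is a simple positional slice:

    # 2032 monthly points split into Train (1828) and Test (204), as stated above
    train = tmax.iloc[:1828]
    test = tmax.iloc[1828:]
    print(len(train), len(test))  # 1828 204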

SARIMA

If we look at the ACF plot (autocorrelogram), we can see a 12-month seasonality:
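The autocorrelogram can be drawn with statsmodels, reusing the train series from the earlier sketches:

    import matplotlib.pyplot as plt
    from statsmodels.graphics.tsaplots import plot_acf

    # ACF of the training series; spikes at lags 12, 24, 36, ... reflect the
    # 12-month seasonality
    fig, ax = plt.subplots(figsize=(10, 4))
    plot_acf(train, lags=48, ax=ax)
    plt.show()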

Differencing the data, removing the seasonality and plotting the ACF and PACF, we see evidence of an ARMA(1,2) structure.
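A sketch of that step is below. The article doesn't spell out how the seasonality was removed, so the combination of a first difference and a lag-12 seasonal difference is an assumption.

    import matplotlib.pyplot as plt
    from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

    # One plausible reading of "differencing and removing seasonality":
    # a first difference followed by a lag-12 seasonal difference
    deseasonalised = train.diff().diff(12).dropna()

    fig, axes = plt.subplots(2, 1, figsize=(10, 6))
    plot_acf(deseasonalised, lags=36, ax=axes[0])                 # MA order from the ACF
    plot_pacf(deseasonalised, lags=36, ax=axes[1], method="ywm")  # AR order from the PACF
    plt.tight_layout()
    plt.show()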

The seasonal component has an ARMA(1,3) structure, judging from the seasonal ACF and PACF plots below.

The ACF and PACF of the seasonal component. The ACF here tells us that the first seasonal lag (lag 12) and the kth seasonal lag (lag 12k) are not correlated for k > 3 (or k > 4). The PACF tells us that once you've controlled for the first seasonal lag, the second one doesn't provide much additional information.
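The behaviour described in this caption can also be read off numerically by restricting the ACF/PACF to the seasonal lags, reusing the deseasonalised series from the previous sketch:

    from statsmodels.tsa.stattools import acf, pacf

    # ACF/PACF of the differenced series at the seasonal lags (12, 24, 36, ...),
    # which is what the seasonal-component plots summarise
    nlags = 60
    acf_vals = acf(deseasonalised, nlags=nlags)
    pacf_vals = pacf(deseasonalised, nlags=nlags, method="ywm")

    for lag in range(12, nlags + 1, 12):
        print(f"lag {lag:2d}: ACF={acf_vals[lag]:+.3f}  PACF={pacf_vals[lag]:+.3f}")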

Performing a stepwise search over the SARIMA parameters, starting from SARIMA(1,0,3)(1,0,3,12) and selecting the model with the lowest MSE on a hold-out set, we arrive at the best-performing model of SARIMA(3,1,1)(1,0,1,12). Note that the hold-out set here is not the Test set of 204 data points we set aside at the beginning:

A stepwise search minimising the MSE on a hold-out set of 100 data points
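The exact search procedure isn't shown, but the idea can be sketched as follows, reusing the train series from the earlier sketches. The neighbourhood of moves, the hold-out size of 100 and the handling of failed fits are assumptions.

    import numpy as np
    from statsmodels.tsa.statespace.sarimax import SARIMAX

    def holdout_mse(series, order, seasonal_order, holdout=100):
        """Fit a SARIMA on everything except the last `holdout` points and
        return the MSE of the forecasts for those points."""
        fit_part, hold_part = series.iloc[:-holdout], series.iloc[-holdout:]
        result = SARIMAX(fit_part, order=order, seasonal_order=seasonal_order).fit(disp=False)
        forecast = result.forecast(steps=holdout)
        return float(np.mean((hold_part.values - np.asarray(forecast)) ** 2))

    def stepwise_search(series, start=(1, 0, 3), seasonal_start=(1, 0, 3)):
        """Simplified stepwise search: try +/-1 perturbations of each order term
        and keep any move that lowers the hold-out MSE, until no move helps."""
        best_order, best_seasonal = start, seasonal_start
        best_mse = holdout_mse(series, best_order, best_seasonal + (12,))
        improved = True
        while improved:
            improved = False
            candidates = []
            for i in range(3):
                for delta in (-1, 1):
                    o = list(best_order)
                    o[i] = max(0, o[i] + delta)
                    candidates.append((tuple(o), best_seasonal))
                    s = list(best_seasonal)
                    s[i] = max(0, s[i] + delta)
                    candidates.append((best_order, tuple(s)))
            for order, seasonal in candidates:
                try:
                    mse = holdout_mse(series, order, seasonal + (12,))
                except Exception:
                    continue  # skip parameter combinations that fail to fit
                if mse < best_mse:
                    best_mse, best_order, best_seasonal = mse, order, seasonal
                    improved = True
        return best_order, best_seasonal, best_mse

    best_order, best_seasonal, best_mse = stepwise_search(train)
    print(best_order, best_seasonal, best_mse)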

This model has an average MSE of 2.8 under 10-fold time series cross-validation (the same approach as in Time Series Analysis Part 3 – Assessing Model Fit). In each fold the model is asked to predict data it has not seen before (i.e. points from the original Test set we set aside). The fit on the Test set looks like this:

SARIMA(3,1,1)(1,0,1,12) predictions
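The fold construction isn't fully specified beyond the description above; one plausible reading, reusing train and test from the earlier sketches, is an expanding-window evaluation over the Test period:

    import numpy as np
    import pandas as pd
    from statsmodels.tsa.statespace.sarimax import SARIMAX

    def time_series_cv_mse(train, test, order, seasonal_order, n_folds=10):
        """Expanding-window evaluation over the Test period: each fold refits on
        all data before the fold and forecasts the fold's points."""
        fold_edges = np.linspace(0, len(test), n_folds + 1, dtype=int)
        mses = []
        for start, end in zip(fold_edges[:-1], fold_edges[1:]):
            history = pd.concat([train, test.iloc[:start]])
            result = SARIMAX(history, order=order, seasonal_order=seasonal_order).fit(disp=False)
            forecast = result.forecast(steps=end - start)
            mses.append(np.mean((test.iloc[start:end].values - np.asarray(forecast)) ** 2))
        return float(np.mean(mses))

    print(time_series_cv_mse(train, test, order=(3, 1, 1), seasonal_order=(1, 0, 1, 12)))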

Linear Regression

Here we loop through different numbers of lags and seasonal components in a regression and identify the feature set that results in the lowest MSE on a hold-out set. Doing this, we find that the optimal choice is 2 lags and 1 seasonal lag:
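The search can be sketched like this, reusing train from the earlier sketches. The exact feature construction, the lag ranges searched and the hold-out size are not stated in the article, so they are assumptions here.

    import pandas as pd
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error

    def make_features(series, n_lags, n_seasonal_lags, season=12):
        """Build a design matrix of ordinary lags (t-1 .. t-n_lags) and seasonal
        lags (t-12, t-24, ...); this construction is an assumption."""
        cols = {f"lag{k}": series.shift(k) for k in range(1, n_lags + 1)}
        cols.update({f"slag{k}": series.shift(season * k) for k in range(1, n_seasonal_lags + 1)})
        data = pd.concat([series.rename("y"), pd.DataFrame(cols)], axis=1).dropna()
        return data.drop(columns="y"), data["y"]

    best = None
    holdout = 100  # size of the hold-out used to compare feature sets (assumed)
    for n_lags in range(1, 7):
        for n_seasonal in range(0, 3):
            X, y = make_features(train, n_lags, n_seasonal)
            X_fit, y_fit = X.iloc[:-holdout], y.iloc[:-holdout]
            X_hold, y_hold = X.iloc[-holdout:], y.iloc[-holdout:]
            mse = mean_squared_error(
                y_hold, LinearRegression().fit(X_fit, y_fit).predict(X_hold)
            )
            if best is None or mse < best[0]:
                best = (mse, n_lags, n_seasonal)

    print(best)  # the article reports 2 lags and 1 seasonal lag as the optimum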

The regression output is shown below (only coefficients that are significant are retained to improve generalisability):
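A sketch of that pruning step, assuming a 5% significance threshold (the article does not state the exact threshold) and reusing make_features and train from the earlier sketches:

    import statsmodels.api as sm

    # Fit the chosen regression (2 lags + 1 seasonal lag), drop features whose
    # coefficients are not significant at the 5% level (assumed), then refit
    X, y = make_features(train, n_lags=2, n_seasonal_lags=1)
    ols = sm.OLS(y, sm.add_constant(X)).fit()
    keep = ols.pvalues[ols.pvalues < 0.05].index.drop("const", errors="ignore")
    reduced = sm.OLS(y, sm.add_constant(X[keep])).fit()
    print(reduced.summary())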

This model's fit on the test set doesn't look too bad:

The predicted time series from a Linear Regression fit on the data before the orange line. Note that these predictions are forecasts: we predict the next time point, which then becomes the lag-1 value for the following prediction.
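A minimal sketch of that recursive forecasting loop, reusing make_features, train and test from the earlier sketches (the helper name and feature ordering are assumptions):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    def recursive_forecast(model, history, steps, n_lags=2, season=12):
        """Forecast `steps` points ahead, feeding each prediction back in so it
        becomes the lag-1 value for the next step."""
        values = list(history.values)
        preds = []
        for _ in range(steps):
            row = [values[-k] for k in range(1, n_lags + 1)]  # lag1, lag2
            row.append(values[-season])                       # seasonal lag (lag 12)
            y_hat = model.predict(np.array(row).reshape(1, -1))[0]
            preds.append(y_hat)
            values.append(y_hat)
        return np.array(preds)

    X_train, y_train = make_features(train, n_lags=2, n_seasonal_lags=1)
    lr = LinearRegression().fit(X_train, y_train)
    forecast = recursive_forecast(lr, train, steps=len(test))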

However, the 10-fold time series cross-validated average MSE over the test set is 4.3, a little higher than the 2.8 obtained using SARIMA.

Next up, we continue the journey by entering the realm of Neural Networks…

