Time Series Analysis
Introduction to Time Series
Definition and Examples
- A time series is a series of data points or observations recorded at different or regular time intervals, e.g., hourly, daily, monthly, quarterly, yearly, etc.
- Time Series Analysis is the process of analyzing time series data to extract meaningful statistics and other characteristics of the data such as trends, cycles, and seasonal patterns.
- Time-Series Forecasting is the process of using statistical models to predict future values based on past values.
Applications
- Finance: Time series analysis is used for stock price prediction, risk management, etc.
- Economics: It helps in understanding and forecasting economic indicators, GDP growth, and inflation rates.
- Meteorology: Time series data is crucial for weather forecasting, climate change studies, and analyzing seasonal patterns.
- Healthcare: Time series analysis is used for analyzing patient data, monitoring disease outbreaks, and predicting patient outcomes.
- Marketing: Time series analysis is used for analyzing sales data, customer behavior, and predicting future trends.
- Manufacturing: Time series analysis is used for analyzing production data, monitoring equipment performance, and predicting maintenance needs.
Components of a Time Series
- Trend: Long-term movement in the data. “Are car model sales going up or down?”
- Seasonality: Regular pattern repeating over a known, fixed period. “Do sales of ice cream increase during the summer?”
- Cyclic: Long-term oscillations not of a fixed period. “Are there long-term business cycles in the sales of a product?”
- Irregular/Noise: Random variation. “Are there random variations in the daily temperature that are not explained by the trend or seasonality?”
Distinction between Time series and Other Types of Data
- Time Series Data: Data points collected or recorded at specific time intervals. Examples include stock prices, weather data, and sales figures.
- Cross-Sectional Data: Data collected at a single point in time, representing a snapshot. Examples include survey results and census data.
- Panel Data: A combination of time series and cross-sectional data, where multiple subjects are observed over time. Examples include longitudinal studies and repeated measures data.
Analysis and Forecasting
We’ll look at two main aspects of time series analysis:
- Analyzing historical data to answer questions about past behavior
- Forecasting future values based on past values
Time and Date Manipulation
Many time series data sets are indexed by date or time. The Python datetime library and the pandas library provide a powerful set of tools for manipulating time series data.
The Time Series chapter of the book Python for Data Analysis, 3rd Ed. provides a good overview of these tools. We’ll share a few excerpts here.
Code
import numpy as np
import pandas as pd
from datetime import datetime

now = datetime.now()
print(f"Date and time when this cell was executed: {now}")
print(f"Year: {now.year}, month: {now.month}, day: {now.day}")

delta = now - datetime(2024, 1, 1)
print(f"Since beginning of 2024 till when this cell was run there were {delta.days} days and {delta.seconds} seconds.")
Date and time when this cell was executed: 2024-11-21 11:41:16.594685
Year: 2024, month: 11, day: 21
Since beginning of 2024 till when this cell was run there were 325 days and 42076 seconds.
You can also convert between strings and datetime.
Code
# string to datetime
date_string = "2024-01-01"
date_object = datetime.strptime(date_string, "%Y-%m-%d")
print(date_object)
2024-01-01 00:00:00
You can also format datetime objects as strings.
Code
# datetime to string
now_str = now.strftime("%Y-%m-%d")
print(now_str)
2024-11-21
See Table 11.2 in the book for a list of formatting codes.
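As a quick illustration (a minimal sketch, not reproduced from the book's table), here are a few of the more common codes:

Code
from datetime import datetime

stamp = datetime(2024, 3, 7, 14, 30)
print(stamp.strftime("%Y-%m-%d %H:%M"))             # 2024-03-07 14:30
print(stamp.strftime("%d-%b-%Y"))                   # 07-Mar-2024 (%b is the locale's abbreviated month name)
print(datetime.strptime("07/03/2024", "%d/%m/%Y"))  # 2024-03-07 00:00:00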
Let’s explore some of the pandas time series tools.
Create a time series with a datetime index.
Code
longer_ts = pd.Series(np.random.standard_normal(1000),
                      index=pd.date_range("2022-01-01", periods=1000))
print(type(longer_ts))
longer_ts
<class 'pandas.core.series.Series'>
2022-01-01 -0.118239
2022-01-02 -0.392189
2022-01-03 1.323854
2022-01-04 -0.385299
2022-01-05 0.269278
...
2024-09-22 1.028256
2024-09-23 -0.351312
2024-09-24 0.220405
2024-09-25 0.877292
2024-09-26 -0.662885
Freq: D, Length: 1000, dtype: float64
We can access just the samples from 2023 simply with:
Code
"2023"] longer_ts[
2023-01-01 -1.719296
2023-01-02 -0.296890
2023-01-03 -2.154459
2023-01-04 -0.723442
2023-01-05 -0.464635
...
2023-12-27 -1.504241
2023-12-28 -1.560469
2023-12-29 -0.154077
2023-12-30 -0.049873
2023-12-31 -1.470796
Freq: D, Length: 365, dtype: float64
Or the month of September 2023:
Code
"2023-09"] longer_ts[
2023-09-01 0.519380
2023-09-02 0.593440
2023-09-03 0.642655
2023-09-04 0.810309
2023-09-05 0.630825
2023-09-06 -0.304449
2023-09-07 0.932837
2023-09-08 -0.732838
2023-09-09 -1.300680
2023-09-10 -0.352851
2023-09-11 0.219465
2023-09-12 -1.231226
2023-09-13 1.018619
2023-09-14 1.838647
2023-09-15 1.179987
2023-09-16 -1.092319
2023-09-17 0.154880
2023-09-18 1.362092
2023-09-19 0.103588
2023-09-20 -1.771907
2023-09-21 -0.009657
2023-09-22 -0.264263
2023-09-23 0.343354
2023-09-24 0.378273
2023-09-25 0.061034
2023-09-26 0.762233
2023-09-27 -0.958480
2023-09-28 -1.800196
2023-09-29 -0.284326
2023-09-30 -0.011528
Freq: D, dtype: float64
Or slice by date range:
Code
"2023-03-01":"2023-03-10"] longer_ts[
2023-03-01 -1.458369
2023-03-02 0.167586
2023-03-03 2.235363
2023-03-04 0.803110
2023-03-05 2.225331
2023-03-06 1.902983
2023-03-07 0.805966
2023-03-08 -1.534318
2023-03-09 2.051011
2023-03-10 -0.520951
Freq: D, dtype: float64
or:
Code
"2023-09-15":] longer_ts[
2023-09-15 1.179987
2023-09-16 -1.092319
2023-09-17 0.154880
2023-09-18 1.362092
2023-09-19 0.103588
...
2024-09-22 1.028256
2024-09-23 -0.351312
2024-09-24 0.220405
2024-09-25 0.877292
2024-09-26 -0.662885
Freq: D, Length: 378, dtype: float64
There are many more time series tools available that let you do things like the following (a few are sketched after this list):
- Shifting and setting frequencies of date ranges
- Time zone handling
- Time series resampling
- Time series rolling and expanding windows
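As a quick sketch of a few of these, using the longer_ts series from above (frequency aliases such as "ME" for month-end may vary with your pandas version):

Code
monthly_mean = longer_ts.resample("ME").mean()                   # downsample daily values to month-end means
diff_1d = longer_ts - longer_ts.shift(1)                         # shift by one day and take differences
rolling_30d = longer_ts.rolling(window=30).mean()                # 30-day rolling average
eastern = longer_ts.tz_localize("UTC").tz_convert("US/Eastern")  # attach and convert time zones
print(monthly_mean.head())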
Moving Window Functions
Let’s dive into the moving window functions.
Code
import pandas as pd
import yfinance as yf

# Load 10 years of AAPL stock prices into a dataframe
aapl_data = yf.download('AAPL', start='2012-01-01', end='2022-01-01')
print(aapl_data.head())

aapl_close_px = aapl_data['Close']
[*********************100%%**********************] 1 of 1 completed
Open High Low Close Adj Close Volume
Date
2012-01-03 14.621429 14.732143 14.607143 14.686786 12.388999 302220800
2012-01-04 14.642857 14.810000 14.617143 14.765714 12.455576 260022000
2012-01-05 14.819643 14.948214 14.738214 14.929643 12.593858 271269600
2012-01-06 14.991786 15.098214 14.972143 15.085714 12.725511 318292800
2012-01-09 15.196429 15.276786 15.048214 15.061786 12.705329 394024400
Code
# Plot the closing prices together with a 250-day moving average
ax = aapl_close_px.plot(label='AAPL')
aapl_close_px.rolling(window=250).mean().plot(ax=ax, label='250d MA')
ax.legend()
Visualization
Like with any data, it is important to visualize time series data to get a sense of various characteristics.
We’ll show a few examples of time series plots and include the code to generate them.
Air Passengers Dataset
We’re going to use a dataset of air passengers per month from 1949 to 1960.
Code
import os

path = os.path.join('data', 'air_passengers_1949_1960.csv')
air_passengers = pd.read_csv(path, index_col='Date', parse_dates=True)
air_passengers.head()
| Date       | Number of Passengers |
|------------|----------------------|
| 1949-01-01 | 112 |
| 1949-02-01 | 118 |
| 1949-03-01 | 132 |
| 1949-04-01 | 129 |
| 1949-05-01 | 121 |
Time Series Plot
Let’s look at the time series plot.
Code
ts = air_passengers['Number of Passengers']
ts.plot(ylabel='Number of Passengers', title='Air Passengers 1949-1960', figsize=(10, 4))
Clearly there are some trends and seasonality in the data.
One way to accentuate that is to use a two-sided plot.
Code
import matplotlib.pyplot as plt

x = air_passengers.index.values
y1 = air_passengers['Number of Passengers'].values

fig, ax = plt.subplots(1, 1, figsize=(10, 4))
plt.fill_between(x, y1=y1, y2=-y1, color='tab:blue', alpha=0.2)
plt.ylim(-800, 800)
plt.title('Air Passengers (Two-Sided View)')
plt.hlines(y=0, xmin=x[0], xmax=x[-1], color='black', linewidth=0.5)
plt.show()
Since there is a clear seasonal pattern in the data, we can plot the data by month.
Code
# Seasonal plot of air_passengers
import seaborn as sns

# Extract month and year from the index
air_passengers['Month'] = air_passengers.index.month
air_passengers['Year'] = air_passengers.index.year

# Create a seasonal plot
plt.figure(figsize=(10, 4))
sns.lineplot(data=air_passengers, x='Month', y='Number of Passengers', hue='Year', palette='tab10')
plt.title('Seasonal Plot of Air Passengers')
plt.ylabel('Number of Passengers')
plt.xlabel('Month')
plt.legend(title='Year', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()
The seasonality looks like it is increasing over time, but relatively speaking that might not be the case. We can normalize the data by the first month of each year.
Code
# Normalize the number of passengers by the first month of each year
air_passengers['Normalized_Passengers'] = air_passengers.groupby('Year')['Number of Passengers'].transform(lambda x: x / x.iloc[0])

# Create a seasonal plot with normalized values
plt.figure(figsize=(10, 4))
sns.lineplot(data=air_passengers, x='Month', y='Normalized_Passengers', hue='Year', palette='tab10')
plt.title('Seasonal Plot of Air Passengers (Normalized by First Month of Each Year)')
plt.ylabel('Normalized Number of Passengers')
plt.xlabel('Month')
plt.legend(title='Year', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()
Now the seasonality looks more similar across years, but perhaps it is still increasing a bit over time.
We can also look at the year-wise box plot.
Code
# Year-wise box plot of air passengers
plt.figure(figsize=(10, 4))
sns.boxplot(data=air_passengers, x='Year', y='Number of Passengers', hue='Year', palette='tab10', legend=False)
plt.title('Year-wise Box Plot of Air Passengers')
plt.ylabel('Number of Passengers')
plt.xlabel('Year')
plt.show()
You can see the trends, median, and interquartile range of the data by year.
We can also look at the seasonal subseries box plot that indicates the distribution of the data over the years for each month.
Code
# Draw seasonal subseries plots
import matplotlib.pyplot as plt
import seaborn as sns

# Create a subseries plot for each month
plt.figure(figsize=(10, 4))
sns.boxplot(data=air_passengers, x='Month', y='Number of Passengers', hue='Month', palette='tab10', legend=False)
plt.title('Seasonal Subseries Plot of Air Passengers')
plt.ylabel('Number of Passengers')
plt.xlabel('Month')
plt.show()
We can also look at the seasonal subseries plot that shows the trend over the years for each month.
Code
# Create seasonal subseries plot of monthly passengers
fig, axes = plt.subplots(1, 12, figsize=(10, 4), sharey=True)
fig.suptitle('Seasonal Subseries Plot of Monthly Air Passengers')

# List of abbreviated month names
month_names = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

# Iterate over each month and create a subplot
for i, ax in enumerate(axes.flatten(), start=1):
    monthly_data = air_passengers[air_passengers['Month'] == i]
    ax.plot(monthly_data['Year'], monthly_data['Number of Passengers'], marker='.')
    ax.set_title(month_names[i-1])
    ax.set_xticks([])  # Remove x-axis numbers
    # ax.set_xlabel('Year')
    # ax.set_ylabel('Number of Passengers')

plt.tight_layout(rect=[0, 0.03, 1, 0.95])
plt.show()
Lag Plot
Lag plots are scatter plots of the data shifted by a certain lag against the original data.
Code
from pandas.plotting import lag_plot

fig, axes = plt.subplots(3, 4, figsize=(15, 10))
fig.suptitle('Lag Plots of Air Passengers')

for lag in range(1, 13):
    ax = axes[(lag-1) // 4, (lag-1) % 4]
    lag_plot(air_passengers['Number of Passengers'], lag=lag, ax=ax)
    ax.set_title(f'Lag {lag}')

plt.tight_layout(rect=[0, 0.03, 1, 0.95])
plt.show()
Question: What do you observe about the lag plots?
Autocorrelation Function
For the next visualization, we will use the autocorrelation function (ACF).
- The autocorrelation function (ACF) measures the correlation between a time series and its lagged values.
- It helps to identify the extent to which current values of the series are related to its past values.
- The ACF is useful for identifying patterns such as seasonality and for determining the appropriate parameters for time series models like ARIMA.
ACF Definition
The autocorrelation function at lag \(k\) is defined as:
\[ \rho_k = \frac{\sum_{t=k+1}^{T} (y_t - \bar{y})(y_{t-k} - \bar{y})}{\sum_{t=1}^{T} (y_t - \bar{y})^2}, \]
where
- \(y_t\) is the value of the time series at time \(t\),
- \(\bar{y}\) is the mean of the time series,
- \(T\) is the total number of observations,
- \(k\) is the lag.
Interpretation:
- Peaks in Autocorrelation: Indicate that there are cyclic components.
- No Autocorrelation: Indicates that the values of the series are independent of each other.
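As a sanity check on the definition, here is a minimal sketch that computes \(\rho_k\) directly from the formula above and compares it with statsmodels' acf function, using the air_passengers data loaded earlier; the two values should agree up to floating-point precision.

Code
import numpy as np
from statsmodels.tsa.stattools import acf

def acf_at_lag(y, k):
    """Autocorrelation at lag k (k > 0), computed directly from the formula above."""
    y = np.asarray(y, dtype=float)
    y_bar = y.mean()
    numerator = np.sum((y[k:] - y_bar) * (y[:-k] - y_bar))
    denominator = np.sum((y - y_bar) ** 2)
    return numerator / denominator

y = air_passengers['Number of Passengers'].values
print(acf_at_lag(y, 12))     # manual rho_12
print(acf(y, nlags=12)[12])  # statsmodels' value for comparison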
ACF of Air Passengers
An interesting perspective is to plot the autocorrelation to see if there are any significant lags in the data.
Code
from statsmodels.graphics.tsaplots import plot_acf

# Draw the autocorrelation plot of air passengers
plot_acf(air_passengers['Number of Passengers'], lags=48)
plt.title('Autocorrelation Plot of Air Passengers')
plt.xlabel('Lags')
plt.ylabel('Autocorrelation')
plt.show()
The blue shaded area is the 95% confidence interval. The autocorrelation is significant when it is outside the confidence interval, e.g., the peak at 12.
Although we can see peaks at lags 24, 36, and 48, they are not statistically significant.
Components of Time Series
Let’s now try to decompose the time series into constituent components.
First we’ll define the components and then look at how we find them.
Trend
- Trend: Represents the long-term progression of the series. It can be increasing, decreasing, or constant over time.
- Examples:
- An upward trend in stock prices over several years.
- A downward trend in the sales of a product as it becomes obsolete.
Seasonality
- Seasonality: Refers to regular, predictable changes that recur every calendar year. It is a pattern that repeats over a known, fixed period.
- Examples:
- Increased retail sales during the holiday season.
- Higher ice cream sales during the summer months.
- Regular fluctuations in electricity consumption due to seasonal temperature changes.
Cyclic
- Cyclic: Refers to long-term oscillations or fluctuations in the data that are not of a fixed period. These cycles can vary in length and are often influenced by economic or business conditions.
- Examples:
- Business cycles with periods of expansion and contraction.
- Agricultural cycles influenced by factors such as weather and market conditions.
Irregular/Noise
- Irregular/Noise: Refers to the random variation in the data that cannot be attributed to trend, seasonality, or cyclic patterns. It is the residual variation after accounting for other components.
- Examples:
- Sudden spikes or drops in stock prices due to unexpected news.
- Random fluctuations in daily temperature readings.
- Unpredictable changes in sales figures due to unforeseen events.
Time Series Decomposition
Let’s look at how we can decompose the time series into its components.
Combination of components
We’ve identified the components of the time series: trend, seasonality, noise…
There are two main ways to think about how these components combine:
- Additive model
- Multiplicative model
Additive model
- Additive Model: Assumes that the components of the time series (trend, seasonality, and noise) are added together. The model can be represented as
- \(Y(t) = T(t) + S(t) + e(t)\)
- Where \(Y(t)\) is the observed value, \(T(t)\) is the trend component, \(S(t)\) is the seasonal component, and \(e(t)\) is the noise or error term.
- Example:
- Monthly sales data where the seasonal effect is constant over time.
Multiplicative model
- Multiplicative Model: Assumes that the components of the time series are multiplied together. The model can be represented as:
- \(Y(t) = T(t) \times S(t) \times e(t)\)
- Where \(Y(t)\) is the observed value, and \(T(t)\), \(S(t)\), and \(e(t)\) are the trend, seasonal, and noise components, respectively.
- Examples
- Monthly sales data where the seasonal effect increases or decreases proportionally with the trend.
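To see the difference, here is a small sketch with synthetic (made-up) data, where the same trend and seasonal pattern are combined additively and multiplicatively:

Code
import numpy as np
import matplotlib.pyplot as plt

t = np.arange(120)                           # ten years of monthly observations
trend = 100 + 2 * t                          # T(t): linear upward trend
seasonal = 10 * np.sin(2 * np.pi * t / 12)   # S(t): a 12-month cycle
noise = np.random.normal(0, 3, len(t))       # e(t): random variation

additive = trend + seasonal + noise                                # seasonal swing stays constant
multiplicative = trend * (1 + seasonal / 100) * (1 + noise / 100)  # seasonal swing grows with the trend

fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(8, 4), sharex=True)
ax1.plot(t, additive)
ax1.set_title('Additive: Y(t) = T(t) + S(t) + e(t)')
ax2.plot(t, multiplicative)
ax2.set_title('Multiplicative: Y(t) = T(t) x S(t) x e(t)')
plt.tight_layout()
plt.show()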
Decomposition Approaches
We’ll look at two approaches to decomposing the time series:
- Classical Decomposition
- STL Decomposition
Both approaches support additive and multiplicative models.
Classical Decomposition
- This technique involves breaking down a time series into its trend, seasonal, and irregular (or noise) components.
- It can be applied using either an additive or multiplicative model.
- Steps:
- Estimate the trend component by applying a moving average.
- Remove the trend component to get the detrended series.
- Estimate the seasonal component from the detrended series.
- Remove the seasonal component to get the irregular component.
Let’s take each step by step.
Estimating the Trend
First we’ll estimate the trend component by applying a moving average using the Pandas rolling
method.
Code
ts.plot(figsize=(8, 3), label='Monthly')
ts.rolling(window=11, center=True).mean().plot(label='11m MA')
plt.ylabel('Number of Passengers')
plt.title('Air Passengers 1949-1960')
plt.legend()
plt.show()
We’re actually using an 11-month moving average because centering the average on each month requires an odd-length window.
Removing the Trend
Now we can subtract the trend component to get the detrended series.
Code
detrended_ts = ts - ts.rolling(window=11, center=True).mean()
detrended_ts.plot(figsize=(8, 3), label='Detrended')
plt.ylabel('Number of Passengers')
plt.title('Air Passengers 1949-1960')
plt.legend()
plt.show()
Estimating Seasonality
Estimate the seasonal component by taking the mean of the detrended series for each month.
Code
seasonal_ts = detrended_ts.groupby(detrended_ts.index.month).mean()
seasonal_ts.plot(figsize=(8, 3), label='Seasonal')
plt.ylabel('Number of Passengers')
plt.title('Air Passengers 1949-1960')
plt.legend()
plt.show()
We can also look at the box plot of the detrended series for each month.
Code
# Box plot of the detrended series for each month
import seaborn as sns

# Create a DataFrame with the detrended series and the corresponding month
detrended_df = pd.DataFrame({'Detrended': detrended_ts, 'Month': detrended_ts.index.month})

# Plot the box plot
plt.figure(figsize=(8, 4))
sns.boxplot(x='Month', y='Detrended', data=detrended_df, hue='Month', palette='tab10', legend=False)
plt.xlabel('Month')
plt.ylabel('Detrended Number of Passengers')
plt.title('Box Plot of Detrended Air Passengers by Month')
plt.show()
Clearly, certain months have more variability than others.
Removing Seasonality
Now we can subtract the seasonal component to get the irregular component.
# Broadcast the seasonal component to the length of the detrended series
# and subtract it from the detrended series
irregular_ts = detrended_ts - np.tile(seasonal_ts, len(detrended_ts) // len(seasonal_ts))
And plot the residual irregular component.
Code
plt.figure(figsize=(8, 3))
irregular_ts.plot(label='Irregular', color='red')
plt.ylabel('Irregular')
plt.title('Irregular Component')
plt.legend()
plt.show()
Question: What do you observe about the irregular component? Is it truly random?
Finally, we can put the trend, detrended, seasonal, and irregular plots together.
Code
# Plot trend, detrended, seasonal, and irregular components
plt.figure(figsize=(8, 6))

# Plot the trend series
plt.subplot(4, 1, 1)
ts.rolling(window=11, center=True).mean().plot(label='Trend', color='black')
plt.ylabel('Trend')
plt.title('Trend Component')
plt.legend()

# Plot detrended series
plt.subplot(4, 1, 2)
detrended_ts.plot(label='Detrended', color='blue')
plt.ylabel('Detrended')
plt.title('Detrended Component')
plt.legend()

# Plot seasonal series
plt.subplot(4, 1, 3)
seasonal_ts.plot(label='Seasonal', color='green')
plt.ylabel('Seasonal')
plt.title('Seasonal Component')
plt.legend()

# Plot irregular series
plt.subplot(4, 1, 4)
irregular_ts.plot(label='Irregular', color='red')
plt.ylabel('Irregular')
plt.title('Irregular Component')
plt.legend()

plt.tight_layout()
plt.show()
Seasonality and Trend Decomposition (STL)
- STL is a more flexible and robust method for decomposing time series data.
- It uses locally weighted regression (Loess) to estimate the trend and seasonal components (Cleveland et al. 1990).
- Advantages:
- Handles any type of seasonality (e.g., weekly, monthly).
- Can deal with missing values and outliers.
- Provides smooth and adaptable trend and seasonal components.
See Course Notes for a more detailed description of Loess.
See also (Hyndman and Athanasopoulos 2021).
Locally Estimated Scatterplot Smoothing – Loess1
Loess, which stands for “Locally Estimated Scatterplot Smoothing,” is a non-parametric method used to estimate non-linear relationships in data. It is particularly useful for smoothing scatterplots and is a type of local regression.
Key Features of Loess:
Local Fitting: Loess fits simple models to localized subsets of the data to build up a function that describes the deterministic part of the variation in the data, point by point.
Weighted Least Squares: It uses weighted least squares to fit a polynomial surface to the data. The weights decrease with distance from the point of interest, giving more influence to points near the target point.
Flexibility: Loess is flexible and can model complex relationships without assuming a specific global form for the data. It can adapt to various shapes and patterns in the data.
Smoothing Parameter: The degree of smoothing is controlled by a parameter, often denoted as \(\alpha\) or the span. This parameter determines the proportion of data points used in each local fit. A smaller span results in a curve that follows the data more closely, while a larger span results in a smoother curve.
Polynomial Degree: Loess can fit either linear or quadratic polynomials to the data. The choice of polynomial degree affects the smoothness and flexibility of the fit.
How Loess Works:
- Step 1: For each point in the dataset, a neighborhood of points is selected based on the smoothing parameter.
- Step 2: A weighted least squares regression is performed on the points in the neighborhood, with weights decreasing with distance from the target point.
- Step 3: The fitted value at the target point is computed from the local regression model.
- Step 4: This process is repeated for each point in the dataset, resulting in a smooth curve that captures the underlying trend.
Applications:
Loess is widely used in exploratory data analysis to visualize trends and patterns in data. It is particularly useful when the relationship between variables is complex and not well-represented by a simple linear or polynomial model.
Example in Python:
In Python, the statsmodels library provides a function for performing Loess smoothing:
Code
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.nonparametric.smoothers_lowess import lowess

# Example data
x = np.linspace(0, 10, 100)
y = np.sin(x) + np.random.normal(0, 0.1, 100)

# Apply Loess smoothing
smoothed = lowess(y, x, frac=0.2)

# Plot
plt.scatter(x, y, label='Data', alpha=0.5)
plt.plot(smoothed[:, 0], smoothed[:, 1], color='red', label='Loess Smoothed')
plt.legend()
plt.show()
In this example, frac is the smoothing parameter that controls the amount of smoothing applied to the data.
Practical example using statsmodels
Let’s demonstrate a practical example of time series decomposition using the statsmodels routines seasonal_decompose and STL.
Steps:
- Load a sample time series dataset.
- Apply a decomposition technique (e.g., classical decomposition or STL).
- Visualize the decomposed components (trend, seasonal, and residual).
Air Passengers Dataset
Let’s load the Air Passengers dataset again.
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose

# Load a sample time series dataset
data = pd.read_csv(os.path.join('data', 'air_passengers_1949_1960.csv'), index_col='Date', parse_dates=True)
ts = data['Number of Passengers']
statsmodels Additive Model
We’ll apply the classical decomposition using the statsmodels seasonal_decompose function with the additive model.
Code
# Apply classical decomposition
decomposition = seasonal_decompose(ts, model='additive')

# Plot the decomposed components
fig, (ax1, ax2, ax3, ax4) = plt.subplots(4, 1, figsize=(8, 8))
decomposition.observed.plot(ax=ax1)
ax1.set_ylabel('Observed')
ax1.set_title('Air Passengers 1949-1960')
decomposition.trend.plot(ax=ax2)
ax2.set_ylabel('Trend')
decomposition.seasonal.plot(ax=ax3)
ax3.set_ylabel('Seasonal')
decomposition.resid.plot(ax=ax4)
ax4.set_ylabel('Residual')

plt.tight_layout()
plt.show()
Does this look similar to the classical decomposition we did earlier?
statsmodels Multiplicative Model
Let’s try the multiplicative model:
Code
# Apply classical decomposition
decomposition = seasonal_decompose(ts, model='multiplicative')

# Plot the decomposed components
fig, (ax1, ax2, ax3, ax4) = plt.subplots(4, 1, figsize=(8, 8))
decomposition.observed.plot(ax=ax1)
ax1.set_ylabel('Observed')
decomposition.trend.plot(ax=ax2)
ax2.set_ylabel('Trend')
decomposition.seasonal.plot(ax=ax3)
ax3.set_ylabel('Seasonal')
decomposition.resid.plot(ax=ax4)
ax4.set_ylabel('Residual')

plt.tight_layout()
plt.show()
Question: What differences do you see between multiplicative and additive models? What can you infer from that?
STL Decomposition
The STL decomposition is more flexible and can handle the seasonality in the data better.
Let’s try the STL decomposition:
Code
from statsmodels.tsa.seasonal import STL

stl = STL(ts, period=12, robust=True)
result = stl.fit()
# result.plot()

# Plot the decomposed components
fig, (ax1, ax2, ax3, ax4) = plt.subplots(4, 1, figsize=(8, 6))
result.observed.plot(ax=ax1)
ax1.set_ylabel('Observed')
ax1.set_title('Air Passengers 1949-1960 (STL)')
result.trend.plot(ax=ax2)
ax2.set_ylabel('Trend')
result.seasonal.plot(ax=ax3)
ax3.set_ylabel('Seasonal')
result.resid.plot(ax=ax4)
ax4.set_ylabel('Residual')

plt.tight_layout()
plt.show()
Stationarity and Differencing
Definition of Stationarity
A time series is said to be stationary if its statistical properties such as mean, variance, and autocorrelation are constant over time.
Types of Stationarity:
- Strict Stationarity: The entire distribution of the time series remains unchanged over time.
- Weak Stationarity: Only the first two moments (mean and variance) are constant over time, and the covariance between two time points depends only on the time lag between them.
Importance of Stationarity
Stationarity is crucial in time series analysis because many statistical methods and models assume that the time series is stationary. Non-stationary data can lead to misleading results and poor forecasts.
Why Stationarity Matters
- Model Assumptions: Many time series models, such as ARIMA, require the data to be stationary to make accurate predictions.
- Statistical Properties: Stationary time series have consistent statistical properties over time, making it easier to analyze and interpret.
- Forecasting: Stationary data improves the reliability and accuracy of forecasts.
Techniques to Achieve Stationarity
- Differencing: Subtracting the previous observation from the current observation to remove trends and seasonality.
- Transformation: Applying mathematical transformations such as logarithms or square roots to stabilize the variance.
- Decomposition: Separating the time series into trend, seasonal, and residual components to analyze and adjust each part individually.
- Smoothing: Using techniques like moving averages to smooth out short-term fluctuations and highlight longer-term trends.
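For example, the first two techniques applied to the Air Passengers series ts from earlier might look like this (a sketch; the log transform and differencing orders are illustrative choices):

Code
import numpy as np
import matplotlib.pyplot as plt

log_ts = np.log(ts)                            # transformation: log stabilizes the growing variance
diff_ts = log_ts.diff().dropna()               # differencing: removes the trend
seasonal_diff_ts = diff_ts.diff(12).dropna()   # seasonal differencing: removes the yearly pattern

seasonal_diff_ts.plot(figsize=(8, 3), title='Log-transformed and differenced Air Passengers')
plt.show()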
Tests for Stationarity
There are tests for stationarity, such as those listed below, which are available in statsmodels.
- ADF Test (Augmented Dickey-Fuller Test): A statistical test used to determine if a time series is stationary. It tests the null hypothesis that a unit root2 is present in the time series.
- KPSS Test (Kwiatkowski-Phillips-Schmidt-Shin Test): Another test for stationarity that tests the null hypothesis that the time series is stationary around a deterministic trend.
The course notes have examples of both of these tests applied to the Air Passengers dataset.
ADF Test Demonstration
Let’s apply the tests, starting with ADF:
Code
import pandas as pd
from statsmodels.tsa.stattools import adfuller, kpss
air_passengers = pd.read_csv(os.path.join('data', 'air_passengers_1949_1960.csv'))

# Apply ADF and KPSS tests on the air passenger data
# ADF Test
result_adf = adfuller(air_passengers['Number of Passengers'])
print('ADF Statistic:', result_adf[0])
print('p-value:', result_adf[1])
print('Critical Values:')
for key, value in result_adf[4].items():
    print('\t%s: %.3f' % (key, value))
ADF Statistic: 0.815368879206052
p-value: 0.991880243437641
Critical Values:
1%: -3.482
5%: -2.884
10%: -2.579
Interpretation:
ADF Statistic: The ADF statistic is 0.815, which is greater than all the critical values at the 1%, 5%, and 10% significance levels.
p-value: The p-value is 0.991, which is significantly higher than common significance levels (e.g., 0.01, 0.05, 0.10).
Critical Values:
- 1%: -3.482
- 5%: -2.884
- 10%: -2.579
Conclusion:
Fail to Reject the Null Hypothesis: Since the ADF statistic (0.815) is greater than the critical values and the p-value (0.991) is much higher than typical significance levels, you fail to reject the null hypothesis. This suggests that the time series has a unit root and is non-stationary.
Implication: The time series data for the number of passengers is non-stationary, indicating that it may have a trend or other non-stationary components. To make the series stationary, you might consider differencing the data or applying other transformations, such as detrending or seasonal adjustment, before proceeding with further analysis or modeling.
KPSS Test Demonstration
Let’s apply the KPSS test.
Code
# KPSS Test
result_kpss = kpss(air_passengers['Number of Passengers'], regression='c')
print('\nKPSS Statistic:', result_kpss[0])
print('p-value:', result_kpss[1])
print('Critical Values:')
for key, value in result_kpss[3].items():
    print('\t%s: %.3f' % (key, value))
KPSS Statistic: 1.6513122354165206
p-value: 0.01
Critical Values:
10%: 0.347
5%: 0.463
2.5%: 0.574
1%: 0.739
/var/folders/ly/jkydg4dj2vs93b_ds7yp5t7r0000gn/T/ipykernel_26421/3460980279.py:2: InterpolationWarning: The test statistic is outside of the range of p-values available in the
look-up table. The actual p-value is smaller than the p-value returned.
result_kpss = kpss(air_passengers['Number of Passengers'], regression='c')
The KPSS (Kwiatkowski-Phillips-Schmidt-Shin) test is another test used to assess the stationarity of a time series, but it has a different null hypothesis compared to the ADF test.
KPSS Test Interpretation:
Null Hypothesis (\(H_0\)): The null hypothesis of the KPSS test is that the time series is stationary around a deterministic trend (i.e., it does not have a unit root).
Alternative Hypothesis (\(H_1\)): The alternative hypothesis is that the time series is not stationary (i.e., it has a unit root).
Given Results:
- KPSS Statistic: 1.651
- p-value: 0.01
- Critical Values:
- 10%: 0.347
- 5%: 0.463
- 2.5%: 0.574
- 1%: 0.739
Conclusion:
Reject the Null Hypothesis: The KPSS statistic (1.651) is greater than all the critical values at the 10%, 5%, 2.5%, and 1% significance levels. This, along with the low p-value (0.01), suggests that you reject the null hypothesis of stationarity.
Implication: The time series is likely non-stationary according to the KPSS test. This aligns with the ADF test results, which also indicated non-stationarity.
Overall Interpretation:
Both the ADF and KPSS tests suggest that the time series is non-stationary. This consistent result from both tests strengthens the conclusion that the series may need differencing or other transformations to achieve stationarity before further analysis or modeling.
Time Series Models
In order to forecast future data values, we will now define a family of models and then apply one to our data.
Autoregressive (AR) models
A type of time series model where the current value of the series is based on its previous values.
The model is defined as
\[ y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \ldots + \phi_p y_{t-p} + \epsilon_t, \]
where
- \(y_t\) is the current value,
- \(c\) is a constant,
- \(\phi_1, \phi_2, \ldots, \phi_p\) are the coefficients,
- \(\epsilon_t\) is the white noise error term.
Key Points:
- Lag Order (p): The number of lagged observations included in the model.
- Stationarity: The series should be stationary for the AR model to be effective.
- Parameter Estimation: Methods like Yule-Walker equations or Maximum Likelihood Estimation (MLE) are used to estimate the parameters, \(\phi_i\).
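As a quick illustration (a sketch with simulated data, not part of the original notes), we can simulate a stationary AR(2) process and recover its coefficients with statsmodels' AutoReg:

Code
import numpy as np
from statsmodels.tsa.ar_model import AutoReg

# Simulate a stationary AR(2) process and recover its coefficients
np.random.seed(0)
n = 500
phi1, phi2 = 0.6, -0.3
y = np.zeros(n)
for t in range(2, n):
    y[t] = phi1 * y[t - 1] + phi2 * y[t - 2] + np.random.normal()

ar_fit = AutoReg(y, lags=2).fit()
print(ar_fit.params)  # constant followed by estimates of phi_1 and phi_2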
Moving Average (MA) models
A type of time series model where the current value of the series is based on past forecast errors. The model is defined as
\[ y_t = \mu + \epsilon_t + \theta_1 \epsilon_{t-1} + \theta_2 \epsilon_{t-2} + \ldots + \theta_q \epsilon_{t-q}, \]
where
- \(y_t\) is the current value,
- \(\mu\) is the mean of the series,
- \(\epsilon_t\) is the white noise error term (i.e., the residual at time \(t\)),
- \(\theta_1, \theta_2, \ldots, \theta_q\) are the coefficients.
Key Points:
- Lag Order (q): The number of lagged forecast errors included in the model.
- Stationarity: The series should be stationary for the MA model to be effective.
- Parameter Estimation: Methods like Maximum Likelihood Estimation (MLE) are used to estimate the parameters \(\theta_i\).
Autoregressive Integrated Moving Average (ARIMA) Models
A type of time series model that combines Autoregressive (AR) and Moving Average (MA) models with differencing to make the series stationary.
The model, applied to the series after differencing \(d\) times, is defined as
\[ y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \ldots + \phi_p y_{t-p} + \epsilon_t + \theta_1 \epsilon_{t-1} + \theta_2 \epsilon_{t-2} + \ldots + \theta_q \epsilon_{t-q}, \]
where
- \(y_t\) is the current value,
- \(c\) is a constant,
- \(\phi_1, \phi_2, \ldots, \phi_p\) are the AR coefficients,
- \(\epsilon_t\) is the white noise error term,
- \(\theta_1, \theta_2, \ldots, \theta_q\) are the MA coefficients.
Key Points:
- Order (p, d, q): The parameters of the ARIMA model where \(p\) is the number of lag observations, \(d\) is the degree of differencing, and \(q\) is the size of the moving average window.
- Stationarity: Differencing is used to make the series stationary.
- Parameter Estimation: Methods like Maximum Likelihood Estimation (MLE) are used to estimate the parameters.
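As a sketch, here is an ARIMA(1, 1, 1) fit to the log of the Air Passengers series ts from earlier using statsmodels. Seasonality is ignored here (the SARIMA model below handles it), and a frequency warning may appear because the index has no explicit frequency.

Code
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# Fit a non-seasonal ARIMA(1, 1, 1) to the log-transformed series
log_ts = np.log(ts)
arima_fit = ARIMA(log_ts, order=(1, 1, 1)).fit()
print(f"AIC: {arima_fit.aic:.1f}")
print(arima_fit.forecast(steps=12))  # 12 months ahead, on the log scale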
Seasonal ARIMA (SARIMA) Models
A type of time series model that extends ARIMA to support seasonality. The model is defined as
\[ \begin{aligned} y_t &= c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \ldots + \phi_p y_{t-p} + \epsilon_t \\ &\quad + \theta_1 \epsilon_{t-1} + \theta_2 \epsilon_{t-2} + \ldots + \theta_q \epsilon_{t-q} \\ &\quad + \Phi_1 Y_{t-s} + \Phi_2 Y_{t-2s} + \ldots + \Phi_P Y_{t-Ps} \\ &\quad + \Theta_1 E_{t-s} + \Theta_2 E_{t-2s} + \ldots + \Theta_Q E_{t-Qs} \end{aligned} \]
where
- \(y_t\) is the current value,
- \(c\) is a constant,
- \(\phi_1, \phi_2, \ldots, \phi_p\) are the AR coefficients,
- \(\epsilon_t\) is the white noise error term,
- \(\theta_1, \theta_2, \ldots, \theta_q\) are the MA coefficients,
- \(\Phi_1, \Phi_2, \ldots, \Phi_P\) are the seasonal AR coefficients,
- \(Y_{t-s}, Y_{t-2s}, \ldots, Y_{t-Ps}\) are the seasonal lagged observations,
- \(\Theta_1, \Theta_2, \ldots, \Theta_Q\) are the seasonal MA coefficients,
- \(E_{t-s}, E_{t-2s}, \ldots, E_{t-Qs}\) are the seasonal lagged forecast errors.
Key Points:
- Order (p, d, q) x (P, D, Q, s): The parameters of the SARIMA model where \(p, d, q\) are the non-seasonal parameters, \(P, D, Q\) are the seasonal parameters, and \(s\) is the length of the seasonal cycle.
- Stationarity: Differencing is used to make the series stationary.
- Parameter Estimation: Methods like Maximum Likelihood Estimation (MLE) are used to estimate the parameters.
SARIMA Model Forecast
Let’s demonstrate forecasting with a SARIMA model.
Code
import pandas as pd
import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX
import matplotlib.pyplot as plt
We’ll load the dataset and show the head.
Code
path = os.path.join('data', 'air_passengers_1949_1960.csv')
data = pd.read_csv(path)
data.head()
|   | Date    | Number of Passengers |
|---|---------|----------------------|
| 0 | 1949-01 | 112 |
| 1 | 1949-02 | 118 |
| 2 | 1949-03 | 132 |
| 3 | 1949-04 | 129 |
| 4 | 1949-05 | 121 |
Next, we use the pandas date_range method to create a new month column. Alternatively, we could have used date string conversion tools on the Date column.
Code
data['Month'] = pd.date_range(start='1949-01', periods=len(data), freq='ME')
data.set_index('Month', inplace=True)

data.head()
| Month      | Date    | Number of Passengers |
|------------|---------|----------------------|
| 1949-01-31 | 1949-01 | 112 |
| 1949-02-28 | 1949-02 | 118 |
| 1949-03-31 | 1949-03 | 132 |
| 1949-04-30 | 1949-04 | 129 |
| 1949-05-31 | 1949-05 | 121 |
As a next step, we’ll log transform the data to stabilize the variance.
Code
# Log transform to stabilize variance
data['Log_Passengers'] = np.log(data['Number of Passengers'])
data.head()
| Month      | Date    | Number of Passengers | Log_Passengers |
|------------|---------|----------------------|----------------|
| 1949-01-31 | 1949-01 | 112 | 4.718499 |
| 1949-02-28 | 1949-02 | 118 | 4.770685 |
| 1949-03-31 | 1949-03 | 132 | 4.882802 |
| 1949-04-30 | 1949-04 | 129 | 4.859812 |
| 1949-05-31 | 1949-05 | 121 | 4.795791 |
Now let’s create an instance of a SARIMAX model – Seasonal Autoregressive Integrated Moving Average with eXogenous regressors.
It extends the SARIMA model to support exogenous variables, although we aren’t using any.
# SARIMA model
model = SARIMAX(data['Log_Passengers'],
                order=(1, 1, 1),
                seasonal_order=(1, 1, 1, 12),
                freq='ME')
/Users/tomg/Source/courses/tools4ds/DS701-Course-Notes/.venv/lib/python3.12/site-packages/statsmodels/tsa/base/tsa_model.py:473: ValueWarning: No frequency information was provided, so inferred frequency ME will be used.
self._init_dates(dates, freq)
The configuration is as follows:
- We’ll use Log_Passengers as the time series, which has better variance properties.
- order=(1, 1, 1) means we’re using an ARIMA model with 1 autoregressive term, 1 differencing term, and 1 moving average term.
- seasonal_order=(1, 1, 1, 12) means we’re using a seasonal ARIMA model with 1 autoregressive term, 1 differencing term, 1 moving average term, and a seasonal cycle of 12 months.
- The frequency is the month-end, or ME.
We’ll fit the model which also prints the fit summary.
Code
results = model.fit()
RUNNING THE L-BFGS-B CODE
* * *
Machine precision = 2.220D-16
N = 5 M = 10
At X0 0 variables are exactly at the bounds
At iterate 0 f= -1.66507D+00 |proj g|= 5.05756D+00
This problem is unconstrained.
At iterate 5 f= -1.68182D+00 |proj g|= 8.72738D-01
At iterate 10 f= -1.69016D+00 |proj g|= 7.99143D-02
At iterate 15 f= -1.69026D+00 |proj g|= 2.95076D-01
At iterate 20 f= -1.69545D+00 |proj g|= 1.54990D+00
At iterate 25 f= -1.69767D+00 |proj g|= 5.43471D-02
At iterate 30 f= -1.69818D+00 |proj g|= 6.44974D-02
At iterate 35 f= -1.70189D+00 |proj g|= 2.19161D-01
At iterate 40 f= -1.70244D+00 |proj g|= 7.77238D-03
* * *
Tit = total number of iterations
Tnf = total number of function evaluations
Tnint = total number of segments explored during Cauchy searches
Skip = number of BFGS updates skipped
Nact = number of active bounds at final generalized Cauchy point
Projg = norm of the final projected gradient
F = final function value
* * *
N Tit Tnf Tnint Skip Nact Projg F
5 41 55 1 0 0 7.116D-03 -1.702D+00
F = -1.7024430820229224
CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH
In the Course Notes version, there are also model results summaries.
We’ll print the summary and plot the diagnostics.
Code
# Summary and diagnostics
print(results.summary())
SARIMAX Results
==========================================================================================
Dep. Variable: Log_Passengers No. Observations: 144
Model: SARIMAX(1, 1, 1)x(1, 1, 1, 12) Log Likelihood 245.152
Date: Thu, 21 Nov 2024 AIC -480.304
Time: 11:41:21 BIC -465.928
Sample: 01-31-1949 HQIC -474.462
- 12-31-1960
Covariance Type: opg
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
ar.L1 0.1647 0.213 0.775 0.439 -0.252 0.582
ma.L1 -0.5602 0.184 -3.037 0.002 -0.922 -0.199
ar.S.L12 -0.0998 0.197 -0.507 0.612 -0.486 0.286
ma.S.L12 -0.4966 0.210 -2.364 0.018 -0.908 -0.085
sigma2 0.0013 0.000 8.471 0.000 0.001 0.002
===================================================================================
Ljung-Box (L1) (Q): 0.01 Jarque-Bera (JB): 3.44
Prob(Q): 0.92 Prob(JB): 0.18
Heteroskedasticity (H): 0.61 Skew: -0.01
Prob(H) (two-sided): 0.11 Kurtosis: 3.79
===================================================================================
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
And let’s plot some diagnostic information.
Code
results.plot_diagnostics(figsize=(12, 8))
plt.show()
Explanation of SARIMAX Diagnostics
The plot_diagnostics function provides several diagnostic plots to evaluate the fit of the SARIMAX model. Here is an explanation of each panel:
- Standardized Residuals:
- This plot shows the residuals (differences between the observed and predicted values) standardized to have zero mean and unit variance.
- Ideally, the residuals should appear as white noise, meaning they should be randomly scattered around zero with no discernible pattern.
- Histogram plus KDE:
- This panel displays a histogram of the residuals along with a Kernel Density Estimate (KDE) to visualize the distribution.
- The histogram should resemble a normal distribution if the model is well-fitted. The KDE line helps to see the shape of the distribution more clearly.
- Normal Q-Q Plot:
- The Q-Q plot compares the quantiles of the residuals to the quantiles of a standard normal distribution.
- If the residuals are normally distributed, the points should lie approximately along the 45-degree line.
- Correlogram (ACF Plot):
- This plot shows the autocorrelation function (ACF) of the residuals.
- The ACF plot helps to identify any remaining autocorrelation in the residuals. Ideally, the residuals should have no significant autocorrelation, indicating that the model has captured all the temporal dependencies.
Finally, let’s make a forecast for 2 years out.
Code
# Forecasting
forecast = results.get_forecast(steps=24)
forecast_index = pd.date_range(data.index[-1] + pd.DateOffset(months=1), periods=24, freq='ME')
forecast_values = np.exp(forecast.predicted_mean)  # Convert back from log
confidence_intervals = np.exp(forecast.conf_int())

# Plot
plt.figure(figsize=(10, 6))
plt.plot(data['Number of Passengers'], label='Observed')
plt.plot(forecast_index, forecast_values, label='Forecast', color='red')
plt.fill_between(forecast_index, confidence_intervals.iloc[:, 0], confidence_intervals.iloc[:, 1], color='pink', alpha=0.3)
plt.legend()
plt.show()
The light pink area shows the 95% confidence interval for the forecast.
Question: What observations do you have about the prediction?
Model Evaluation and Forecasting
Criteria for Model Selection
AIC (Akaike Information Criterion): A measure of the relative quality of a statistical model for a given set of data. It is defined as:
\[ \text{AIC} = T\log\left(\frac{\text{SSE}}{T}\right) + 2(k+2) \]
where
- \(\text{SSE}\) is the sum of squared errors, measuring the fit of the model,
- \(T\) is the number of observations,
- \(k\) is the number of parameters in the model.
BIC (Bayesian Information Criterion): Similar to AIC but includes a penalty term for the number of parameters in the model. It is defined as:
\[ \text{BIC} = T\log\left(\frac{\text{SSE}}{T}\right) + (k+2)\log(T) \]
where
- \(\text{SSE}\) is the sum of squared errors, measuring the fit of the model,
- \(T\) is the number of observations,
- \(k\) is the number of parameters in the model.
Key Points:
- Model Comparison: Both AIC and BIC are used to compare different models; the model with the lower AIC or BIC is preferred.
- Penalty for Complexity: BIC imposes a larger penalty for models with more parameters compared to AIC.
- Trade-off: There is a trade-off between goodness of fit and model complexity.
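As a sketch of model comparison, we can refit the SARIMAX model from the earlier example with a few candidate orders and compare AIC and BIC (the candidate orders are arbitrary illustrative choices, and statsmodels computes AIC/BIC from the likelihood rather than the SSE form above):

Code
# Compare a few candidate (p, d, q) orders by AIC and BIC; lower is better
for order in [(0, 1, 1), (1, 1, 0), (1, 1, 1)]:
    candidate = SARIMAX(data['Log_Passengers'], order=order,
                        seasonal_order=(1, 1, 1, 12)).fit(disp=False)
    print(order, 'AIC:', round(candidate.aic, 1), 'BIC:', round(candidate.bic, 1))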
Cross-Validation Techniques
- Time Series Cross-Validation: Unlike traditional cross-validation, time series cross-validation respects the temporal order of the data.
Common techniques include:
- Rolling Forecast Origin: The training set is expanded with each iteration, and the model is re-evaluated.
- Time Series Split: The data is split into multiple training and test sets, ensuring that the training set always precedes the test set.
- Blocked Cross-Validation: The data is divided into blocks, and each block is used as a test set while the preceding blocks are used for training.
Key Points:
- Respect Temporal Order: Ensure that the training set always precedes the test set to avoid data leakage.
- Multiple Techniques: Various techniques like rolling forecast origin, time series split, and blocked cross-validation can be used.
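One way to set up a time series split is with scikit-learn's TimeSeriesSplit (an added dependency, not used elsewhere in these notes). This sketch only prints the index ranges, to show that each training window always precedes its test window:

Code
from sklearn.model_selection import TimeSeriesSplit

y = data['Log_Passengers'].values
tscv = TimeSeriesSplit(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(tscv.split(y)):
    # each training window ends before its test window begins, so there is no leakage
    print(f"Fold {fold}: train [{train_idx[0]}..{train_idx[-1]}], test [{test_idx[0]}..{test_idx[-1]}]")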
Forecasting and Confidence Intervals
- Forecasting: The process of making predictions about future values based on historical data. We can use the previously introduced methods such as ARIMA, but you can also use Exponential Smoothing, a technique that applies decreasing weights to past observations, giving more importance to recent data. With exponential smoothing, the forecast is a weighted average of past observations (see the sketch after this list).
- Confidence Intervals: Provide an estimate of the uncertainty associated with the forecast.
- Calculation: Confidence intervals are typically calculated using the standard error of the forecast and a critical value from the t-distribution or normal distribution.
- Interpretation: A 95% confidence interval means that there is a 95% chance that the true value will fall within the interval.
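As a sketch of exponential smoothing (Holt-Winters) applied to the Air Passengers data frame from the SARIMA example. Note that this class returns point forecasts; prediction intervals would require simulation or a state-space formulation.

Code
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Holt-Winters exponential smoothing: additive trend, multiplicative seasonality
hw_model = ExponentialSmoothing(data['Number of Passengers'],
                                trend='add', seasonal='mul',
                                seasonal_periods=12).fit()
hw_forecast = hw_model.forecast(steps=24)  # point forecasts, two years ahead
print(hw_forecast.head())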
Recap and References
Recap
We covered the following topics:
- Components of Time Series
- Time Series Decomposition
- Stationarity and Differencing
- Model Selection and Evaluation
- Forecasting and Confidence Intervals
References
Bibliography
Cleveland, R. B., Cleveland, W. S., McRae, J. E., and Terpenning, I. (1990). “STL: A Seasonal-Trend Decomposition Procedure Based on Loess.” Journal of Official Statistics, 6(1), 3–73.
Hyndman, R. J., and Athanasopoulos, G. (2021). Forecasting: Principles and Practice, 3rd edition. OTexts: Melbourne, Australia. OTexts.com/fpp3.
Footnotes
gpt-4o, personal communication, Nov 2024↩︎
Unit root relates to stability in autoregressive models. AR models can be expressed as polynomial equations and the value of the roots of the polynomial determine the stability of the model after perturbation. Roots on the unit circle will indicate that perturbations will have a permanent effect, making the model non-stationary.↩︎