Introduction to Time Series Analysis
Time series analysis is a statistical technique used to analyze and model data points collected over time. It involves studying the patterns, trends, and seasonal variations in the data to make predictions about future values.
Characteristics of Time Series Data
- Trend: A long-term increase or decrease in the data
- Seasonality: Regular patterns that repeat over a specific period (e.g., daily, monthly, yearly)
- Cyclical Patterns: Fluctuations that occur over longer periods than seasonality
- Irregularity: Random variations or noise in the data
- Stationarity: Statistical properties (mean, variance) that remain constant over time
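These components can be made concrete with a small synthetic series; a sketch in which the trend, seasonal, and noise terms are made-up choices for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 120  # ten years of monthly observations

trend = np.linspace(0, 10, n)                         # long-term increase
seasonal = 2 * np.sin(2 * np.pi * np.arange(n) / 12)  # pattern repeating every 12 points
noise = rng.normal(0, 0.5, n)                         # irregular component

series = pd.Series(
    trend + seasonal + noise,
    index=pd.date_range('2015-01-01', periods=n, freq='MS'),
)
print(series.head())
```

Plotting `series` shows all three effects at once; decomposition techniques later in this section aim to recover exactly these pieces from real data.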
Applications of Time Series Analysis
- Financial forecasting (stock prices, market trends)
- Weather and climate prediction
- Demand forecasting (retail, manufacturing)
- Energy consumption prediction
- Economic indicators analysis
- Healthcare data monitoring
Time Series Data Preprocessing
Before analyzing time series data, it's important to preprocess it to ensure quality and suitability for modeling.
Data Loading and Exploration
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Load time series data
df = pd.read_csv('time_series_data.csv', parse_dates=['date'], index_col='date')
# Explore data
print(df.head())
print(df.info())
print(df.describe())
# Plot the time series
plt.figure(figsize=(12, 6))
plt.plot(df)
plt.title('Time Series Data')
plt.xlabel('Date')
plt.ylabel('Value')
plt.show()
Handling Missing Values
# Check for missing values
print(df.isnull().sum())

# Fill missing values with forward fill
df_filled = df.ffill()

# Fill missing values with backward fill
df_filled = df.bfill()

# Fill missing values with the column mean
df_filled = df.fillna(df.mean())

# Interpolate missing values
df_filled = df.interpolate()
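The fill strategies above behave quite differently, which is easiest to see on a tiny series with a two-value gap:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])

print(s.ffill().tolist())           # [1.0, 1.0, 1.0, 4.0]  carries the last value forward
print(s.bfill().tolist())           # [1.0, 4.0, 4.0, 4.0]  pulls the next value backward
print(s.fillna(s.mean()).tolist())  # [1.0, 2.5, 2.5, 4.0]  constant fill, ignores ordering
print(s.interpolate().tolist())     # [1.0, 2.0, 3.0, 4.0]  linear between known points
```

For smooth time series, interpolation usually gives the most plausible values; forward fill is the safe default when future information must not leak backward in time.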
Resampling Time Series Data
# Resample to monthly frequency (mean)
monthly_data = df.resample('M').mean()
# Resample to quarterly frequency (sum)
quarterly_data = df.resample('Q').sum()
# Resample to annual frequency (max)
annual_data = df.resample('A').max()
# Note: pandas 2.2+ prefers the 'ME', 'QE', and 'YE' aliases for these frequencies
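Resampling also works in the upsampling direction (e.g., monthly to daily), where the new rows start out as NaN and typically need interpolation; a sketch with made-up monthly values:

```python
import pandas as pd

# A small monthly series (hypothetical values, for illustration)
monthly = pd.Series(
    [10.0, 12.0, 11.0, 15.0],
    index=pd.to_datetime(['2024-01-31', '2024-02-29', '2024-03-31', '2024-04-30']),
)

# Upsampling inserts one row per day; the new rows are NaN until filled
daily = monthly.resample('D').asfreq()
daily_filled = daily.interpolate()
print(daily_filled.head())
```

`asfreq()` makes the gap explicit; `interpolate()` then draws straight lines between the known monthly points.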
Stationarity Testing
Many time series models require the data to be stationary. We can test for stationarity using the Augmented Dickey-Fuller (ADF) test.
from statsmodels.tsa.stattools import adfuller
# ADF test
result = adfuller(df['value'])
print('ADF Statistic:', result[0])
print('p-value:', result[1])
print('Critical Values:', result[4])
# Interpret the result
if result[1] < 0.05:
    print('Data is stationary')
else:
    print('Data is not stationary')
Making Data Stationary
# Differencing to remove trend
df_diff = df.diff().dropna()

# Log transformation to stabilize variance
df_log = np.log(df)
df_log_diff = df_log.diff().dropna()

# Seasonal differencing (lag 12 for monthly data)
df_seasonal_diff = df.diff(12).dropna()
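Why differencing helps can be seen on a synthetic trending series: the first difference of a linear trend is roughly constant, so the differenced series has a stable mean. A sketch with made-up numbers:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
trend_series = 0.5 * np.arange(n) + rng.normal(0, 1, n)  # slope 0.5 plus noise

# First difference, the NumPy analogue of df.diff().dropna()
diff = np.diff(trend_series)

# The level keeps growing, but the differenced series hovers around the slope
print(round(float(trend_series.mean()), 2))
print(round(float(diff.mean()), 2))  # close to the slope, 0.5
```

Re-running the ADF test on the differenced series should now reject the unit-root null; the number of differencing passes needed is the d in ARIMA's (p, d, q).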
Time Series Analysis Techniques
Decomposition
Time series decomposition separates a time series into its components: trend, seasonality, and residual.
from statsmodels.tsa.seasonal import seasonal_decompose

# Decompose the time series
result = seasonal_decompose(df['value'], model='additive', period=12)

# Plot the components
plt.figure(figsize=(12, 10))
plt.subplot(411)
plt.plot(df['value'], label='Original')
plt.legend(loc='best')
plt.subplot(412)
plt.plot(result.trend, label='Trend')
plt.legend(loc='best')
plt.subplot(413)
plt.plot(result.seasonal, label='Seasonality')
plt.legend(loc='best')
plt.subplot(414)
plt.plot(result.resid, label='Residual')
plt.legend(loc='best')
plt.tight_layout()
plt.show()
Autocorrelation Analysis
Autocorrelation (ACF) and partial autocorrelation (PACF) plots help identify patterns in time series data.
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
# ACF plot (pass ax= so plot_acf draws on our figure instead of opening its own)
fig, ax = plt.subplots(figsize=(12, 6))
plot_acf(df['value'], lags=30, ax=ax)
ax.set_title('Autocorrelation Function')
plt.show()
# PACF plot
fig, ax = plt.subplots(figsize=(12, 6))
plot_pacf(df['value'], lags=30, ax=ax)
ax.set_title('Partial Autocorrelation Function')
plt.show()
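The numbers behind these plots can also be computed directly, which makes the concept less of a black box; a minimal sketch of the lag-k sample autocorrelation (the same quantity plot_acf displays, up to normalization details):

```python
import numpy as np

def autocorr(x, lag):
    """Sample autocorrelation of x at the given lag."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    # Covariance with the lagged copy, scaled by the overall variance
    return np.dot(x[:-lag], x[lag:]) / np.dot(x, x)

# An AR(1)-style series: each value keeps 80% of the previous one
rng = np.random.default_rng(1)
x = np.zeros(300)
for t in range(1, 300):
    x[t] = 0.8 * x[t - 1] + rng.normal()

print(round(autocorr(x, 1), 2))   # close to 0.8
print(round(autocorr(x, 10), 2))  # much weaker at longer lags
```

The slow geometric decay across lags is the classic ACF signature of an autoregressive process, which is what the ARIMA models below exploit.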
Time Series Forecasting Models
ARIMA Model
ARIMA (AutoRegressive Integrated Moving Average) is a popular model for time series forecasting.
from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import mean_squared_error
# Split data into train and test
train_size = int(len(df) * 0.8)
train, test = df['value'][:train_size], df['value'][train_size:]
# Fit ARIMA model
model = ARIMA(train, order=(1, 1, 1))
model_fit = model.fit()
print(model_fit.summary())
# Make predictions
predictions = model_fit.forecast(steps=len(test))
# Calculate error
mse = mean_squared_error(test, predictions)
rmse = np.sqrt(mse)
print(f'RMSE: {rmse}')
# Plot results
plt.figure(figsize=(12, 6))
plt.plot(train, label='Train')
plt.plot(test, label='Test')
plt.plot(test.index, predictions, label='Predictions')
plt.title('ARIMA Model Forecast')
plt.legend()
plt.show()
SARIMA Model
SARIMA (Seasonal ARIMA) extends ARIMA to handle seasonal time series data.
from statsmodels.tsa.statespace.sarimax import SARIMAX
# Fit SARIMA model
model = SARIMAX(train, order=(1, 1, 1), seasonal_order=(1, 1, 1, 12))
model_fit = model.fit()
print(model_fit.summary())
# Make predictions
predictions = model_fit.forecast(steps=len(test))
# Calculate error
mse = mean_squared_error(test, predictions)
rmse = np.sqrt(mse)
print(f'RMSE: {rmse}')
# Plot results
plt.figure(figsize=(12, 6))
plt.plot(train, label='Train')
plt.plot(test, label='Test')
plt.plot(test.index, predictions, label='Predictions')
plt.title('SARIMA Model Forecast')
plt.legend()
plt.show()
Prophet
Prophet is a forecasting tool developed by Facebook that handles seasonality well.
from prophet import Prophet
# Prepare data for Prophet
df_prophet = df.reset_index()
df_prophet.columns = ['ds', 'y']
# Split data
train_size = int(len(df_prophet) * 0.8)
train_prophet, test_prophet = df_prophet[:train_size], df_prophet[train_size:]
# Fit Prophet model
model = Prophet()
model.fit(train_prophet)
# Make predictions
future = model.make_future_dataframe(periods=len(test_prophet))  # defaults to daily spacing; pass freq= if the data is not daily
forecast = model.predict(future)
# Extract predictions
predictions = forecast['yhat'][-len(test_prophet):]
# Calculate error
mse = mean_squared_error(test_prophet['y'], predictions)
rmse = np.sqrt(mse)
print(f'RMSE: {rmse}')
# Plot results
fig = model.plot(forecast)
plt.title('Prophet Model Forecast')
plt.show()
# Plot components
fig2 = model.plot_components(forecast)
plt.show()
LSTM for Time Series Forecasting
Long Short-Term Memory (LSTM) networks are a type of recurrent neural network that can capture long-term dependencies in time series data.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout
from sklearn.preprocessing import MinMaxScaler
# Prepare data for LSTM (note: fitting the scaler on the full series leaks
# test information into training; in strict setups, fit on the training split only)
scaler = MinMaxScaler(feature_range=(0, 1))
data_scaled = scaler.fit_transform(df['value'].values.reshape(-1, 1))
# Create training data with lookback
lookback = 12
X, y = [], []
for i in range(lookback, len(data_scaled)):
    X.append(data_scaled[i-lookback:i, 0])
    y.append(data_scaled[i, 0])
X, y = np.array(X), np.array(y)
# Reshape for LSTM
X = np.reshape(X, (X.shape[0], X.shape[1], 1))
# Split data
train_size = int(len(X) * 0.8)
X_train, X_test = X[:train_size], X[train_size:]
y_train, y_test = y[:train_size], y[train_size:]
# Build LSTM model
model = Sequential()
model.add(LSTM(units=50, return_sequences=True, input_shape=(X_train.shape[1], 1)))
model.add(Dropout(0.2))
model.add(LSTM(units=50, return_sequences=False))
model.add(Dropout(0.2))
model.add(Dense(units=1))
# Compile and fit model
model.compile(optimizer='adam', loss='mean_squared_error')
model.fit(X_train, y_train, epochs=50, batch_size=32)
# Make predictions
predictions = model.predict(X_test)
predictions = scaler.inverse_transform(predictions)
y_test_actual = scaler.inverse_transform(y_test.reshape(-1, 1))
# Calculate error
mse = mean_squared_error(y_test_actual, predictions)
rmse = np.sqrt(mse)
print(f'RMSE: {rmse}')
# Plot results
plt.figure(figsize=(12, 6))
plt.plot(y_test_actual, label='Actual')
plt.plot(predictions, label='Predictions')
plt.title('LSTM Model Forecast')
plt.legend()
plt.show()
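All the models above were scored on a single train/test split. Walk-forward (rolling-origin) evaluation instead produces one forecast per time step and is usually a more honest estimate. A minimal sketch using a naive last-value forecast on synthetic data (any of the fitted models could take the naive forecaster's place):

```python
import numpy as np

rng = np.random.default_rng(7)
series = np.cumsum(rng.normal(0, 1, 100))  # synthetic random walk

start = 80  # first forecast origin; everything before it is initial training data
squared_errors = []
for t in range(start, len(series)):
    history = series[:t]      # the available history grows by one point each step
    prediction = history[-1]  # naive forecast: repeat the last observed value
    squared_errors.append((series[t] - prediction) ** 2)

rmse = float(np.sqrt(np.mean(squared_errors)))
print(f'Walk-forward RMSE: {rmse:.2f}')
```

The naive forecast also makes a useful baseline: a model that cannot beat last-value repetition under walk-forward evaluation is not adding value.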
Advanced Time Series Techniques
Vector Autoregression (VAR)
VAR is used to model multiple time series variables that influence each other.
from statsmodels.tsa.vector_ar.var_model import VAR
# Toy example: derive a second variable from the first
# (real VAR applications use genuinely distinct, interacting series)
data = pd.DataFrame({
    'var1': df['value'],
    'var2': df['value'].shift(1).fillna(df['value'].mean())
})
# Split data
train_size = int(len(data) * 0.8)
train, test = data[:train_size], data[train_size:]
# Fit VAR model
model = VAR(train)
model_fit = model.fit()
print(model_fit.summary())
# Make predictions
lag_order = model_fit.k_ar
predictions = model_fit.forecast(train.values[-lag_order:], steps=len(test))
# Calculate error
mse = mean_squared_error(test.values, predictions)
rmse = np.sqrt(mse)
print(f'RMSE: {rmse}')
Time Series Clustering
Clustering techniques can be used to group similar time series together.
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
# Generate multiple time series
np.random.seed(42)
n_series = 10
n_points = 100
time_series = []
for i in range(n_series):
    # Create different patterns
    if i < 3:
        # Trend
        series = np.linspace(0, 10, n_points) + np.random.randn(n_points) * 0.5
    elif i < 6:
        # Seasonal
        series = np.sin(np.linspace(0, 4 * np.pi, n_points)) * 5 + np.random.randn(n_points) * 0.5
    else:
        # Random
        series = np.random.randn(n_points) * 2
    time_series.append(series)
# Prepare data for clustering
X = np.array(time_series)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Apply K-means clustering
kmeans = KMeans(n_clusters=3, random_state=42)
clusters = kmeans.fit_predict(X_scaled)
# Plot results
plt.figure(figsize=(15, 10))
for i, cluster in enumerate(clusters):
    plt.subplot(3, 4, i+1)
    plt.plot(time_series[i])
    plt.title(f'Series {i+1}, Cluster {cluster}')
plt.tight_layout()
plt.show()
Anomaly Detection in Time Series
We can detect anomalies in time series data using various techniques.
from sklearn.ensemble import IsolationForest
# Generate data with anomaly
np.random.seed(42)
n_points = 100
# Normal data
series = np.sin(np.linspace(0, 4 * np.pi, n_points)) * 5 + np.random.randn(n_points) * 0.5
# Add anomaly
series[50] = 20
# Prepare data
X = series.reshape(-1, 1)
# Fit isolation forest
model = IsolationForest(contamination=0.05, random_state=42)
anomalies = model.fit_predict(X)
# Plot results
plt.figure(figsize=(12, 6))
plt.plot(series, label='Time Series')
anomaly_indices = np.where(anomalies == -1)[0]
plt.scatter(anomaly_indices, series[anomaly_indices], color='red', label='Anomalies')
plt.title('Anomaly Detection in Time Series')
plt.legend()
plt.show()
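A simpler, model-free alternative to the isolation forest is a rolling robust z-score: flag points that sit far from a local rolling median. A sketch on the same synthetic spike (the window size and the threshold of 5 are assumptions to tune):

```python
import numpy as np
import pandas as pd

np.random.seed(42)
n_points = 100
series = np.sin(np.linspace(0, 4 * np.pi, n_points)) * 5 + np.random.randn(n_points) * 0.5
series[50] = 20  # injected spike, as in the isolation-forest example

s = pd.Series(series)
window = 15
median = s.rolling(window, center=True, min_periods=1).median()
# Median absolute deviation: a spread estimate robust to the outlier itself
# (1.4826 rescales it to be comparable to a standard deviation for normal data)
mad = s.rolling(window, center=True, min_periods=1).apply(
    lambda w: np.median(np.abs(w - np.median(w)))
)
z = (s - median) / (1.4826 * mad)

anomaly_idx = np.where(z.abs() > 5)[0]
print('Anomalies at indices:', anomaly_idx)
```

Using the median and MAD rather than mean and standard deviation matters here: the spike would otherwise inflate its own window's statistics and partly mask itself.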
Practice Case: Sales Forecasting
In this practice case, we'll use time series analysis to forecast sales data.
Step 1: Load and Explore Data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Load sales data
df = pd.read_csv('sales_data.csv', parse_dates=['date'], index_col='date')
# Explore data
print(df.head())
print(df.info())
# Plot sales data
plt.figure(figsize=(12, 6))
plt.plot(df['sales'])
plt.title('Sales Data')
plt.xlabel('Date')
plt.ylabel('Sales')
plt.show()
Step 2: Data Preprocessing
# Check for missing values
print(df.isnull().sum())
# Fill missing values with forward fill
df = df.ffill()
# Resample to monthly frequency
monthly_sales = df.resample('M').sum()
# Plot monthly sales
plt.figure(figsize=(12, 6))
plt.plot(monthly_sales['sales'])
plt.title('Monthly Sales')
plt.xlabel('Date')
plt.ylabel('Sales')
plt.show()
Step 3: Decompose Time Series
from statsmodels.tsa.seasonal import seasonal_decompose

# Decompose the time series
result = seasonal_decompose(monthly_sales['sales'], model='additive', period=12)

# Plot components
plt.figure(figsize=(12, 10))
plt.subplot(411)
plt.plot(monthly_sales['sales'], label='Original')
plt.legend(loc='best')
plt.subplot(412)
plt.plot(result.trend, label='Trend')
plt.legend(loc='best')
plt.subplot(413)
plt.plot(result.seasonal, label='Seasonality')
plt.legend(loc='best')
plt.subplot(414)
plt.plot(result.resid, label='Residual')
plt.legend(loc='best')
plt.tight_layout()
plt.show()
Step 4: Fit SARIMA Model
from statsmodels.tsa.statespace.sarimax import SARIMAX
from sklearn.metrics import mean_squared_error
# Split data
train_size = int(len(monthly_sales) * 0.8)
train, test = monthly_sales['sales'][:train_size], monthly_sales['sales'][train_size:]
# Fit SARIMA model
model = SARIMAX(train, order=(1, 1, 1), seasonal_order=(1, 1, 1, 12))
model_fit = model.fit()
print(model_fit.summary())
# Make predictions
predictions = model_fit.forecast(steps=len(test))
# Calculate error
mse = mean_squared_error(test, predictions)
rmse = np.sqrt(mse)
print(f'RMSE: {rmse}')
# Plot results
plt.figure(figsize=(12, 6))
plt.plot(train, label='Train')
plt.plot(test, label='Test')
plt.plot(test.index, predictions, label='Predictions')
plt.title('SARIMA Model Forecast')
plt.legend()
plt.show()
Step 5: Fit Prophet Model
from prophet import Prophet
# Prepare data for Prophet
df_prophet = monthly_sales.reset_index()
df_prophet.columns = ['ds', 'y']
# Split data
train_size = int(len(df_prophet) * 0.8)
train_prophet, test_prophet = df_prophet[:train_size], df_prophet[train_size:]
# Fit Prophet model
model = Prophet()
model.fit(train_prophet)
# Make predictions
future = model.make_future_dataframe(periods=len(test_prophet), freq='M')
forecast = model.predict(future)
# Extract predictions
predictions = forecast['yhat'][-len(test_prophet):]
# Calculate error
mse = mean_squared_error(test_prophet['y'], predictions)
rmse = np.sqrt(mse)
print(f'RMSE: {rmse}')
# Plot results
fig = model.plot(forecast)
plt.title('Prophet Model Forecast')
plt.show()
# Plot components
fig2 = model.plot_components(forecast)
plt.show()
Interactive Exercises
Exercise 1: Time Series Exploration
Load the provided sales data, explore its characteristics, and identify any trends or seasonality patterns.
Exercise 2: ARIMA Model Tuning
Use the sales data to find the optimal ARIMA model parameters using grid search, and evaluate the model's performance.
Exercise 3: LSTM Forecasting
Implement an LSTM model to forecast the sales data, and compare its performance with ARIMA and Prophet models.
Exercise 4: Anomaly Detection
Add some anomalies to the sales data and use isolation forest to detect them.