Introduction to Time Series Analysis
Time series analysis is a statistical technique used to analyze and model data points collected over time. It involves studying the patterns, trends, and seasonal variations in the data to make predictions about future values.
Characteristics of Time Series Data
- Trend: A long-term increase or decrease in the data
- Seasonality: Regular patterns that repeat over a specific period (e.g., daily, monthly, yearly)
- Cyclical Patterns: Fluctuations that occur over longer periods than seasonality
- Irregularity: Random variations or noise in the data
- Stationarity: Statistical properties (mean, variance) that remain constant over time
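These components can be made concrete with a small synthetic series; a sketch in which the trend, seasonal, and noise terms are made-up choices for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 120  # ten years of monthly observations

trend = np.linspace(0, 10, n)                         # long-term increase
seasonal = 2 * np.sin(2 * np.pi * np.arange(n) / 12)  # pattern repeating every 12 points
noise = rng.normal(0, 0.5, n)                         # irregular component

series = pd.Series(
    trend + seasonal + noise,
    index=pd.date_range('2015-01-01', periods=n, freq='MS'),
)
print(series.head())
```

Plotting `series` shows all three effects at once; decomposition techniques later in this section aim to recover exactly these pieces from real data.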
Applications of Time Series Analysis
- Financial forecasting (stock prices, market trends)
- Weather and climate prediction
- Demand forecasting (retail, manufacturing)
- Energy consumption prediction
- Economic indicators analysis
- Healthcare data monitoring
Time Series Data Preprocessing
Before analyzing time series data, it's important to preprocess it to ensure quality and suitability for modeling.
Data Loading and Exploration
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Load time series data
df = pd.read_csv('time_series_data.csv', parse_dates=['date'], index_col='date')
# Explore data
print(df.head())
print(df.info())
print(df.describe())
# Plot the time series
plt.figure(figsize=(12, 6))
plt.plot(df)
plt.title('Time Series Data')
plt.xlabel('Date')
plt.ylabel('Value')
plt.show()
Handling Missing Values
# Check for missing values
print(df.isnull().sum())

# Fill missing values with forward fill
df_filled = df.ffill()

# Fill missing values with backward fill
df_filled = df.bfill()

# Fill missing values with the column mean
df_filled = df.fillna(df.mean())

# Interpolate missing values
df_filled = df.interpolate()
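The fill strategies above behave quite differently, which is easiest to see on a tiny series with a two-value gap:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])

print(s.ffill().tolist())           # [1.0, 1.0, 1.0, 4.0]  carries the last value forward
print(s.bfill().tolist())           # [1.0, 4.0, 4.0, 4.0]  pulls the next value backward
print(s.fillna(s.mean()).tolist())  # [1.0, 2.5, 2.5, 4.0]  constant fill, ignores ordering
print(s.interpolate().tolist())     # [1.0, 2.0, 3.0, 4.0]  linear between known points
```

For smooth time series, interpolation usually gives the most plausible values; forward fill is the safe default when future information must not leak backward in time.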
Resampling Time Series Data
# Resample to monthly frequency (mean)
monthly_data = df.resample('M').mean()
# Resample to quarterly frequency (sum)
quarterly_data = df.resample('Q').sum()
# Resample to annual frequency (max)
annual_data = df.resample('A').max()
# Note: pandas 2.2+ prefers the 'ME', 'QE', and 'YE' aliases for these frequencies
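Resampling also works in the upsampling direction (e.g., monthly to daily), where the new rows start out as NaN and typically need interpolation; a sketch with made-up monthly values:

```python
import pandas as pd

# A small monthly series (hypothetical values, for illustration)
monthly = pd.Series(
    [10.0, 12.0, 11.0, 15.0],
    index=pd.to_datetime(['2024-01-31', '2024-02-29', '2024-03-31', '2024-04-30']),
)

# Upsampling inserts one row per day; the new rows are NaN until filled
daily = monthly.resample('D').asfreq()
daily_filled = daily.interpolate()
print(daily_filled.head())
```

`asfreq()` makes the gap explicit; `interpolate()` then draws straight lines between the known monthly points.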
Stationarity Testing
Many time series models require the data to be stationary. We can test for stationarity using the Augmented Dickey-Fuller (ADF) test.
from statsmodels.tsa.stattools import adfuller
# ADF test
result = adfuller(df['value'])
print('ADF Statistic:', result[0])
print('p-value:', result[1])
print('Critical Values:', result[4])
# Interpret the result
if result[1] < 0.05:
    print('Data is stationary')
else:
    print('Data is not stationary')
Making Data Stationary
# Differencing to remove trend
df_diff = df.diff().dropna()

# Log transformation to stabilize variance
df_log = np.log(df)
df_log_diff = df_log.diff().dropna()

# Seasonal differencing (lag 12 for monthly data)
df_seasonal_diff = df.diff(12).dropna()
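Why differencing helps can be seen on a synthetic trending series: the first difference of a linear trend is roughly constant, so the differenced series has a stable mean. A sketch with made-up numbers:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
trend_series = 0.5 * np.arange(n) + rng.normal(0, 1, n)  # slope 0.5 plus noise

# First difference, the NumPy analogue of df.diff().dropna()
diff = np.diff(trend_series)

# The level keeps growing, but the differenced series hovers around the slope
print(round(float(trend_series.mean()), 2))
print(round(float(diff.mean()), 2))  # close to the slope, 0.5
```

Re-running the ADF test on the differenced series should now reject the unit-root null; the number of differencing passes needed is the d in ARIMA's (p, d, q).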
Time Series Analysis Techniques
Decomposition
Time series decomposition separates a time series into its components: trend, seasonality, and residual.
from statsmodels.tsa.seasonal import seasonal_decompose

# Decompose the time series
result = seasonal_decompose(df['value'], model='additive', period=12)

# Plot the components
plt.figure(figsize=(12, 10))
plt.subplot(411)
plt.plot(df['value'], label='Original')
plt.legend(loc='best')
plt.subplot(412)
plt.plot(result.trend, label='Trend')
plt.legend(loc='best')
plt.subplot(413)
plt.plot(result.seasonal, label='Seasonality')
plt.legend(loc='best')
plt.subplot(414)
plt.plot(result.resid, label='Residual')
plt.legend(loc='best')
plt.tight_layout()
plt.show()
Autocorrelation Analysis
Autocorrelation (ACF) and partial autocorrelation (PACF) plots help identify patterns in time series data.
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
# ACF plot (pass ax= so plot_acf draws on our figure instead of opening its own)
fig, ax = plt.subplots(figsize=(12, 6))
plot_acf(df['value'], lags=30, ax=ax)
ax.set_title('Autocorrelation Function')
plt.show()
# PACF plot
fig, ax = plt.subplots(figsize=(12, 6))
plot_pacf(df['value'], lags=30, ax=ax)
ax.set_title('Partial Autocorrelation Function')
plt.show()
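The numbers behind these plots can also be computed directly, which makes the concept less of a black box; a minimal sketch of the lag-k sample autocorrelation (the same quantity plot_acf displays, up to normalization details):

```python
import numpy as np

def autocorr(x, lag):
    """Sample autocorrelation of x at the given lag."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    # Covariance with the lagged copy, scaled by the overall variance
    return np.dot(x[:-lag], x[lag:]) / np.dot(x, x)

# An AR(1)-style series: each value keeps 80% of the previous one
rng = np.random.default_rng(1)
x = np.zeros(300)
for t in range(1, 300):
    x[t] = 0.8 * x[t - 1] + rng.normal()

print(round(autocorr(x, 1), 2))   # close to 0.8
print(round(autocorr(x, 10), 2))  # much weaker at longer lags
```

The slow geometric decay across lags is the classic ACF signature of an autoregressive process, which is what the ARIMA models below exploit.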
Time Series Forecasting Models
ARIMA Model
ARIMA (AutoRegressive Integrated Moving Average) is a popular model for time series forecasting.
from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import mean_squared_error
# Split data into train and test
train_size = int(len(df) * 0.8)
train, test = df['value'][:train_size], df['value'][train_size:]
# Fit ARIMA model
model = ARIMA(train, order=(1, 1, 1))
model_fit = model.fit()
print(model_fit.summary())
# Make predictions
predictions = model_fit.forecast(steps=len(test))
# Calculate error
mse = mean_squared_error(test, predictions)
rmse = np.sqrt(mse)
print(f'RMSE: {rmse}')
# Plot results
plt.figure(figsize=(12, 6))
plt.plot(train, label='Train')
plt.plot(test, label='Test')
plt.plot(test.index, predictions, label='Predictions')
plt.title('ARIMA Model Forecast')
plt.legend()
plt.show()
SARIMA Model
SARIMA (Seasonal ARIMA) extends ARIMA to handle seasonal time series data.
from statsmodels.tsa.statespace.sarimax import SARIMAX
# Fit SARIMA model
model = SARIMAX(train, order=(1, 1, 1), seasonal_order=(1, 1, 1, 12))
model_fit = model.fit()
print(model_fit.summary())
# Make predictions
predictions = model_fit.forecast(steps=len(test))
# Calculate error
mse = mean_squared_error(test, predictions)
rmse = np.sqrt(mse)
print(f'RMSE: {rmse}')
# Plot results
plt.figure(figsize=(12, 6))
plt.plot(train, label='Train')
plt.plot(test, label='Test')
plt.plot(test.index, predictions, label='Predictions')
plt.title('SARIMA Model Forecast')
plt.legend()
plt.show()
Prophet
Prophet is a forecasting tool developed by Facebook that handles seasonality well.
from prophet import Prophet
# Prepare data for Prophet
df_prophet = df.reset_index()
df_prophet.columns = ['ds', 'y']
# Split data
train_size = int(len(df_prophet) * 0.8)
train_prophet, test_prophet = df_prophet[:train_size], df_prophet[train_size:]
# Fit Prophet model
model = Prophet()
model.fit(train_prophet)
# Make predictions
future = model.make_future_dataframe(periods=len(test_prophet))  # defaults to daily spacing; pass freq= if the data is not daily
forecast = model.predict(future)
# Extract predictions
predictions = forecast['yhat'][-len(test_prophet):]
# Calculate error
mse = mean_squared_error(test_prophet['y'], predictions)
rmse = np.sqrt(mse)
print(f'RMSE: {rmse}')
# Plot results
fig = model.plot(forecast)
plt.title('Prophet Model Forecast')
plt.show()
# Plot components
fig2 = model.plot_components(forecast)
plt.show()
LSTM for Time Series Forecasting
Long Short-Term Memory (LSTM) networks are a type of recurrent neural network that can capture long-term dependencies in time series data.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout
from sklearn.preprocessing import MinMaxScaler
# Prepare data for LSTM (note: fitting the scaler on the full series leaks
# test information into training; in strict setups, fit on the training split only)
scaler = MinMaxScaler(feature_range=(0, 1))
data_scaled = scaler.fit_transform(df['value'].values.reshape(-1, 1))
# Create training data with lookback
lookback = 12
X, y = [], []
for i in range(lookback, len(data_scaled)):
    X.append(data_scaled[i-lookback:i, 0])
    y.append(data_scaled[i, 0])
X, y = np.array(X), np.array(y)
# Reshape for LSTM
X = np.reshape(X, (X.shape[0], X.shape[1], 1))
# Split data
train_size = int(len(X) * 0.8)
X_train, X_test = X[:train_size], X[train_size:]
y_train, y_test = y[:train_size], y[train_size:]
# Build LSTM model
model = Sequential()
model.add(LSTM(units=50, return_sequences=True, input_shape=(X_train.shape[1], 1)))
model.add(Dropout(0.2))
model.add(LSTM(units=50, return_sequences=False))
model.add(Dropout(0.2))
model.add(Dense(units=1))
# Compile and fit model
model.compile(optimizer='adam', loss='mean_squared_error')
model.fit(X_train, y_train, epochs=50, batch_size=32)
# Make predictions
predictions = model.predict(X_test)
predictions = scaler.inverse_transform(predictions)
y_test_actual = scaler.inverse_transform(y_test.reshape(-1, 1))
# Calculate error
mse = mean_squared_error(y_test_actual, predictions)
rmse = np.sqrt(mse)
print(f'RMSE: {rmse}')
# Plot results
plt.figure(figsize=(12, 6))
plt.plot(y_test_actual, label='Actual')
plt.plot(predictions, label='Predictions')
plt.title('LSTM Model Forecast')
plt.legend()
plt.show()
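All the models above were scored on a single train/test split. Walk-forward (rolling-origin) evaluation instead produces one forecast per time step and is usually a more honest estimate. A minimal sketch using a naive last-value forecast on synthetic data (any of the fitted models could take the naive forecaster's place):

```python
import numpy as np

rng = np.random.default_rng(7)
series = np.cumsum(rng.normal(0, 1, 100))  # synthetic random walk

start = 80  # first forecast origin; everything before it is initial training data
squared_errors = []
for t in range(start, len(series)):
    history = series[:t]      # the available history grows by one point each step
    prediction = history[-1]  # naive forecast: repeat the last observed value
    squared_errors.append((series[t] - prediction) ** 2)

rmse = float(np.sqrt(np.mean(squared_errors)))
print(f'Walk-forward RMSE: {rmse:.2f}')
```

The naive forecast also makes a useful baseline: a model that cannot beat last-value repetition under walk-forward evaluation is not adding value.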
Advanced Time Series Techniques
Vector Autoregression (VAR)
VAR is used to model multiple time series variables that influence each other.
from statsmodels.tsa.vector_ar.var_model import VAR
# Toy example: derive a second variable from the first
# (real VAR applications use genuinely distinct, interacting series)
data = pd.DataFrame({
    'var1': df['value'],
    'var2': df['value'].shift(1).fillna(df['value'].mean())
})
# Split data
train_size = int(len(data) * 0.8)
train, test = data[:train_size], data[train_size:]
# Fit VAR model
model = VAR(train)
model_fit = model.fit()
print(model_fit.summary())
# Make predictions
lag_order = model_fit.k_ar
predictions = model_fit.forecast(train.values[-lag_order:], steps=len(test))
# Calculate error
mse = mean_squared_error(test.values, predictions)
rmse = np.sqrt(mse)
print(f'RMSE: {rmse}')
Time Series Clustering
Clustering techniques can be used to group similar time series together.
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
# Generate multiple time series
np.random.seed(42)
n_series = 10
n_points = 100
time_series = []
for i in range(n_series):
    # Create different patterns
    if i < 3:
        # Trend
        series = np.linspace(0, 10, n_points) + np.random.randn(n_points) * 0.5
    elif i < 6:
        # Seasonal
        series = np.sin(np.linspace(0, 4 * np.pi, n_points)) * 5 + np.random.randn(n_points) * 0.5
    else:
        # Random
        series = np.random.randn(n_points) * 2
    time_series.append(series)
# Prepare data for clustering
X = np.array(time_series)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Apply K-means clustering
kmeans = KMeans(n_clusters=3, random_state=42)
clusters = kmeans.fit_predict(X_scaled)
# Plot results
plt.figure(figsize=(15, 10))
for i, cluster in enumerate(clusters):
    plt.subplot(3, 4, i+1)
    plt.plot(time_series[i])
    plt.title(f'Series {i+1}, Cluster {cluster}')
plt.tight_layout()
plt.show()
Anomaly Detection in Time Series
We can detect anomalies in time series data using various techniques.
from sklearn.ensemble import IsolationForest
# Generate data with anomaly
np.random.seed(42)
n_points = 100
# Normal data
series = np.sin(np.linspace(0, 4 * np.pi, n_points)) * 5 + np.random.randn(n_points) * 0.5
# Add anomaly
series[50] = 20
# Prepare data
X = series.reshape(-1, 1)
# Fit isolation forest
model = IsolationForest(contamination=0.05, random_state=42)
anomalies = model.fit_predict(X)
# Plot results
plt.figure(figsize=(12, 6))
plt.plot(series, label='Time Series')
anomaly_indices = np.where(anomalies == -1)[0]
plt.scatter(anomaly_indices, series[anomaly_indices], color='red', label='Anomalies')
plt.title('Anomaly Detection in Time Series')
plt.legend()
plt.show()
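A simpler, model-free alternative to the isolation forest is a rolling robust z-score: flag points that sit far from a local rolling median. A sketch on the same synthetic spike (the window size and the threshold of 5 are assumptions to tune):

```python
import numpy as np
import pandas as pd

np.random.seed(42)
n_points = 100
series = np.sin(np.linspace(0, 4 * np.pi, n_points)) * 5 + np.random.randn(n_points) * 0.5
series[50] = 20  # injected spike, as in the isolation-forest example

s = pd.Series(series)
window = 15
median = s.rolling(window, center=True, min_periods=1).median()
# Median absolute deviation: a spread estimate robust to the outlier itself
# (1.4826 rescales it to be comparable to a standard deviation for normal data)
mad = s.rolling(window, center=True, min_periods=1).apply(
    lambda w: np.median(np.abs(w - np.median(w)))
)
z = (s - median) / (1.4826 * mad)

anomaly_idx = np.where(z.abs() > 5)[0]
print('Anomalies at indices:', anomaly_idx)
```

Using the median and MAD rather than mean and standard deviation matters here: the spike would otherwise inflate its own window's statistics and partly mask itself.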
Practice Case: Sales Forecasting
In this practice case, we'll use time series analysis to forecast sales data.
Step 1: Load and Explore Data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Load sales data
df = pd.read_csv('sales_data.csv', parse_dates=['date'], index_col='date')
# Explore data
print(df.head())
print(df.info())
# Plot sales data
plt.figure(figsize=(12, 6))
plt.plot(df['sales'])
plt.title('Sales Data')
plt.xlabel('Date')
plt.ylabel('Sales')
plt.show()
Step 2: Data Preprocessing
# Check for missing values
print(df.isnull().sum())
# Fill missing values with forward fill
df = df.ffill()
# Resample to monthly frequency
monthly_sales = df.resample('M').sum()
# Plot monthly sales
plt.figure(figsize=(12, 6))
plt.plot(monthly_sales['sales'])
plt.title('Monthly Sales')
plt.xlabel('Date')
plt.ylabel('Sales')
plt.show()
Step 3: Decompose Time Series
from statsmodels.tsa.seasonal import seasonal_decompose

# Decompose the time series
result = seasonal_decompose(monthly_sales['sales'], model='additive', period=12)

# Plot components
plt.figure(figsize=(12, 10))
plt.subplot(411)
plt.plot(monthly_sales['sales'], label='Original')
plt.legend(loc='best')
plt.subplot(412)
plt.plot(result.trend, label='Trend')
plt.legend(loc='best')
plt.subplot(413)
plt.plot(result.seasonal, label='Seasonality')
plt.legend(loc='best')
plt.subplot(414)
plt.plot(result.resid, label='Residual')
plt.legend(loc='best')
plt.tight_layout()
plt.show()
Step 4: Fit SARIMA Model
from statsmodels.tsa.statespace.sarimax import SARIMAX
from sklearn.metrics import mean_squared_error
# Split data
train_size = int(len(monthly_sales) * 0.8)
train, test = monthly_sales['sales'][:train_size], monthly_sales['sales'][train_size:]
# Fit SARIMA model
model = SARIMAX(train, order=(1, 1, 1), seasonal_order=(1, 1, 1, 12))
model_fit = model.fit()
print(model_fit.summary())
# Make predictions
predictions = model_fit.forecast(steps=len(test))
# Calculate error
mse = mean_squared_error(test, predictions)
rmse = np.sqrt(mse)
print(f'RMSE: {rmse}')
# Plot results
plt.figure(figsize=(12, 6))
plt.plot(train, label='Train')
plt.plot(test, label='Test')
plt.plot(test.index, predictions, label='Predictions')
plt.title('SARIMA Model Forecast')
plt.legend()
plt.show()
Step 5: Fit Prophet Model
from prophet import Prophet
# Prepare data for Prophet
df_prophet = monthly_sales.reset_index()
df_prophet.columns = ['ds', 'y']
# Split data
train_size = int(len(df_prophet) * 0.8)
train_prophet, test_prophet = df_prophet[:train_size], df_prophet[train_size:]
# Fit Prophet model
model = Prophet()
model.fit(train_prophet)
# Make predictions
future = model.make_future_dataframe(periods=len(test_prophet), freq='M')
forecast = model.predict(future)
# Extract predictions
predictions = forecast['yhat'][-len(test_prophet):]
# Calculate error
mse = mean_squared_error(test_prophet['y'], predictions)
rmse = np.sqrt(mse)
print(f'RMSE: {rmse}')
# Plot results
fig = model.plot(forecast)
plt.title('Prophet Model Forecast')
plt.show()
# Plot components
fig2 = model.plot_components(forecast)
plt.show()
Interactive Exercises
Exercise 1: Time Series Exploration
Load the provided sales data, explore its characteristics, and identify any trends or seasonality patterns.
Exercise 2: ARIMA Model Tuning
Use the sales data to find the optimal ARIMA model parameters using grid search, and evaluate the model's performance.
Exercise 3: LSTM Forecasting
Implement an LSTM model to forecast the sales data, and compare its performance with ARIMA and Prophet models.
Exercise 4: Anomaly Detection
Add some anomalies to the sales data and use isolation forest to detect them.