Supervised Learning Overview
Supervised learning is a major branch of machine learning that uses labeled training data to learn a mapping from inputs to outputs. Each training sample consists of input features and a corresponding target label.
Supervised learning is mainly divided into two categories:
- Classification: Predicting discrete class labels, such as spam detection, image classification, etc.
- Regression: Predicting continuous numerical outputs, such as house price prediction, stock price prediction, etc.
Classification Algorithms
1. Logistic Regression
Logistic regression is a linear model used for binary classification problems. It uses the sigmoid function to map the linear combination results to the [0, 1] interval, representing the probability that a sample belongs to the positive class.
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Load dataset
iris = load_iris()
X = iris.data[:, :2] # Use only the first two features
y = (iris.target == 0).astype(int) # Binary classification: Is it a setosa?
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create and train model
model = LogisticRegression()
model.fit(X_train, y_train)
# Predict
y_pred = model.predict(X_test)
# Evaluate
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
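The sigmoid mapping described above can be checked directly: for a binary problem, `predict_proba` is exactly the sigmoid applied to the model's linear decision function. A minimal sketch reusing the same iris setup (`max_iter` is raised here only to ensure convergence on this separable problem):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X = iris.data[:, :2]
y = (iris.target == 0).astype(int)

model = LogisticRegression(max_iter=1000)
model.fit(X, y)

# decision_function returns the linear combination w·x + b
z = model.decision_function(X[:5])
# Applying the sigmoid by hand reproduces the positive-class probability
manual = 1 / (1 + np.exp(-z))
print(np.allclose(manual, model.predict_proba(X[:5])[:, 1]))  # True
```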
2. Decision Tree
Decision tree is a tree-based classification algorithm that recursively partitions the dataset into subsets until each subset is (nearly) pure or a stopping criterion, such as a maximum depth, is reached.
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
# Load dataset
cancer = load_breast_cancer()
X = cancer.data
y = cancer.target
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create and train model
tree_model = DecisionTreeClassifier(max_depth=3, random_state=42)
tree_model.fit(X_train, y_train)
# Predict
y_pred = tree_model.predict(X_test)
# Evaluate
print(classification_report(y_test, y_pred, target_names=cancer.target_names))
3. Random Forest
Random forest is an ensemble learning method that trains many decision trees on random subsets of the data and features, then combines them by majority vote (classification) or averaging (regression), which improves prediction performance and reduces overfitting.
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt
# Load dataset
digits = load_digits()
X = digits.data
y = digits.target
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create and train model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
# Predict
y_pred = rf_model.predict(X_test)
# Evaluate
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(10, 8))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()
4. Support Vector Machine (SVM)
Support Vector Machine is a powerful classification algorithm that finds the optimal hyperplane to maximize the margin between different classes.
from sklearn.svm import SVC
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
# Load dataset
wine = load_wine()
X = wine.data
y = wine.target
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create and train model
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', SVC(kernel='rbf', C=1.0, gamma='scale'))
])
pipeline.fit(X_train, y_train)
# Predict
y_pred = pipeline.predict(X_test)
# Evaluate
accuracy = pipeline.score(X_test, y_test)
print(f"Accuracy: {accuracy:.2f}")
Regression Algorithms
1. Linear Regression
Linear regression is a linear model used for predicting continuous values. It assumes a linear relationship between input features and the target variable.
from sklearn.linear_model import LinearRegression
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
# Load dataset (load_boston was removed in scikit-learn 1.2; use California Housing instead)
housing = fetch_california_housing()
X = housing.data
y = housing.target
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create and train model
model = LinearRegression()
model.fit(X_train, y_train)
# Predict
y_pred = model.predict(X_test)
# Evaluate
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MSE: {mse:.2f}")
print(f"R²: {r2:.2f}")
2. Ridge Regression
Ridge regression is a regularized version of linear regression that reduces overfitting by adding an L2 regularization term.
from sklearn.linear_model import Ridge
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, r2_score
# Load dataset
diabetes = load_diabetes()
X = diabetes.data
y = diabetes.target
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create and train model
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('ridge', Ridge(alpha=1.0))
])
pipeline.fit(X_train, y_train)
# Predict
y_pred = pipeline.predict(X_test)
# Evaluate
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MSE: {mse:.2f}")
print(f"R²: {r2:.2f}")
3. Lasso Regression
Lasso regression is another regularized linear regression that uses L1 regularization and can automatically perform feature selection.
from sklearn.linear_model import Lasso
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, r2_score
# Load dataset
diabetes = load_diabetes()
X = diabetes.data
y = diabetes.target
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create and train model
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('lasso', Lasso(alpha=0.1))
])
pipeline.fit(X_train, y_train)
# Predict
y_pred = pipeline.predict(X_test)
# Evaluate
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MSE: {mse:.2f}")
print(f"R²: {r2:.2f}")
# View feature coefficients
lasso_model = pipeline.named_steps['lasso']
print("Feature coefficients:")
for i, coef in enumerate(lasso_model.coef_):
    print(f"Feature {i}: {coef:.4f}")
4. Decision Tree Regression
Decision tree regression is a tree-based regression algorithm that divides the feature space into multiple regions and predicts a constant value for each region.
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
# Load dataset (load_boston was removed in scikit-learn 1.2; use California Housing instead)
housing = fetch_california_housing()
X = housing.data
y = housing.target
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create and train model
model = DecisionTreeRegressor(max_depth=5, random_state=42)
model.fit(X_train, y_train)
# Predict
y_pred = model.predict(X_test)
# Evaluate
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MSE: {mse:.2f}")
print(f"R²: {r2:.2f}")
Model Evaluation
Classification Model Evaluation Metrics
- Accuracy: The proportion of correctly predicted samples to the total number of samples
- Precision: The proportion of samples predicted as positive that are actually positive
- Recall: The proportion of actually positive samples that are correctly predicted
- F1 Score: The harmonic mean of precision and recall
- Confusion Matrix: Shows the correspondence between model predictions and actual labels
- ROC Curve and AUC: Evaluate model performance at different thresholds
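To make these metrics concrete, here is a small sketch that computes each of them on hand-made label vectors (the values are invented purely for illustration):

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, confusion_matrix)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # 3 TP, 3 TN, 1 FP, 1 FN

print(f"Accuracy:  {accuracy_score(y_true, y_pred):.2f}")   # 6/8 = 0.75
print(f"Precision: {precision_score(y_true, y_pred):.2f}")  # 3/4 = 0.75
print(f"Recall:    {recall_score(y_true, y_pred):.2f}")     # 3/4 = 0.75
print(f"F1:        {f1_score(y_true, y_pred):.2f}")         # harmonic mean = 0.75
print(confusion_matrix(y_true, y_pred))  # rows = true, columns = predicted
```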
Regression Model Evaluation Metrics
- Mean Squared Error (MSE): The average of the squared differences between predicted and actual values
- Root Mean Squared Error (RMSE): The square root of MSE
- Mean Absolute Error (MAE): The average of the absolute differences between predicted and actual values
- R² Score: The proportion of variance explained by the model
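Each regression metric follows directly from its definition; the sketch below (with invented values) computes all four by hand, and the results can be checked against the corresponding `sklearn.metrics` functions:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])  # illustrative values
y_pred = np.array([2.5, 5.0, 3.0, 8.0])

mse = np.mean((y_true - y_pred) ** 2)           # average squared error
rmse = np.sqrt(mse)                             # same units as the target
mae = np.mean(np.abs(y_true - y_pred))          # average absolute error
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot                        # fraction of variance explained

print(f"MSE: {mse:.3f}, RMSE: {rmse:.3f}, MAE: {mae:.3f}, R²: {r2:.3f}")
# MSE: 0.375, RMSE: 0.612, MAE: 0.500, R²: 0.882
```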
Practice Case: House Price Prediction
In this practice case, we will use the California Housing dataset to predict house prices. We will try multiple regression algorithms and compare their performance.
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
# Load dataset (load_boston was removed in scikit-learn 1.2; use California Housing instead)
housing = fetch_california_housing()
X = housing.data
y = housing.target
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Data standardization
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Define models
models = {
    'Linear Regression': LinearRegression(),
    'Ridge Regression': Ridge(alpha=1.0),
    'Lasso Regression': Lasso(alpha=0.1),
    'Decision Tree': DecisionTreeRegressor(max_depth=5, random_state=42),
    'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42)
}
# Train and evaluate models
results = {}
for name, model in models.items():
    # Train model
    model.fit(X_train_scaled, y_train)
    # Predict
    y_pred = model.predict(X_test_scaled)
    # Evaluate
    mse = mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_test, y_pred)
    # Cross-validation
    cv_scores = cross_val_score(model, X_train_scaled, y_train, cv=5, scoring='r2')
    cv_mean = cv_scores.mean()
    results[name] = {
        'MSE': mse,
        'RMSE': rmse,
        'R²': r2,
        'CV Mean R²': cv_mean
    }
# Print results
print("Model Performance Comparison:")
print("-" * 80)
for name, metrics in results.items():
    print(f"{name}:")
    print(f"  MSE: {metrics['MSE']:.2f}")
    print(f"  RMSE: {metrics['RMSE']:.2f}")
    print(f"  R²: {metrics['R²']:.2f}")
    print(f"  CV Mean R²: {metrics['CV Mean R²']:.2f}")
    print("-" * 80)
Interactive Exercises
Exercise 1: Classification Model Comparison
Using scikit-learn's digits dataset, compare the performance of logistic regression, SVM, and random forest classification algorithms.
- Load the digits dataset
- Split into training and test sets (80% training, 20% test)
- Standardize the data
- Train the three classification models
- Calculate and compare accuracy, precision, recall, and F1 score
- Plot confusion matrices
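As a hedged starting point for this exercise (the model settings below are illustrative defaults, not prescribed answers), the comparison loop might look like:

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=42)

# Scaling is bundled into a pipeline where the model benefits from it
models = {
    'Logistic Regression': make_pipeline(StandardScaler(),
                                         LogisticRegression(max_iter=1000)),
    'SVM': make_pipeline(StandardScaler(), SVC()),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
}
for name, clf in models.items():
    clf.fit(X_train, y_train)
    f1 = f1_score(y_test, clf.predict(X_test), average='macro')
    print(f"{name}: macro F1 = {f1:.3f}")
```

Extending this loop with accuracy, precision, recall, and confusion-matrix plots completes the exercise.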
Exercise 2: Regularization Parameter Tuning
Using the diabetes dataset, tune the alpha parameter of the Ridge regression model.
- Load the diabetes dataset
- Split into training and test sets
- Use GridSearchCV or RandomizedSearchCV to search for the optimal alpha value
- Evaluate the performance of the optimal model
- Analyze the effect of different alpha values on model performance
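A possible skeleton for the grid-search step (the alpha grid below is an illustrative assumption, not the required one):

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

diabetes = load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(
    diabetes.data, diabetes.target, test_size=0.2, random_state=42)

# Exhaustively try each candidate alpha with 5-fold cross-validation
search = GridSearchCV(Ridge(),
                      {'alpha': [0.01, 0.1, 1.0, 10.0, 100.0]},
                      cv=5, scoring='r2')
search.fit(X_train, y_train)

print(f"Best alpha: {search.best_params_['alpha']}")
print(f"Test R²: {search.best_estimator_.score(X_test, y_test):.2f}")
```

Inspecting `search.cv_results_` shows how the cross-validated R² varies across the alpha grid.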
Exercise 3: Feature Importance Analysis
Using the California housing dataset (load_boston was removed from scikit-learn, so use fetch_california_housing), analyze the importance of each feature in a random forest regression model.
- Load the California housing dataset
- Train a random forest regression model
- Get feature importance scores
- Sort features by importance
- Visualize feature importance
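The importance-extraction steps can be sketched as follows; the bundled diabetes dataset is used here only so the snippet runs without a download, and you would swap in the housing data for the exercise itself:

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

data = load_diabetes()
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(data.data, data.target)

# Impurity-based importance scores sum to 1; sort from most to least important
order = np.argsort(rf.feature_importances_)[::-1]
for i in order:
    print(f"{data.feature_names[i]}: {rf.feature_importances_[i]:.3f}")
```

A bar chart of the sorted scores (e.g. with matplotlib) completes the visualization step.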
Recommended Tutorials
Data Preprocessing
Learn data cleaning, feature scaling, feature encoding, and data splitting techniques
Unsupervised Learning
Explore clustering, dimensionality reduction, and anomaly detection algorithms