Supervised Learning Overview

Supervised learning is a core branch of machine learning that uses labeled training data to learn a mapping from inputs to outputs. In supervised learning, each training sample consists of input features and a corresponding target label.

Supervised learning is mainly divided into two categories:

  • Classification: Predicting discrete class labels, such as spam detection, image classification, etc.
  • Regression: Predicting continuous numerical outputs, such as house price prediction, stock price prediction, etc.

Classification Algorithms

1. Logistic Regression

Logistic regression is a linear model used for binary classification problems. It applies the sigmoid function to a linear combination of the input features, mapping the result into the [0, 1] interval as the probability that a sample belongs to the positive class.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
X = iris.data[:, :2]  # Use only the first two features
y = (iris.target == 0).astype(int)  # Binary classification: is it a setosa?

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train model
model = LogisticRegression()
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Evaluate
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
```

2. Decision Tree

A decision tree is a tree-based classification algorithm that recursively partitions the dataset into subsets until each subset contains samples of a single class or a stopping criterion is met (for example, a maximum depth or a minimum number of samples per leaf).

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Load dataset
cancer = load_breast_cancer()
X = cancer.data
y = cancer.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train model
tree_model = DecisionTreeClassifier(max_depth=3, random_state=42)
tree_model.fit(X_train, y_train)

# Predict
y_pred = tree_model.predict(X_test)

# Evaluate
print(classification_report(y_test, y_pred, target_names=cancer.target_names))
```

3. Random Forest

Random forest is an ensemble learning method that trains many decision trees on random subsets of the data and features, then aggregates their predictions by majority voting (classification) or averaging (regression). This typically improves accuracy and reduces overfitting compared with a single tree.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# Load dataset
digits = load_digits()
X = digits.data
y = digits.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Predict
y_pred = rf_model.predict(X_test)

# Evaluate
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(10, 8))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()
```

4. Support Vector Machine (SVM)

Support Vector Machine is a powerful classification algorithm that finds the optimal hyperplane to maximize the margin between different classes.

```python
from sklearn.svm import SVC
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Load dataset
wine = load_wine()
X = wine.data
y = wine.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train model (scaling matters for SVMs, so use a pipeline)
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', SVC(kernel='rbf', C=1.0, gamma='scale'))
])
pipeline.fit(X_train, y_train)

# Predict
y_pred = pipeline.predict(X_test)

# Evaluate
accuracy = pipeline.score(X_test, y_test)
print(f"Accuracy: {accuracy:.2f}")
```

Regression Algorithms

1. Linear Regression

Linear regression is a linear model used for predicting continuous values. It assumes a linear relationship between input features and the target variable.

```python
from sklearn.linear_model import LinearRegression
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Load dataset (load_boston was removed from scikit-learn;
# the California Housing dataset is the standard replacement)
housing = fetch_california_housing()
X = housing.data
y = housing.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Evaluate
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MSE: {mse:.2f}")
print(f"R²: {r2:.2f}")
```

2. Ridge Regression

Ridge regression is a regularized version of linear regression that reduces overfitting by adding an L2 regularization term.

```python
from sklearn.linear_model import Ridge
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, r2_score

# Load dataset
diabetes = load_diabetes()
X = diabetes.data
y = diabetes.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train model
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('ridge', Ridge(alpha=1.0))
])
pipeline.fit(X_train, y_train)

# Predict
y_pred = pipeline.predict(X_test)

# Evaluate
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MSE: {mse:.2f}")
print(f"R²: {r2:.2f}")
```

3. Lasso Regression

Lasso regression is another regularized linear regression that uses L1 regularization and can automatically perform feature selection.

```python
from sklearn.linear_model import Lasso
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, r2_score

# Load dataset
diabetes = load_diabetes()
X = diabetes.data
y = diabetes.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train model
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('lasso', Lasso(alpha=0.1))
])
pipeline.fit(X_train, y_train)

# Predict
y_pred = pipeline.predict(X_test)

# Evaluate
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MSE: {mse:.2f}")
print(f"R²: {r2:.2f}")

# View feature coefficients (L1 regularization drives some to exactly zero)
lasso_model = pipeline.named_steps['lasso']
print("Feature coefficients:")
for i, coef in enumerate(lasso_model.coef_):
    print(f"Feature {i}: {coef:.4f}")
```

4. Decision Tree Regression

Decision tree regression is a tree-based regression algorithm that divides the feature space into multiple regions and predicts a constant value for each region.

```python
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Load dataset (California Housing replaces the removed load_boston)
housing = fetch_california_housing()
X = housing.data
y = housing.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train model
model = DecisionTreeRegressor(max_depth=5, random_state=42)
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Evaluate
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MSE: {mse:.2f}")
print(f"R²: {r2:.2f}")
```

Model Evaluation

Classification Model Evaluation Metrics

  • Accuracy: The proportion of correctly predicted samples to the total number of samples
  • Precision: The proportion of samples predicted as positive that are actually positive
  • Recall: The proportion of actually positive samples that are correctly predicted
  • F1 Score: The harmonic mean of precision and recall
  • Confusion Matrix: Shows the correspondence between model predictions and actual labels
  • ROC Curve and AUC: Evaluate model performance at different thresholds
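All of these classification metrics are available in sklearn.metrics. A minimal sketch on a hand-made toy prediction (the label vectors are arbitrary, chosen only for illustration):

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

# Toy labels: 1 = positive class; 4 true positives, 4 true negatives
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

# Accuracy: correct predictions over all predictions
print(accuracy_score(y_true, y_pred))   # 6/8 = 0.75
# Precision: of the samples predicted positive, how many are truly positive
print(precision_score(y_true, y_pred))  # 3/4 = 0.75
# Recall: of the truly positive samples, how many were found
print(recall_score(y_true, y_pred))     # 3/4 = 0.75
# F1: harmonic mean of precision and recall
print(f1_score(y_true, y_pred))         # 0.75
# Confusion matrix: rows = true class, columns = predicted class
print(confusion_matrix(y_true, y_pred))
```

For ROC curves and AUC, `sklearn.metrics.roc_curve` and `roc_auc_score` follow the same pattern but take predicted scores or probabilities rather than hard labels.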

Regression Model Evaluation Metrics

  • Mean Squared Error (MSE): The average of the squared differences between predicted and actual values
  • Root Mean Squared Error (RMSE): The square root of MSE
  • Mean Absolute Error (MAE): The average of the absolute differences between predicted and actual values
  • R² Score: The proportion of variance explained by the model
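A minimal sketch computing these regression metrics on toy values (the numbers are arbitrary):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.5, 5.0, 3.0, 6.5])

mse = mean_squared_error(y_true, y_pred)   # mean of squared errors
rmse = np.sqrt(mse)                        # same units as the target
mae = mean_absolute_error(y_true, y_pred)  # mean of absolute errors
r2 = r2_score(y_true, y_pred)              # 1 - SS_res / SS_tot

print(f"MSE:  {mse:.3f}")
print(f"RMSE: {rmse:.3f}")
print(f"MAE:  {mae:.3f}")
print(f"R²:   {r2:.3f}")
```

Note that MSE penalizes large errors more heavily than MAE, while R² is scale-free, which makes it convenient for comparing models across datasets.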

Practice Case: House Price Prediction

In this practice case, we will use the California Housing dataset to predict house prices. We will try multiple regression algorithms and compare their performance.

```python
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# Load dataset (California Housing replaces the removed Boston dataset)
housing = fetch_california_housing()
X = housing.data
y = housing.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Data standardization
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Define models
models = {
    'Linear Regression': LinearRegression(),
    'Ridge Regression': Ridge(alpha=1.0),
    'Lasso Regression': Lasso(alpha=0.1),
    'Decision Tree': DecisionTreeRegressor(max_depth=5, random_state=42),
    'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42)
}

# Train and evaluate models
results = {}
for name, model in models.items():
    # Train model
    model.fit(X_train_scaled, y_train)

    # Predict
    y_pred = model.predict(X_test_scaled)

    # Evaluate
    mse = mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_test, y_pred)

    # Cross-validation
    cv_scores = cross_val_score(model, X_train_scaled, y_train, cv=5, scoring='r2')
    cv_mean = cv_scores.mean()

    results[name] = {'MSE': mse, 'RMSE': rmse, 'R²': r2, 'CV Mean R²': cv_mean}

# Print results
print("Model Performance Comparison:")
print("-" * 80)
for name, metrics in results.items():
    print(f"{name}:")
    print(f"  MSE: {metrics['MSE']:.2f}")
    print(f"  RMSE: {metrics['RMSE']:.2f}")
    print(f"  R²: {metrics['R²']:.2f}")
    print(f"  CV Mean R²: {metrics['CV Mean R²']:.2f}")
    print("-" * 80)
```

Interactive Exercises

Exercise 1: Classification Model Comparison

Using scikit-learn's digits dataset, compare the performance of logistic regression, SVM, and random forest classification algorithms.

  1. Load the digits dataset
  2. Split into training and test sets (80% training, 20% test)
  3. Standardize the data
  4. Train the three classification models
  5. Calculate and compare accuracy, precision, recall, and F1 score
  6. Plot confusion matrices
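A possible starting point for steps 1–5, shown with macro-averaged F1 (a reasonable choice for the multi-class digits data); the other metrics and step 6's confusion matrices follow the same pattern:

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Steps 1-2: load and split (80% training, 20% test)
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Steps 3-4: standardize inside a pipeline and train each model
models = {
    'Logistic Regression': make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    'SVM': make_pipeline(StandardScaler(), SVC()),
    'Random Forest': make_pipeline(StandardScaler(), RandomForestClassifier(random_state=42)),
}

# Step 5: compare macro-averaged F1 across models
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = f1_score(y_test, model.predict(X_test), average='macro')
    print(f"{name}: F1 = {scores[name]:.3f}")
```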

Exercise 2: Regularization Parameter Tuning

Using the diabetes dataset, tune the alpha parameter of the Ridge regression model.

  1. Load the diabetes dataset
  2. Split into training and test sets
  3. Use GridSearchCV or RandomizedSearchCV to search for the optimal alpha value
  4. Evaluate the performance of the optimal model
  5. Analyze the effect of different alpha values on model performance
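One way to set up the search in steps 3–4 (the logarithmically spaced alpha grid below is an arbitrary starting choice):

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

# Steps 1-2: load and split the data
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 3: 5-fold cross-validated grid search over candidate alpha values
param_grid = {'alpha': np.logspace(-3, 3, 13)}
search = GridSearchCV(Ridge(), param_grid, cv=5, scoring='r2')
search.fit(X_train, y_train)

# Step 4: evaluate the best model on held-out data
print(f"Best alpha: {search.best_params_['alpha']:.4f}")
print(f"Best CV R²: {search.best_score_:.3f}")
print(f"Test R²: {search.score(X_test, y_test):.3f}")
```

For step 5, `search.cv_results_` holds the mean CV score for every alpha tried, which can be plotted against the grid.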

Exercise 3: Feature Importance Analysis

Using the California Housing dataset (fetch_california_housing), analyze the importance of each feature in a random forest regression model.

  1. Load the California Housing dataset
  2. Train a random forest regression model
  3. Get feature importance scores
  4. Sort features by importance
  5. Visualize feature importance
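A starting sketch for steps 2–4, shown here on the diabetes dataset so it runs fully offline; the same pattern applies to any regression dataset:

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

# Step 1-2: load data and train a random forest regressor
data = load_diabetes()
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(data.data, data.target)

# Steps 3-4: feature_importances_ sums to 1; sort descending
order = np.argsort(rf.feature_importances_)[::-1]
for i in order:
    print(f"{data.feature_names[i]:>6}: {rf.feature_importances_[i]:.3f}")
```

For step 5, the sorted importances can be passed to a bar chart (for example, `matplotlib.pyplot.bar`).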

Recommended Tutorials

Data Preprocessing

Learn data cleaning, feature scaling, feature encoding, and data splitting techniques


Unsupervised Learning

Explore clustering, dimensionality reduction, and anomaly detection algorithms


Model Evaluation

Deep dive into model evaluation metrics and model selection methods
