Supervised Learning Overview
Supervised learning is a major branch of machine learning that uses labeled training data to learn a mapping from inputs to outputs. Each training sample consists of input features and a corresponding target label.
Supervised learning is mainly divided into two categories:
- Classification: Predicting discrete class labels, such as spam detection, image classification, etc.
- Regression: Predicting continuous numerical outputs, such as house price prediction, stock price prediction, etc.
Classification Algorithms
1. Logistic Regression
Logistic regression is a linear model used for binary classification problems. It uses the sigmoid function to map the linear combination results to the [0, 1] interval, representing the probability that a sample belongs to the positive class.
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Load dataset
iris = load_iris()
X = iris.data[:, :2] # Use only the first two features
y = (iris.target == 0).astype(int) # Binary classification: Is it a setosa?
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create and train model
model = LogisticRegression()
model.fit(X_train, y_train)
# Predict
y_pred = model.predict(X_test)
# Evaluate
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
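The sigmoid mapping described above can be checked directly: for a binary problem, `predict_proba` is exactly the sigmoid applied to the model's linear decision function. A minimal sketch reusing the same iris setup (`max_iter` is raised here only to ensure convergence on this separable problem):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X = iris.data[:, :2]
y = (iris.target == 0).astype(int)

model = LogisticRegression(max_iter=1000)
model.fit(X, y)

# decision_function returns the linear combination w·x + b
z = model.decision_function(X[:5])
# Applying the sigmoid by hand reproduces the positive-class probability
manual = 1 / (1 + np.exp(-z))
print(np.allclose(manual, model.predict_proba(X[:5])[:, 1]))  # True
```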
2. Decision Tree
Decision tree is a tree-based classification algorithm that recursively partitions the dataset into subsets until each subset is (nearly) pure or a stopping criterion, such as a maximum depth, is reached.
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
# Load dataset
cancer = load_breast_cancer()
X = cancer.data
y = cancer.target
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create and train model
tree_model = DecisionTreeClassifier(max_depth=3, random_state=42)
tree_model.fit(X_train, y_train)
# Predict
y_pred = tree_model.predict(X_test)
# Evaluate
print(classification_report(y_test, y_pred, target_names=cancer.target_names))
3. Random Forest
Random forest is an ensemble learning method that trains many decision trees on random subsets of the data and features, then combines them by majority vote (classification) or averaging (regression), which improves prediction performance and reduces overfitting.
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt
# Load dataset
digits = load_digits()
X = digits.data
y = digits.target
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create and train model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
# Predict
y_pred = rf_model.predict(X_test)
# Evaluate
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(10, 8))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()
4. Support Vector Machine (SVM)
Support Vector Machine is a powerful classification algorithm that finds the optimal hyperplane to maximize the margin between different classes.
from sklearn.svm import SVC
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
# Load dataset
wine = load_wine()
X = wine.data
y = wine.target
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create and train model
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', SVC(kernel='rbf', C=1.0, gamma='scale'))
])
pipeline.fit(X_train, y_train)
# Predict
y_pred = pipeline.predict(X_test)
# Evaluate
accuracy = pipeline.score(X_test, y_test)
print(f"Accuracy: {accuracy:.2f}")
Regression Algorithms
1. Linear Regression
Linear regression is a linear model used for predicting continuous values. It assumes a linear relationship between input features and the target variable.
from sklearn.linear_model import LinearRegression
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
# Load dataset (load_boston was removed in scikit-learn 1.2; use California Housing instead)
housing = fetch_california_housing()
X = housing.data
y = housing.target
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create and train model
model = LinearRegression()
model.fit(X_train, y_train)
# Predict
y_pred = model.predict(X_test)
# Evaluate
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MSE: {mse:.2f}")
print(f"R²: {r2:.2f}")
2. Ridge Regression
Ridge regression is a regularized version of linear regression that reduces overfitting by adding an L2 regularization term.
from sklearn.linear_model import Ridge
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, r2_score
# Load dataset
diabetes = load_diabetes()
X = diabetes.data
y = diabetes.target
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create and train model
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('ridge', Ridge(alpha=1.0))
])
pipeline.fit(X_train, y_train)
# Predict
y_pred = pipeline.predict(X_test)
# Evaluate
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MSE: {mse:.2f}")
print(f"R²: {r2:.2f}")
3. Lasso Regression
Lasso regression is another regularized linear regression that uses L1 regularization and can automatically perform feature selection.
from sklearn.linear_model import Lasso
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, r2_score
# Load dataset
diabetes = load_diabetes()
X = diabetes.data
y = diabetes.target
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create and train model
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('lasso', Lasso(alpha=0.1))
])
pipeline.fit(X_train, y_train)
# Predict
y_pred = pipeline.predict(X_test)
# Evaluate
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MSE: {mse:.2f}")
print(f"R²: {r2:.2f}")
# View feature coefficients
lasso_model = pipeline.named_steps['lasso']
print("Feature coefficients:")
for i, coef in enumerate(lasso_model.coef_):
    print(f"Feature {i}: {coef:.4f}")
4. Decision Tree Regression
Decision tree regression is a tree-based regression algorithm that divides the feature space into multiple regions and predicts a constant value for each region.
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
# Load dataset (load_boston was removed in scikit-learn 1.2; use California Housing instead)
housing = fetch_california_housing()
X = housing.data
y = housing.target
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create and train model
model = DecisionTreeRegressor(max_depth=5, random_state=42)
model.fit(X_train, y_train)
# Predict
y_pred = model.predict(X_test)
# Evaluate
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MSE: {mse:.2f}")
print(f"R²: {r2:.2f}")
Model Evaluation
Classification Model Evaluation Metrics
- Accuracy: The proportion of correctly predicted samples to the total number of samples
- Precision: The proportion of samples predicted as positive that are actually positive
- Recall: The proportion of actually positive samples that are correctly predicted
- F1 Score: The harmonic mean of precision and recall
- Confusion Matrix: Shows the correspondence between model predictions and actual labels
- ROC Curve and AUC: Evaluate model performance at different thresholds
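To make these metrics concrete, here is a small sketch that computes each of them on hand-made label vectors (the values are invented purely for illustration):

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, confusion_matrix)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # 3 TP, 3 TN, 1 FP, 1 FN

print(f"Accuracy:  {accuracy_score(y_true, y_pred):.2f}")   # 6/8 = 0.75
print(f"Precision: {precision_score(y_true, y_pred):.2f}")  # 3/4 = 0.75
print(f"Recall:    {recall_score(y_true, y_pred):.2f}")     # 3/4 = 0.75
print(f"F1:        {f1_score(y_true, y_pred):.2f}")         # harmonic mean = 0.75
print(confusion_matrix(y_true, y_pred))  # rows = true, columns = predicted
```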
Regression Model Evaluation Metrics
- Mean Squared Error (MSE): The average of the squared differences between predicted and actual values
- Root Mean Squared Error (RMSE): The square root of MSE
- Mean Absolute Error (MAE): The average of the absolute differences between predicted and actual values
- R² Score: The proportion of variance explained by the model
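Each regression metric follows directly from its definition; the sketch below (with invented values) computes all four by hand, and the results can be checked against the corresponding `sklearn.metrics` functions:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])  # illustrative values
y_pred = np.array([2.5, 5.0, 3.0, 8.0])

mse = np.mean((y_true - y_pred) ** 2)           # average squared error
rmse = np.sqrt(mse)                             # same units as the target
mae = np.mean(np.abs(y_true - y_pred))          # average absolute error
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot                        # fraction of variance explained

print(f"MSE: {mse:.3f}, RMSE: {rmse:.3f}, MAE: {mae:.3f}, R²: {r2:.3f}")
# MSE: 0.375, RMSE: 0.612, MAE: 0.500, R²: 0.882
```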
Practice Case: House Price Prediction
In this practice case, we will use the California Housing dataset to predict house prices. We will try multiple regression algorithms and compare their performance.
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
# Load dataset (load_boston was removed in scikit-learn 1.2; use California Housing instead)
housing = fetch_california_housing()
X = housing.data
y = housing.target
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Data standardization
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Define models
models = {
    'Linear Regression': LinearRegression(),
    'Ridge Regression': Ridge(alpha=1.0),
    'Lasso Regression': Lasso(alpha=0.1),
    'Decision Tree': DecisionTreeRegressor(max_depth=5, random_state=42),
    'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42)
}
# Train and evaluate models
results = {}
for name, model in models.items():
    # Train model
    model.fit(X_train_scaled, y_train)
    # Predict
    y_pred = model.predict(X_test_scaled)
    # Evaluate
    mse = mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_test, y_pred)
    # Cross-validation
    cv_scores = cross_val_score(model, X_train_scaled, y_train, cv=5, scoring='r2')
    cv_mean = cv_scores.mean()
    results[name] = {
        'MSE': mse,
        'RMSE': rmse,
        'R²': r2,
        'CV Mean R²': cv_mean
    }
# Print results
print("Model Performance Comparison:")
print("-" * 80)
for name, metrics in results.items():
    print(f"{name}:")
    print(f"  MSE: {metrics['MSE']:.2f}")
    print(f"  RMSE: {metrics['RMSE']:.2f}")
    print(f"  R²: {metrics['R²']:.2f}")
    print(f"  CV Mean R²: {metrics['CV Mean R²']:.2f}")
    print("-" * 80)
Interactive Exercises
Exercise 1: Classification Model Comparison
Using scikit-learn's digits dataset, compare the performance of logistic regression, SVM, and random forest classification algorithms.
- Load the digits dataset
- Split into training and test sets (80% training, 20% test)
- Standardize the data
- Train the three classification models
- Calculate and compare accuracy, precision, recall, and F1 score
- Plot confusion matrices
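As a hedged starting point for this exercise (the model settings below are illustrative defaults, not prescribed answers), the comparison loop might look like:

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=42)

# Scaling is bundled into a pipeline where the model benefits from it
models = {
    'Logistic Regression': make_pipeline(StandardScaler(),
                                         LogisticRegression(max_iter=1000)),
    'SVM': make_pipeline(StandardScaler(), SVC()),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
}
for name, clf in models.items():
    clf.fit(X_train, y_train)
    f1 = f1_score(y_test, clf.predict(X_test), average='macro')
    print(f"{name}: macro F1 = {f1:.3f}")
```

Extending this loop with accuracy, precision, recall, and confusion-matrix plots completes the exercise.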
Exercise 2: Regularization Parameter Tuning
Using the diabetes dataset, tune the alpha parameter of the Ridge regression model.
- Load the diabetes dataset
- Split into training and test sets
- Use GridSearchCV or RandomizedSearchCV to search for the optimal alpha value
- Evaluate the performance of the optimal model
- Analyze the effect of different alpha values on model performance
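A possible skeleton for the grid-search step (the alpha grid below is an illustrative assumption, not the required one):

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

diabetes = load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(
    diabetes.data, diabetes.target, test_size=0.2, random_state=42)

# Exhaustively try each candidate alpha with 5-fold cross-validation
search = GridSearchCV(Ridge(),
                      {'alpha': [0.01, 0.1, 1.0, 10.0, 100.0]},
                      cv=5, scoring='r2')
search.fit(X_train, y_train)

print(f"Best alpha: {search.best_params_['alpha']}")
print(f"Test R²: {search.best_estimator_.score(X_test, y_test):.2f}")
```

Inspecting `search.cv_results_` shows how the cross-validated R² varies across the alpha grid.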
Exercise 3: Feature Importance Analysis
Using the California housing dataset (load_boston was removed from scikit-learn, so use fetch_california_housing), analyze the importance of each feature in a random forest regression model.
- Load the California housing dataset
- Train a random forest regression model
- Get feature importance scores
- Sort features by importance
- Visualize feature importance
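The importance-extraction steps can be sketched as follows; the bundled diabetes dataset is used here only so the snippet runs without a download, and you would swap in the housing data for the exercise itself:

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

data = load_diabetes()
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(data.data, data.target)

# Impurity-based importance scores sum to 1; sort from most to least important
order = np.argsort(rf.feature_importances_)[::-1]
for i in order:
    print(f"{data.feature_names[i]}: {rf.feature_importances_[i]:.3f}")
```

A bar chart of the sorted scores (e.g. with matplotlib) completes the visualization step.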
Recommended Tutorials
Data Preprocessing
Learn data cleaning, feature scaling, feature encoding, and data splitting techniques
Unsupervised Learning
Explore clustering, dimensionality reduction, and anomaly detection algorithms