1. Introduction to Data Preprocessing
Data preprocessing is a critical step in the machine learning workflow. It involves transforming raw data into a clean, structured format that machine learning algorithms can use effectively. Proper data preprocessing can significantly improve model performance and accuracy.
1.1 Why Data Preprocessing is Important
- Improves Model Performance: Clean, well-structured data allows models to learn patterns more effectively.
- Handles Missing Values: Missing data can cause models to fail or produce inaccurate results.
- Normalizes Data: Features on different scales can bias models toward features with larger values.
- Reduces Noise: Removes irrelevant or redundant information from the data.
- Enhances Interpretability: Well-processed data makes model decisions more understandable.
1.2 Common Data Preprocessing Steps
- Data Cleaning: Handling missing values, removing duplicates, correcting errors
- Feature Scaling: Normalizing or standardizing features
- Feature Encoding: Converting categorical data to numerical format
- Feature Selection: Identifying and using only relevant features
- Feature Transformation: Creating new features from existing ones
- Data Splitting: Dividing data into training, validation, and test sets
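The steps in this list can be chained together with a single scikit-learn pipeline. A minimal sketch on a small hypothetical frame (the column names and values are invented purely for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical raw data combining several of the issues listed above
df = pd.DataFrame({
    'age': [25.0, np.nan, 30.0, 35.0, 40.0, 45.0, 50.0, 50.0],
    'city': ['NY', 'LA', 'NY', np.nan, 'SF', 'LA', 'SF', 'SF'],
    'label': [0, 1, 0, 1, 1, 0, 1, 1],
}).drop_duplicates()  # data cleaning: drop the repeated last row

X, y = df[['age', 'city']], df['label']

# Imputation + scaling for the numeric column; imputation + encoding for the categorical one
preprocess = ColumnTransformer([
    ('num', Pipeline([('impute', SimpleImputer(strategy='median')),
                      ('scale', StandardScaler())]), ['age']),
    ('cat', Pipeline([('impute', SimpleImputer(strategy='most_frequent')),
                      ('encode', OneHotEncoder(handle_unknown='ignore'))]), ['city']),
])

# Data splitting first, then fit the preprocessor on the training portion only
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
X_train_t = preprocess.fit_transform(X_train)
X_test_t = preprocess.transform(X_test)
print(X_train_t.shape, X_test_t.shape)
```

Each of these steps is covered in detail in the sections below.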
2. Data Cleaning
Data cleaning is the process of identifying and correcting errors, inconsistencies, and missing values in the dataset. It is one of the most time-consuming but essential steps in data preprocessing.
2.1 Handling Missing Values
Missing values can occur for various reasons, such as data entry errors, sensor failures, or incomplete surveys. There are several strategies to handle missing values:
2.1.1 Deletion
Remove rows or columns with missing values. This is only recommended when the missing data is minimal and randomly distributed.
2.1.2 Imputation
Replace missing values with estimated values based on the available data.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
# Create sample data with missing values
data = pd.DataFrame({
    'age': [25, np.nan, 30, 35, np.nan, 40],
    'income': [50000, 60000, np.nan, 80000, 90000, 100000],
    'score': [85, 90, 75, np.nan, 80, 95]
})
print("Original Data:")
print(data)
# Check for missing values
print("\nMissing Values:")
print(data.isnull().sum())
# 1. Deletion
# Drop rows with any missing values
data_dropna = data.dropna()
print("\nAfter Dropping Rows:")
print(data_dropna)
# Drop columns with any missing values
data_dropcols = data.dropna(axis=1)
print("\nAfter Dropping Columns:")
print(data_dropcols)
# 2. Imputation
# Replace with mean
imputer_mean = SimpleImputer(strategy='mean')
data_imputed_mean = pd.DataFrame(imputer_mean.fit_transform(data), columns=data.columns)
print("\nAfter Mean Imputation:")
print(data_imputed_mean)
# Replace with median
imputer_median = SimpleImputer(strategy='median')
data_imputed_median = pd.DataFrame(imputer_median.fit_transform(data), columns=data.columns)
print("\nAfter Median Imputation:")
print(data_imputed_median)
# Replace with most frequent value
imputer_mode = SimpleImputer(strategy='most_frequent')
data_imputed_mode = pd.DataFrame(imputer_mode.fit_transform(data), columns=data.columns)
print("\nAfter Mode Imputation:")
print(data_imputed_mode)
# Replace with constant value
imputer_constant = SimpleImputer(strategy='constant', fill_value=0)
data_imputed_constant = pd.DataFrame(imputer_constant.fit_transform(data), columns=data.columns)
print("\nAfter Constant Imputation:")
print(data_imputed_constant)
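SimpleImputer fills each column independently. Multivariate imputers instead estimate a missing entry from similar rows; a brief sketch with scikit-learn's KNNImputer on the same kind of sample frame (n_neighbors=2 is an arbitrary choice for this tiny dataset):

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

data = pd.DataFrame({
    'age': [25, np.nan, 30, 35, np.nan, 40],
    'income': [50000, 60000, np.nan, 80000, 90000, 100000],
    'score': [85, 90, 75, np.nan, 80, 95]
})

# Each missing value is replaced by the mean of that column over the
# k nearest rows, with distances measured on the non-missing columns
imputer_knn = KNNImputer(n_neighbors=2)
data_imputed_knn = pd.DataFrame(imputer_knn.fit_transform(data), columns=data.columns)
print("\nAfter KNN Imputation:")
print(data_imputed_knn)
```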
2.2 Removing Duplicates
Duplicate rows can skew model training by giving more weight to repeated information. It's important to identify and remove duplicate entries.
import pandas as pd
# Create sample data with duplicates
data = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob'],
    'age': [25, 30, 35, 25, 30],
    'score': [85, 90, 75, 85, 90]
})
print("Original Data:")
print(data)
# Check for duplicates
print("\nDuplicate Rows:")
print(data.duplicated())
# Count duplicates
print("\nNumber of Duplicate Rows:")
print(data.duplicated().sum())
# Remove duplicates
data_no_duplicates = data.drop_duplicates()
print("\nAfter Removing Duplicates:")
print(data_no_duplicates)
# Remove duplicates based on specific columns
data_no_duplicates_name = data.drop_duplicates(subset=['name'])
print("\nAfter Removing Duplicates by Name:")
print(data_no_duplicates_name)
2.3 Handling Outliers
Outliers are data points that deviate significantly from the rest of the dataset. They can be caused by measurement errors or genuine anomalies. Handling outliers appropriately is important for model performance.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Create sample data with outliers
data = pd.DataFrame({
    'value': [10, 12, 14, 15, 16, 18, 20, 22, 24, 100]  # 100 is an outlier
})
print("Original Data:")
print(data)
# Visualize with box plot
plt.figure(figsize=(10, 6))
plt.boxplot(data['value'])
plt.title('Box Plot to Identify Outliers')
plt.ylabel('Value')
plt.show()
# 1. IQR Method
Q1 = data['value'].quantile(0.25)
Q3 = data['value'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
print(f"\nQ1: {Q1}")
print(f"Q3: {Q3}")
print(f"IQR: {IQR}")
print(f"Lower Bound: {lower_bound}")
print(f"Upper Bound: {upper_bound}")
# Identify outliers
outliers = data[(data['value'] < lower_bound) | (data['value'] > upper_bound)]
print("\nOutliers:")
print(outliers)
# Remove outliers
data_no_outliers = data[(data['value'] >= lower_bound) & (data['value'] <= upper_bound)]
print("\nAfter Removing Outliers:")
print(data_no_outliers)
# 2. Z-Score Method
from scipy import stats
z_scores = np.abs(stats.zscore(data['value']))
threshold = 2.5  # with only 10 points the largest possible z-score is sqrt(n-1) = 3, so a cutoff of 3 would flag nothing here
print("\nZ-Scores:")
print(z_scores)
# Identify outliers
outliers_z = data[z_scores > threshold]
print("\nOutliers (Z-Score > 2.5):")
print(outliers_z)
# Remove outliers
data_no_outliers_z = data[z_scores <= threshold]
print("\nAfter Removing Outliers (Z-Score):")
print(data_no_outliers_z)
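Removal is not the only option. Outliers can also be capped (winsorized) at the IQR fences, which keeps the row in the dataset while limiting its influence; a sketch on the same sample column:

```python
import pandas as pd

data = pd.DataFrame({'value': [10, 12, 14, 15, 16, 18, 20, 22, 24, 100]})

Q1, Q3 = data['value'].quantile(0.25), data['value'].quantile(0.75)
IQR = Q3 - Q1
lower_bound, upper_bound = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR

# Clip values outside the IQR fences instead of dropping the rows
data['value_capped'] = data['value'].clip(lower=lower_bound, upper=upper_bound)
print(data)
```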
3. Feature Scaling
Feature scaling is the process of normalizing or standardizing features to ensure they are on a similar scale. This is important because many machine learning algorithms are sensitive to the scale of input features.
3.1 Types of Feature Scaling
3.1.1 Standardization (Z-Score Normalization)
Transforms features to have a mean of 0 and a standard deviation of 1. This is useful when features follow a Gaussian distribution.
3.1.2 Min-Max Scaling (Normalization)
Scales features to a fixed range, typically [0, 1] or [-1, 1]. This is useful when features have bounded ranges.
3.1.3 Robust Scaling
Uses median and IQR to scale features, making it robust to outliers.
3.1.4 MaxAbs Scaling
Scales features by their maximum absolute value, preserving the sign of the data.
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler, MaxAbsScaler
# Create sample data
np.random.seed(42)
data = pd.DataFrame({
    'feature1': np.random.normal(100, 10, 10),  # Mean=100, Std=10
    'feature2': np.random.uniform(0, 100, 10),  # Range [0, 100]
    'feature3': np.random.exponential(5, 10)    # Exponential distribution
})
print("Original Data:")
print(data)
print("\nOriginal Statistics:")
print(data.describe())
# 1. Standardization
scaler_standard = StandardScaler()
data_standard = pd.DataFrame(scaler_standard.fit_transform(data), columns=data.columns)
print("\nAfter Standardization:")
print(data_standard)
print("\nStatistics After Standardization:")
print(data_standard.describe())
# 2. Min-Max Scaling
scaler_minmax = MinMaxScaler()
data_minmax = pd.DataFrame(scaler_minmax.fit_transform(data), columns=data.columns)
print("\nAfter Min-Max Scaling:")
print(data_minmax)
print("\nStatistics After Min-Max Scaling:")
print(data_minmax.describe())
# 3. Robust Scaling
scaler_robust = RobustScaler()
data_robust = pd.DataFrame(scaler_robust.fit_transform(data), columns=data.columns)
print("\nAfter Robust Scaling:")
print(data_robust)
print("\nStatistics After Robust Scaling:")
print(data_robust.describe())
# 4. MaxAbs Scaling
scaler_maxabs = MaxAbsScaler()
data_maxabs = pd.DataFrame(scaler_maxabs.fit_transform(data), columns=data.columns)
print("\nAfter MaxAbs Scaling:")
print(data_maxabs)
print("\nStatistics After MaxAbs Scaling:")
print(data_maxabs.describe())
4. Feature Encoding
Feature encoding is the process of converting categorical data into a numerical format that machine learning algorithms can use. Most algorithms require numerical inputs, so categorical features must be properly encoded.
4.1 Types of Categorical Data
- Nominal: Categories with no inherent order (e.g., colors, countries)
- Ordinal: Categories with a specific order (e.g., ratings, education levels)
4.2 Encoding Techniques
4.2.1 One-Hot Encoding
Creates a binary column for each category. Suitable for nominal data with a small number of categories.
4.2.2 Label Encoding
Assigns a unique integer to each category. Suitable for ordinal data.
4.2.3 Ordinal Encoding
Similar to label encoding but allows specifying the order of categories explicitly.
4.2.4 Binary Encoding
Converts categories to binary code, reducing dimensionality compared to one-hot encoding.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, OrdinalEncoder
import category_encoders as ce  # third-party package: pip install category_encoders
# Create sample data
data = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'red', 'green'],
    'size': ['S', 'M', 'L', 'M', 'XL'],
    'rating': ['poor', 'good', 'excellent', 'good', 'excellent']
})
print("Original Data:")
print(data)
# 1. One-Hot Encoding
print("\n1. One-Hot Encoding:")
data_onehot = pd.get_dummies(data, columns=['color'])
print(data_onehot)
# Using scikit-learn OneHotEncoder
encoder_onehot = OneHotEncoder(sparse_output=False)
color_encoded = encoder_onehot.fit_transform(data[['color']])
color_df = pd.DataFrame(color_encoded, columns=encoder_onehot.get_feature_names_out(['color']))
data_onehot_sklearn = pd.concat([data, color_df], axis=1).drop('color', axis=1)
print("\nOne-Hot Encoding with scikit-learn:")
print(data_onehot_sklearn)
# 2. Label Encoding
print("\n2. Label Encoding:")
label_encoder = LabelEncoder()
data_label = data.copy()
data_label['color_encoded'] = label_encoder.fit_transform(data['color'])
print(data_label)
# 3. Ordinal Encoding
print("\n3. Ordinal Encoding:")
# Define the order for size and rating
size_order = ['S', 'M', 'L', 'XL']
rating_order = ['poor', 'good', 'excellent']
ordinal_encoder = OrdinalEncoder(categories=[size_order, rating_order])
data_ordinal = data.copy()
data_ordinal[['size_encoded', 'rating_encoded']] = ordinal_encoder.fit_transform(data[['size', 'rating']])
print(data_ordinal)
# 4. Binary Encoding
print("\n4. Binary Encoding:")
binary_encoder = ce.BinaryEncoder(cols=['color'])
data_binary = binary_encoder.fit_transform(data)
print(data_binary)
5. Feature Selection and Transformation
Feature selection involves identifying and using only the most relevant features for model training. Feature transformation involves creating new features from existing ones to improve model performance.
5.1 Feature Selection Methods
5.1.1 Filter Methods
Evaluate features based on statistical tests or correlation with the target variable.
5.1.2 Wrapper Methods
Evaluate feature subsets by training models and measuring performance.
5.1.3 Embedded Methods
Select features during model training (e.g., LASSO regularization).
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, chi2, RFE, SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
# Load sample data
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)
print(f"Original number of features: {X.shape[1]}")
print(f"Feature names: {X.columns.tolist()}")
# 1. Filter Method: SelectKBest
print("\n1. Filter Method: SelectKBest")
selector_kbest = SelectKBest(chi2, k=10)
X_kbest = selector_kbest.fit_transform(X, y)
selected_features_kbest = X.columns[selector_kbest.get_support()]
print(f"Selected features: {selected_features_kbest.tolist()}")
# 2. Wrapper Method: Recursive Feature Elimination (RFE)
print("\n2. Wrapper Method: RFE")
model = LogisticRegression(max_iter=10000)
rfe = RFE(model, n_features_to_select=10)
X_rfe = rfe.fit_transform(X, y)
selected_features_rfe = X.columns[rfe.get_support()]
print(f"Selected features: {selected_features_rfe.tolist()}")
# 3. Embedded Method: Feature Importance from Random Forest
print("\n3. Embedded Method: Feature Importance")
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)
# Get feature importances
feature_importances = pd.Series(rf.feature_importances_, index=X.columns)
feature_importances = feature_importances.sort_values(ascending=False)
print("Feature Importances:")
print(feature_importances.head(10))
# Select top 10 features
selected_features_rf = feature_importances.head(10).index.tolist()
print(f"\nSelected features: {selected_features_rf}")
# Using SelectFromModel
selector_rf = SelectFromModel(rf, max_features=10, threshold=-np.inf)
X_rf = selector_rf.fit_transform(X, y)
selected_features_sfm = X.columns[selector_rf.get_support()]
print(f"\nSelected features with SelectFromModel: {selected_features_sfm.tolist()}")
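The LASSO-style regularization mentioned under embedded methods (5.1.3) can also drive selection directly: an L1 penalty pushes some coefficients exactly to zero, and SelectFromModel keeps only the features with nonzero weight. A sketch on the same dataset (the C value is an arbitrary choice for illustration):

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# L1-penalized logistic regression; scale first so the penalty
# treats all features comparably
X_scaled = StandardScaler().fit_transform(X)
lasso_like = LogisticRegression(penalty='l1', solver='liblinear', C=0.1)
selector_l1 = SelectFromModel(lasso_like)
selector_l1.fit(X_scaled, y)
selected_features_l1 = X.columns[selector_l1.get_support()]
print(f"Selected features: {selected_features_l1.tolist()}")
```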
5.2 Feature Transformation
Feature transformation involves creating new features from existing ones to capture complex relationships in the data.
import numpy as np
import pandas as pd
# Create sample data
data = pd.DataFrame({
    'age': [25, 30, 35, 40, 45],
    'income': [50000, 60000, 70000, 80000, 90000],
    'expenditure': [30000, 35000, 40000, 45000, 50000]
})
print("Original Data:")
print(data)
# 1. Polynomial Features
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, include_bias=False)
poly_features = poly.fit_transform(data)
poly_feature_names = poly.get_feature_names_out(data.columns)
data_poly = pd.DataFrame(poly_features, columns=poly_feature_names)
print("\nAfter Polynomial Features:")
print(data_poly)
# 2. Interaction Features
# Create interaction between age and income
data['age_income_interaction'] = data['age'] * data['income']
# Create ratio feature
data['savings_ratio'] = (data['income'] - data['expenditure']) / data['income']
print("\nAfter Creating Custom Features:")
print(data)
# 3. Log Transformation
# Apply log transformation to income (useful for skewed data)
data['log_income'] = np.log(data['income'])
print("\nAfter Log Transformation:")
print(data[['income', 'log_income']])
# 4. Binning
# Create age bins
data['age_bin'] = pd.cut(data['age'], bins=[20, 30, 40, 50], labels=['young', 'middle', 'old'])
print("\nAfter Binning:")
print(data[['age', 'age_bin']])
# 5. One-Hot Encode Binned Feature
data_encoded = pd.get_dummies(data, columns=['age_bin'])
print("\nAfter One-Hot Encoding Binned Feature:")
print(data_encoded)
6. Data Splitting
Data splitting is the process of dividing the dataset into separate subsets for training, validation, and testing. This is essential for evaluating model performance and preventing overfitting.
6.1 Common Splitting Strategies
6.1.1 Train-Test Split
Divide data into training (typically 70-80%) and testing (20-30%) sets.
6.1.2 Train-Validation-Test Split
Divide data into three sets: training, validation, and testing.
6.1.3 Cross-Validation
Use k-fold cross-validation to evaluate model performance more robustly.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, KFold, StratifiedKFold
from sklearn.datasets import load_iris
# Load sample data
data = load_iris()
X, y = data.data, data.target
print(f"Dataset size: {X.shape}")
# 1. Train-Test Split
print("\n1. Train-Test Split")
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"Train size: {X_train.shape}, Test size: {X_test.shape}")
print(f"Train class distribution: {np.bincount(y_train)}")
print(f"Test class distribution: {np.bincount(y_test)}")
# 2. Train-Validation-Test Split
print("\n2. Train-Validation-Test Split")
# First split into train and temp
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.4, random_state=42, stratify=y
)
# Then split temp into validation and test
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp
)
print(f"Train size: {X_train.shape}, Validation size: {X_val.shape}, Test size: {X_test.shape}")
# 3. K-Fold Cross-Validation
print("\n3. K-Fold Cross-Validation")
kf = KFold(n_splits=5, shuffle=True, random_state=42)
fold = 1
for train_idx, test_idx in kf.split(X):
    X_train_kf, X_test_kf = X[train_idx], X[test_idx]
    y_train_kf, y_test_kf = y[train_idx], y[test_idx]
    print(f"Fold {fold}: Train size={X_train_kf.shape}, Test size={X_test_kf.shape}")
    fold += 1
# 4. Stratified K-Fold Cross-Validation
print("\n4. Stratified K-Fold Cross-Validation")
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold = 1
for train_idx, test_idx in skf.split(X, y):
    X_train_skf, X_test_skf = X[train_idx], X[test_idx]
    y_train_skf, y_test_skf = y[train_idx], y[test_idx]
    print(f"Fold {fold}: Train size={X_train_skf.shape}, Test size={X_test_skf.shape}")
    print(f"  Train class distribution: {np.bincount(y_train_skf)}")
    print(f"  Test class distribution: {np.bincount(y_test_skf)}")
    fold += 1
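In practice the fold loops above are rarely written by hand; cross_val_score runs the full train/evaluate cycle for each fold. A short sketch on the same Iris data (the logistic regression model is just a placeholder estimator):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Wrapping scaling and the model in one pipeline re-fits the scaler inside
# each training fold, so no test-fold information leaks into the scaling
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv)
print(f"Fold accuracies: {scores}")
print(f"Mean accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
```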
7. Practice Case
7.1 Case Objective
Preprocess the Titanic dataset for a classification task to predict passenger survival. This will help you practice the complete data preprocessing workflow, from cleaning to feature engineering.
7.2 Implementation Steps
7.2.1 Data Loading and Exploration
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_openml
# Load Titanic dataset
titanic = fetch_openml('titanic', version=1, as_frame=True)
df = titanic.frame
print("Dataset shape:", df.shape)
print("\nFirst 5 rows:")
print(df.head())
print("\nColumns:", df.columns.tolist())
print("\nData types:")
print(df.dtypes)
# Check for missing values
print("\nMissing values:")
print(df.isnull().sum())
# Check target variable distribution
print("\nSurvival distribution:")
print(df['survived'].value_counts())
print("\nSurvival percentage:")
print(df['survived'].value_counts(normalize=True))
# Visualize missing values
plt.figure(figsize=(10, 6))
sns.heatmap(df.isnull(), cbar=False, cmap='viridis')
plt.title('Missing Values Heatmap')
plt.show()
7.2.2 Data Cleaning
# Handle missing values
# 1. Age: Impute with median
median_age = df['age'].median()
df['age'] = df['age'].fillna(median_age)
# 2. Fare: Impute with median
median_fare = df['fare'].median()
df['fare'] = df['fare'].fillna(median_fare)
# 3. Embarked: Impute with most frequent value
most_frequent_embarked = df['embarked'].mode()[0]
df['embarked'] = df['embarked'].fillna(most_frequent_embarked)
# 4. Cabin: Create a binary feature indicating if cabin is known
df['has_cabin'] = df['cabin'].notnull().astype(int)
# Drop cabin column as it has too many missing values
df = df.drop('cabin', axis=1)
# Check missing values after cleaning
print("\nMissing values after cleaning:")
print(df.isnull().sum())
# Remove duplicate rows
print("\nNumber of duplicate rows:", df.duplicated().sum())
df = df.drop_duplicates()
print("Dataset shape after removing duplicates:", df.shape)
# Inspect categorical variables
print("\nCategorical variables:")
categorical_cols = df.select_dtypes(include=['object', 'category']).columns.tolist()
print(categorical_cols)
# Check unique values for each categorical column
for col in categorical_cols:
    print(f"\n{col} unique values:")
    print(df[col].unique())
    print(f"Number of unique values: {df[col].nunique()}")
7.2.3 Feature Engineering
# Create new features
# 1. Family size: siblings/spouses + parents/children + 1 (self)
df['family_size'] = df['sibsp'] + df['parch'] + 1
# 2. Is alone: 1 if family_size == 1, 0 otherwise
df['is_alone'] = (df['family_size'] == 1).astype(int)
# 3. Fare per person: fare divided by family size
df['fare_per_person'] = df['fare'] / df['family_size']
# 4. Age group: Create age bins
df['age_group'] = pd.cut(df['age'], bins=[0, 12, 18, 35, 60, 100], labels=['child', 'teenager', 'adult', 'middle_aged', 'senior'])
# 5. Title: Extract title from name
import re
def extract_title(name):
    # Raw string so the escaped dot is a literal '.' in the pattern
    title_search = re.search(r' ([A-Za-z]+)\.', name)
    if title_search:
        return title_search.group(1)
    return ""
df['title'] = df['name'].apply(extract_title)
# Group rare titles
print("\nTitle distribution:")
print(df['title'].value_counts())
# Replace rare titles with 'Other'
rare_titles = ['Lady', 'Countess','Capt', 'Col', 'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona']
df['title'] = df['title'].replace(rare_titles, 'Other')
# Check title distribution after grouping
print("\nTitle distribution after grouping:")
print(df['title'].value_counts())
# Drop unnecessary columns
df = df.drop(['name', 'ticket', 'boat', 'body', 'home.dest'], axis=1)
print("\nDataset shape after feature engineering:")
print(df.shape)
print("\nFinal columns:", df.columns.tolist())
7.2.4 Feature Encoding and Scaling
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
# Separate features and target
target = df['survived'].astype(int)
features = df.drop('survived', axis=1)
# Define categorical and numerical columns
categorical_cols = ['sex', 'embarked', 'age_group', 'title']
numerical_cols = ['pclass', 'age', 'sibsp', 'parch', 'fare', 'has_cabin', 'family_size', 'is_alone', 'fare_per_person']
# Create preprocessing pipeline
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_cols),
        ('cat', OneHotEncoder(drop='first'), categorical_cols)
    ])
# Fit and transform the data
X_preprocessed = preprocessor.fit_transform(features)
y_preprocessed = target.values
print(f"\nPreprocessed features shape: {X_preprocessed.shape}")
print(f"Target shape: {y_preprocessed.shape}")
# Split into train and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X_preprocessed, y_preprocessed, test_size=0.2, random_state=42, stratify=y_preprocessed
)
print(f"\nTrain size: {X_train.shape}")
print(f"Test size: {X_test.shape}")
print(f"Train class distribution: {np.bincount(y_train)}")
print(f"Test class distribution: {np.bincount(y_test)}")
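One caveat with the cell above: the preprocessor is fitted on the full dataset before splitting, so the scaler's mean and variance absorb test-set statistics (a mild form of leakage). A leakage-free variant splits first and fits only on the training rows, sketched here on a small synthetic stand-in rather than the Titanic frame:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical stand-in for the Titanic features (small, synthetic)
rng = np.random.default_rng(42)
features = pd.DataFrame({
    'fare': rng.uniform(5, 100, 50),
    'age': rng.uniform(1, 80, 50),
    'sex': rng.choice(['male', 'female'], 50),
})
target = rng.integers(0, 2, 50)

preprocessor = ColumnTransformer([
    ('num', StandardScaler(), ['fare', 'age']),
    ('cat', OneHotEncoder(drop='first'), ['sex']),
])

# Split the raw rows first, then fit the preprocessor on the training rows only,
# so test-set statistics never influence the scaler
X_train_raw, X_test_raw, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=42, stratify=target
)
X_train = preprocessor.fit_transform(X_train_raw)
X_test = preprocessor.transform(X_test_raw)
print(X_train.shape, X_test.shape)
```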
7.2.5 Model Training and Evaluation
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
# 1. Logistic Regression
print("\n1. Logistic Regression")
lr = LogisticRegression(random_state=42, max_iter=1000)
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)
accuracy_lr = accuracy_score(y_test, y_pred_lr)
print(f"Accuracy: {accuracy_lr:.2f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred_lr))
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred_lr))
# 2. Random Forest
print("\n2. Random Forest")
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
accuracy_rf = accuracy_score(y_test, y_pred_rf)
print(f"Accuracy: {accuracy_rf:.2f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred_rf))
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred_rf))
# Compare models
print("\nModel Comparison:")
print(f"Logistic Regression Accuracy: {accuracy_lr:.2f}")
print(f"Random Forest Accuracy: {accuracy_rf:.2f}")
# Feature importance from Random Forest
if hasattr(rf, 'feature_importances_'):
    # Get feature names after one-hot encoding
    cat_feature_names = preprocessor.named_transformers_['cat'].get_feature_names_out(categorical_cols)
    all_feature_names = numerical_cols + cat_feature_names.tolist()
    # Get feature importances
    feature_importances = pd.Series(rf.feature_importances_, index=all_feature_names)
    feature_importances = feature_importances.sort_values(ascending=False)
    print("\nTop 10 Important Features:")
    print(feature_importances.head(10))
    # Visualize feature importances
    plt.figure(figsize=(12, 8))
    feature_importances.head(15).plot(kind='barh')
    plt.title('Top 15 Important Features')
    plt.xlabel('Importance')
    plt.ylabel('Feature')
    plt.tight_layout()
    plt.show()
8. Interactive Exercises
Exercise 1: Data Cleaning
Using the following dataset, complete the data cleaning tasks:
import pandas as pd
import numpy as np
# Create sample data
np.random.seed(42)
data = pd.DataFrame({
    'id': range(1, 11),
    'name': ['Alice', 'Bob', 'Charlie', 'Alice', 'David', 'Eve', 'Frank', 'Grace', 'Heidi', 'Ivan'],
    'age': [25, np.nan, 30, 25, np.nan, 35, 40, 45, 50, 55],
    'salary': [50000, 60000, np.nan, 50000, 70000, 80000, 90000, 100000, 110000, 120000],
    'department': ['HR', 'IT', 'Finance', 'HR', 'IT', 'Finance', 'HR', 'IT', 'Finance', 'IT']
})
- Identify missing values in the dataset.
- Handle missing values in the 'age' column using median imputation.
- Handle missing values in the 'salary' column using mean imputation.
- Remove duplicate rows from the dataset.
- Verify that there are no more missing values after cleaning.
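One possible solution sketch (note that the two Alice rows differ in 'id', so duplicates must be checked on the non-id columns):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    'id': range(1, 11),
    'name': ['Alice', 'Bob', 'Charlie', 'Alice', 'David', 'Eve', 'Frank', 'Grace', 'Heidi', 'Ivan'],
    'age': [25, np.nan, 30, 25, np.nan, 35, 40, 45, 50, 55],
    'salary': [50000, 60000, np.nan, 50000, 70000, 80000, 90000, 100000, 110000, 120000],
    'department': ['HR', 'IT', 'Finance', 'HR', 'IT', 'Finance', 'HR', 'IT', 'Finance', 'IT']
})

print(data.isnull().sum())                                     # 1. identify missing values
data['age'] = data['age'].fillna(data['age'].median())         # 2. median imputation
data['salary'] = data['salary'].fillna(data['salary'].mean())  # 3. mean imputation
# 4. remove duplicates, ignoring the 'id' column
data = data.drop_duplicates(subset=['name', 'age', 'salary', 'department'])
print(data.isnull().sum().sum())                               # 5. should print 0
```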
Exercise 2: Feature Scaling
Using the following dataset, apply different feature scaling techniques:
import pandas as pd
import numpy as np
# Create sample data
np.random.seed(42)
data = pd.DataFrame({
    'feature1': np.random.normal(100, 10, 20),
    'feature2': np.random.uniform(0, 1, 20),
    'feature3': np.random.exponential(5, 20)
})
- Apply standardization to all features.
- Apply Min-Max scaling to all features.
- Compare the mean, standard deviation, minimum, and maximum values before and after scaling.
- Which scaling technique would you choose for this dataset and why?
Exercise 3: Feature Encoding
Using the following dataset, apply different feature encoding techniques:
import pandas as pd
# Create sample data
data = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'red', 'blue', 'green'],
    'size': ['S', 'M', 'L', 'M', 'L', 'XL'],
    'material': ['cotton', 'wool', 'silk', 'cotton', 'wool', 'silk'],
    'price': [10, 20, 30, 15, 25, 35]
})
- Apply one-hot encoding to the 'color' column.
- Apply ordinal encoding to the 'size' column (define the order as S < M < L < XL).
- Apply label encoding to the 'material' column.
- Combine the encoded features with the 'price' column to create a final dataset.
Exercise 4: Data Preprocessing Pipeline
Create a complete data preprocessing pipeline for the Iris dataset:
- Load the Iris dataset from scikit-learn.
- Split the dataset into training (80%) and testing (20%) sets.
- Create a preprocessing pipeline that includes:
- Standardization of numerical features
- No encoding needed for this dataset (all features are numerical)
- Train a Logistic Regression model on the preprocessed training data.
- Evaluate the model on the preprocessed test data using accuracy score.
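One possible solution sketch for this exercise (not the only valid approach):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Load and split (80/20, stratified to preserve class balance)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Standardization + model in one pipeline; the scaler is fit on training data only
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipeline.fit(X_train, y_train)
accuracy = accuracy_score(y_test, pipeline.predict(X_test))
print(f"Test accuracy: {accuracy:.2f}")
```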