1. Introduction to Data Preprocessing
Data preprocessing is a critical step in the machine learning workflow. It involves transforming raw data into a clean, structured format that machine learning algorithms can use effectively. Proper data preprocessing can significantly improve model performance and accuracy.
1.1 Why Data Preprocessing is Important
- Improves Model Performance: Clean, well-structured data allows models to learn patterns more effectively.
- Handles Missing Values: Missing data can cause models to fail or produce inaccurate results.
- Normalizes Data: Features on different scales can bias models toward features with larger values.
- Reduces Noise: Removes irrelevant or redundant information from the data.
- Enhances Interpretability: Well-processed data makes model decisions more understandable.
1.2 Common Data Preprocessing Steps
- Data Cleaning: Handling missing values, removing duplicates, correcting errors
- Feature Scaling: Normalizing or standardizing features
- Feature Encoding: Converting categorical data to numerical format
- Feature Selection: Identifying and using only relevant features
- Feature Transformation: Creating new features from existing ones
- Data Splitting: Dividing data into training, validation, and test sets
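The steps in this list can be chained together with a single scikit-learn pipeline. A minimal sketch on a small hypothetical frame (the column names and values are invented purely for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical raw data combining several of the issues listed above
df = pd.DataFrame({
    'age': [25.0, np.nan, 30.0, 35.0, 40.0, 45.0, 50.0, 50.0],
    'city': ['NY', 'LA', 'NY', np.nan, 'SF', 'LA', 'SF', 'SF'],
    'label': [0, 1, 0, 1, 1, 0, 1, 1],
}).drop_duplicates()  # data cleaning: drop the repeated last row

X, y = df[['age', 'city']], df['label']

# Imputation + scaling for the numeric column; imputation + encoding for the categorical one
preprocess = ColumnTransformer([
    ('num', Pipeline([('impute', SimpleImputer(strategy='median')),
                      ('scale', StandardScaler())]), ['age']),
    ('cat', Pipeline([('impute', SimpleImputer(strategy='most_frequent')),
                      ('encode', OneHotEncoder(handle_unknown='ignore'))]), ['city']),
])

# Data splitting first, then fit the preprocessor on the training portion only
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
X_train_t = preprocess.fit_transform(X_train)
X_test_t = preprocess.transform(X_test)
print(X_train_t.shape, X_test_t.shape)
```

Each of these steps is covered in detail in the sections below.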
2. Data Cleaning
Data cleaning is the process of identifying and correcting errors, inconsistencies, and missing values in the dataset. It is one of the most time-consuming but essential steps in data preprocessing.
2.1 Handling Missing Values
Missing values can occur for various reasons, such as data entry errors, sensor failures, or incomplete surveys. There are several strategies to handle missing values:
2.1.1 Deletion
Remove rows or columns with missing values. This is only recommended when the missing data is minimal and randomly distributed.
2.1.2 Imputation
Replace missing values with estimated values based on the available data.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
# Create sample data with missing values
data = pd.DataFrame({
    'age': [25, np.nan, 30, 35, np.nan, 40],
    'income': [50000, 60000, np.nan, 80000, 90000, 100000],
    'score': [85, 90, 75, np.nan, 80, 95]
})
print("Original Data:")
print(data)
# Check for missing values
print("\nMissing Values:")
print(data.isnull().sum())
# 1. Deletion
# Drop rows with any missing values
data_dropna = data.dropna()
print("\nAfter Dropping Rows:")
print(data_dropna)
# Drop columns with any missing values
data_dropcols = data.dropna(axis=1)
print("\nAfter Dropping Columns:")
print(data_dropcols)
# 2. Imputation
# Replace with mean
imputer_mean = SimpleImputer(strategy='mean')
data_imputed_mean = pd.DataFrame(imputer_mean.fit_transform(data), columns=data.columns)
print("\nAfter Mean Imputation:")
print(data_imputed_mean)
# Replace with median
imputer_median = SimpleImputer(strategy='median')
data_imputed_median = pd.DataFrame(imputer_median.fit_transform(data), columns=data.columns)
print("\nAfter Median Imputation:")
print(data_imputed_median)
# Replace with most frequent value
imputer_mode = SimpleImputer(strategy='most_frequent')
data_imputed_mode = pd.DataFrame(imputer_mode.fit_transform(data), columns=data.columns)
print("\nAfter Mode Imputation:")
print(data_imputed_mode)
# Replace with constant value
imputer_constant = SimpleImputer(strategy='constant', fill_value=0)
data_imputed_constant = pd.DataFrame(imputer_constant.fit_transform(data), columns=data.columns)
print("\nAfter Constant Imputation:")
print(data_imputed_constant)
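SimpleImputer fills each column independently. Multivariate imputers instead estimate a missing entry from similar rows; a brief sketch with scikit-learn's KNNImputer on the same kind of sample frame (n_neighbors=2 is an arbitrary choice for this tiny dataset):

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

data = pd.DataFrame({
    'age': [25, np.nan, 30, 35, np.nan, 40],
    'income': [50000, 60000, np.nan, 80000, 90000, 100000],
    'score': [85, 90, 75, np.nan, 80, 95]
})

# Each missing value is replaced by the mean of that column over the
# k nearest rows, with distances measured on the non-missing columns
imputer_knn = KNNImputer(n_neighbors=2)
data_imputed_knn = pd.DataFrame(imputer_knn.fit_transform(data), columns=data.columns)
print("\nAfter KNN Imputation:")
print(data_imputed_knn)
```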
2.2 Removing Duplicates
Duplicate rows can skew model training by giving more weight to repeated information. It's important to identify and remove duplicate entries.
import pandas as pd
# Create sample data with duplicates
data = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob'],
    'age': [25, 30, 35, 25, 30],
    'score': [85, 90, 75, 85, 90]
})
print("Original Data:")
print(data)
# Check for duplicates
print("\nDuplicate Rows:")
print(data.duplicated())
# Count duplicates
print("\nNumber of Duplicate Rows:")
print(data.duplicated().sum())
# Remove duplicates
data_no_duplicates = data.drop_duplicates()
print("\nAfter Removing Duplicates:")
print(data_no_duplicates)
# Remove duplicates based on specific columns
data_no_duplicates_name = data.drop_duplicates(subset=['name'])
print("\nAfter Removing Duplicates by Name:")
print(data_no_duplicates_name)
2.3 Handling Outliers
Outliers are data points that deviate significantly from the rest of the dataset. They can be caused by measurement errors or genuine anomalies. Handling outliers appropriately is important for model performance.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Create sample data with outliers
data = pd.DataFrame({
    'value': [10, 12, 14, 15, 16, 18, 20, 22, 24, 100]  # 100 is an outlier
})
print("Original Data:")
print(data)
# Visualize with box plot
plt.figure(figsize=(10, 6))
plt.boxplot(data['value'])
plt.title('Box Plot to Identify Outliers')
plt.ylabel('Value')
plt.show()
# 1. IQR Method
Q1 = data['value'].quantile(0.25)
Q3 = data['value'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
print(f"\nQ1: {Q1}")
print(f"Q3: {Q3}")
print(f"IQR: {IQR}")
print(f"Lower Bound: {lower_bound}")
print(f"Upper Bound: {upper_bound}")
# Identify outliers
outliers = data[(data['value'] < lower_bound) | (data['value'] > upper_bound)]
print("\nOutliers:")
print(outliers)
# Remove outliers
data_no_outliers = data[(data['value'] >= lower_bound) & (data['value'] <= upper_bound)]
print("\nAfter Removing Outliers:")
print(data_no_outliers)
# 2. Z-Score Method
from scipy import stats
z_scores = np.abs(stats.zscore(data['value']))
threshold = 2.5  # with only 10 points the largest possible z-score is sqrt(n-1) = 3, so a cutoff of 3 would flag nothing here
print("\nZ-Scores:")
print(z_scores)
# Identify outliers
outliers_z = data[z_scores > threshold]
print("\nOutliers (Z-Score > 2.5):")
print(outliers_z)
# Remove outliers
data_no_outliers_z = data[z_scores <= threshold]
print("\nAfter Removing Outliers (Z-Score):")
print(data_no_outliers_z)
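Removal is not the only option. Outliers can also be capped (winsorized) at the IQR fences, which keeps the row in the dataset while limiting its influence; a sketch on the same sample column:

```python
import pandas as pd

data = pd.DataFrame({'value': [10, 12, 14, 15, 16, 18, 20, 22, 24, 100]})

Q1, Q3 = data['value'].quantile(0.25), data['value'].quantile(0.75)
IQR = Q3 - Q1
lower_bound, upper_bound = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR

# Clip values outside the IQR fences instead of dropping the rows
data['value_capped'] = data['value'].clip(lower=lower_bound, upper=upper_bound)
print(data)
```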
3. Feature Scaling
Feature scaling is the process of normalizing or standardizing features to ensure they are on a similar scale. This is important because many machine learning algorithms are sensitive to the scale of input features.
3.1 Types of Feature Scaling
3.1.1 Standardization (Z-Score Normalization)
Transforms features to have a mean of 0 and a standard deviation of 1. This is useful when features follow a Gaussian distribution.
3.1.2 Min-Max Scaling (Normalization)
Scales features to a fixed range, typically [0, 1] or [-1, 1]. This is useful when features have bounded ranges.
3.1.3 Robust Scaling
Uses median and IQR to scale features, making it robust to outliers.
3.1.4 MaxAbs Scaling
Scales features by their maximum absolute value, preserving the sign of the data.
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler, MaxAbsScaler
# Create sample data
np.random.seed(42)
data = pd.DataFrame({
    'feature1': np.random.normal(100, 10, 10),  # Mean=100, Std=10
    'feature2': np.random.uniform(0, 100, 10),  # Range [0, 100]
    'feature3': np.random.exponential(5, 10)    # Exponential distribution
})
print("Original Data:")
print(data)
print("\nOriginal Statistics:")
print(data.describe())
# 1. Standardization
scaler_standard = StandardScaler()
data_standard = pd.DataFrame(scaler_standard.fit_transform(data), columns=data.columns)
print("\nAfter Standardization:")
print(data_standard)
print("\nStatistics After Standardization:")
print(data_standard.describe())
# 2. Min-Max Scaling
scaler_minmax = MinMaxScaler()
data_minmax = pd.DataFrame(scaler_minmax.fit_transform(data), columns=data.columns)
print("\nAfter Min-Max Scaling:")
print(data_minmax)
print("\nStatistics After Min-Max Scaling:")
print(data_minmax.describe())
# 3. Robust Scaling
scaler_robust = RobustScaler()
data_robust = pd.DataFrame(scaler_robust.fit_transform(data), columns=data.columns)
print("\nAfter Robust Scaling:")
print(data_robust)
print("\nStatistics After Robust Scaling:")
print(data_robust.describe())
# 4. MaxAbs Scaling
scaler_maxabs = MaxAbsScaler()
data_maxabs = pd.DataFrame(scaler_maxabs.fit_transform(data), columns=data.columns)
print("\nAfter MaxAbs Scaling:")
print(data_maxabs)
print("\nStatistics After MaxAbs Scaling:")
print(data_maxabs.describe())
4. Feature Encoding
Feature encoding is the process of converting categorical data into a numerical format that machine learning algorithms can use. Most algorithms require numerical inputs, so categorical features must be properly encoded.
4.1 Types of Categorical Data
- Nominal: Categories with no inherent order (e.g., colors, countries)
- Ordinal: Categories with a specific order (e.g., ratings, education levels)
4.2 Encoding Techniques
4.2.1 One-Hot Encoding
Creates a binary column for each category. Suitable for nominal data with a small number of categories.
4.2.2 Label Encoding
Assigns a unique integer to each category. Suitable for ordinal data.
4.2.3 Ordinal Encoding
Similar to label encoding but allows specifying the order of categories explicitly.
4.2.4 Binary Encoding
Converts categories to binary code, reducing dimensionality compared to one-hot encoding.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, OrdinalEncoder
import category_encoders as ce  # third-party package: pip install category_encoders
# Create sample data
data = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'red', 'green'],
    'size': ['S', 'M', 'L', 'M', 'XL'],
    'rating': ['poor', 'good', 'excellent', 'good', 'excellent']
})
print("Original Data:")
print(data)
# 1. One-Hot Encoding
print("\n1. One-Hot Encoding:")
data_onehot = pd.get_dummies(data, columns=['color'])
print(data_onehot)
# Using scikit-learn OneHotEncoder
encoder_onehot = OneHotEncoder(sparse_output=False)
color_encoded = encoder_onehot.fit_transform(data[['color']])
color_df = pd.DataFrame(color_encoded, columns=encoder_onehot.get_feature_names_out(['color']))
data_onehot_sklearn = pd.concat([data, color_df], axis=1).drop('color', axis=1)
print("\nOne-Hot Encoding with scikit-learn:")
print(data_onehot_sklearn)
# 2. Label Encoding
print("\n2. Label Encoding:")
label_encoder = LabelEncoder()
data_label = data.copy()
data_label['color_encoded'] = label_encoder.fit_transform(data['color'])
print(data_label)
# 3. Ordinal Encoding
print("\n3. Ordinal Encoding:")
# Define the order for size and rating
size_order = ['S', 'M', 'L', 'XL']
rating_order = ['poor', 'good', 'excellent']
ordinal_encoder = OrdinalEncoder(categories=[size_order, rating_order])
data_ordinal = data.copy()
data_ordinal[['size_encoded', 'rating_encoded']] = ordinal_encoder.fit_transform(data[['size', 'rating']])
print(data_ordinal)
# 4. Binary Encoding
print("\n4. Binary Encoding:")
binary_encoder = ce.BinaryEncoder(cols=['color'])
data_binary = binary_encoder.fit_transform(data)
print(data_binary)
5. Feature Selection and Transformation
Feature selection involves identifying and using only the most relevant features for model training. Feature transformation involves creating new features from existing ones to improve model performance.
5.1 Feature Selection Methods
5.1.1 Filter Methods
Evaluate features based on statistical tests or correlation with the target variable.
5.1.2 Wrapper Methods
Evaluate feature subsets by training models and measuring performance.
5.1.3 Embedded Methods
Select features during model training (e.g., LASSO regularization).
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, chi2, RFE, SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
# Load sample data
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)
print(f"Original number of features: {X.shape[1]}")
print(f"Feature names: {X.columns.tolist()}")
# 1. Filter Method: SelectKBest
print("\n1. Filter Method: SelectKBest")
selector_kbest = SelectKBest(chi2, k=10)
X_kbest = selector_kbest.fit_transform(X, y)
selected_features_kbest = X.columns[selector_kbest.get_support()]
print(f"Selected features: {selected_features_kbest.tolist()}")
# 2. Wrapper Method: Recursive Feature Elimination (RFE)
print("\n2. Wrapper Method: RFE")
model = LogisticRegression(max_iter=10000)
rfe = RFE(model, n_features_to_select=10)
X_rfe = rfe.fit_transform(X, y)
selected_features_rfe = X.columns[rfe.get_support()]
print(f"Selected features: {selected_features_rfe.tolist()}")
# 3. Embedded Method: Feature Importance from Random Forest
print("\n3. Embedded Method: Feature Importance")
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)
# Get feature importances
feature_importances = pd.Series(rf.feature_importances_, index=X.columns)
feature_importances = feature_importances.sort_values(ascending=False)
print("Feature Importances:")
print(feature_importances.head(10))
# Select top 10 features
selected_features_rf = feature_importances.head(10).index.tolist()
print(f"\nSelected features: {selected_features_rf}")
# Using SelectFromModel
selector_rf = SelectFromModel(rf, max_features=10, threshold=-np.inf)
X_rf = selector_rf.fit_transform(X, y)
selected_features_sfm = X.columns[selector_rf.get_support()]
print(f"\nSelected features with SelectFromModel: {selected_features_sfm.tolist()}")
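The LASSO-style regularization mentioned under embedded methods (5.1.3) can also drive selection directly: an L1 penalty pushes some coefficients exactly to zero, and SelectFromModel keeps only the features with nonzero weight. A sketch on the same dataset (the C value is an arbitrary choice for illustration):

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# L1-penalized logistic regression; scale first so the penalty
# treats all features comparably
X_scaled = StandardScaler().fit_transform(X)
lasso_like = LogisticRegression(penalty='l1', solver='liblinear', C=0.1)
selector_l1 = SelectFromModel(lasso_like)
selector_l1.fit(X_scaled, y)
selected_features_l1 = X.columns[selector_l1.get_support()]
print(f"Selected features: {selected_features_l1.tolist()}")
```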
5.2 Feature Transformation
Feature transformation involves creating new features from existing ones to capture complex relationships in the data.
import numpy as np
import pandas as pd
# Create sample data
data = pd.DataFrame({
    'age': [25, 30, 35, 40, 45],
    'income': [50000, 60000, 70000, 80000, 90000],
    'expenditure': [30000, 35000, 40000, 45000, 50000]
})
print("Original Data:")
print(data)
# 1. Polynomial Features
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, include_bias=False)
poly_features = poly.fit_transform(data)
poly_feature_names = poly.get_feature_names_out(data.columns)
data_poly = pd.DataFrame(poly_features, columns=poly_feature_names)
print("\nAfter Polynomial Features:")
print(data_poly)
# 2. Interaction Features
# Create interaction between age and income
data['age_income_interaction'] = data['age'] * data['income']
# Create ratio feature
data['savings_ratio'] = (data['income'] - data['expenditure']) / data['income']
print("\nAfter Creating Custom Features:")
print(data)
# 3. Log Transformation
# Apply log transformation to income (useful for skewed data)
data['log_income'] = np.log(data['income'])
print("\nAfter Log Transformation:")
print(data[['income', 'log_income']])
# 4. Binning
# Create age bins
data['age_bin'] = pd.cut(data['age'], bins=[20, 30, 40, 50], labels=['young', 'middle', 'old'])
print("\nAfter Binning:")
print(data[['age', 'age_bin']])
# 5. One-Hot Encode Binned Feature
data_encoded = pd.get_dummies(data, columns=['age_bin'])
print("\nAfter One-Hot Encoding Binned Feature:")
print(data_encoded)
6. Data Splitting
Data splitting is the process of dividing the dataset into separate subsets for training, validation, and testing. This is essential for evaluating model performance and preventing overfitting.
6.1 Common Splitting Strategies
6.1.1 Train-Test Split
Divide data into training (typically 70-80%) and testing (20-30%) sets.
6.1.2 Train-Validation-Test Split
Divide data into three sets: training, validation, and testing.
6.1.3 Cross-Validation
Use k-fold cross-validation to evaluate model performance more robustly.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, KFold, StratifiedKFold
from sklearn.datasets import load_iris
# Load sample data
data = load_iris()
X, y = data.data, data.target
print(f"Dataset size: {X.shape}")
# 1. Train-Test Split
print("\n1. Train-Test Split")
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"Train size: {X_train.shape}, Test size: {X_test.shape}")
print(f"Train class distribution: {np.bincount(y_train)}")
print(f"Test class distribution: {np.bincount(y_test)}")
# 2. Train-Validation-Test Split
print("\n2. Train-Validation-Test Split")
# First split into train and temp
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.4, random_state=42, stratify=y
)
# Then split temp into validation and test
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp
)
print(f"Train size: {X_train.shape}, Validation size: {X_val.shape}, Test size: {X_test.shape}")
# 3. K-Fold Cross-Validation
print("\n3. K-Fold Cross-Validation")
kf = KFold(n_splits=5, shuffle=True, random_state=42)
fold = 1
for train_idx, test_idx in kf.split(X):
    X_train_kf, X_test_kf = X[train_idx], X[test_idx]
    y_train_kf, y_test_kf = y[train_idx], y[test_idx]
    print(f"Fold {fold}: Train size={X_train_kf.shape}, Test size={X_test_kf.shape}")
    fold += 1
# 4. Stratified K-Fold Cross-Validation
print("\n4. Stratified K-Fold Cross-Validation")
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold = 1
for train_idx, test_idx in skf.split(X, y):
    X_train_skf, X_test_skf = X[train_idx], X[test_idx]
    y_train_skf, y_test_skf = y[train_idx], y[test_idx]
    print(f"Fold {fold}: Train size={X_train_skf.shape}, Test size={X_test_skf.shape}")
    print(f"  Train class distribution: {np.bincount(y_train_skf)}")
    print(f"  Test class distribution: {np.bincount(y_test_skf)}")
    fold += 1
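In practice the fold loops above are rarely written by hand; cross_val_score runs the full train/evaluate cycle for each fold. A short sketch on the same Iris data (the logistic regression model is just a placeholder estimator):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Wrapping scaling and the model in one pipeline re-fits the scaler inside
# each training fold, so no test-fold information leaks into the scaling
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv)
print(f"Fold accuracies: {scores}")
print(f"Mean accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
```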
7. Practice Case
7.1 Case Objective
Preprocess the Titanic dataset for a classification task to predict passenger survival. This will help you practice the complete data preprocessing workflow, from cleaning to feature engineering.
7.2 Implementation Steps
7.2.1 Data Loading and Exploration
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_openml
# Load Titanic dataset
titanic = fetch_openml('titanic', version=1, as_frame=True)
df = titanic.frame
print("Dataset shape:", df.shape)
print("\nFirst 5 rows:")
print(df.head())
print("\nColumns:", df.columns.tolist())
print("\nData types:")
print(df.dtypes)
# Check for missing values
print("\nMissing values:")
print(df.isnull().sum())
# Check target variable distribution
print("\nSurvival distribution:")
print(df['survived'].value_counts())
print("\nSurvival percentage:")
print(df['survived'].value_counts(normalize=True))
# Visualize missing values
plt.figure(figsize=(10, 6))
sns.heatmap(df.isnull(), cbar=False, cmap='viridis')
plt.title('Missing Values Heatmap')
plt.show()
7.2.2 Data Cleaning
# Handle missing values
# 1. Age: Impute with median
median_age = df['age'].median()
df['age'] = df['age'].fillna(median_age)
# 2. Fare: Impute with median
median_fare = df['fare'].median()
df['fare'] = df['fare'].fillna(median_fare)
# 3. Embarked: Impute with most frequent value
most_frequent_embarked = df['embarked'].mode()[0]
df['embarked'] = df['embarked'].fillna(most_frequent_embarked)
# 4. Cabin: Create a binary feature indicating if cabin is known
df['has_cabin'] = df['cabin'].notnull().astype(int)
# Drop cabin column as it has too many missing values
df = df.drop('cabin', axis=1)
# Check missing values after cleaning
print("\nMissing values after cleaning:")
print(df.isnull().sum())
# Remove duplicate rows
print("\nNumber of duplicate rows:", df.duplicated().sum())
df = df.drop_duplicates()
print("Dataset shape after removing duplicates:", df.shape)
# Inspect categorical variables
print("\nCategorical variables:")
categorical_cols = df.select_dtypes(include=['object', 'category']).columns.tolist()
print(categorical_cols)
# Check unique values for each categorical column
for col in categorical_cols:
    print(f"\n{col} unique values:")
    print(df[col].unique())
    print(f"Number of unique values: {df[col].nunique()}")
7.2.3 Feature Engineering
# Create new features
# 1. Family size: siblings/spouses + parents/children + 1 (self)
df['family_size'] = df['sibsp'] + df['parch'] + 1
# 2. Is alone: 1 if family_size == 1, 0 otherwise
df['is_alone'] = (df['family_size'] == 1).astype(int)
# 3. Fare per person: fare divided by family size
df['fare_per_person'] = df['fare'] / df['family_size']
# 4. Age group: Create age bins
df['age_group'] = pd.cut(df['age'], bins=[0, 12, 18, 35, 60, 100], labels=['child', 'teenager', 'adult', 'middle_aged', 'senior'])
# 5. Title: Extract title from name
import re
def extract_title(name):
    # Raw string so the escaped dot is a literal '.' in the pattern
    title_search = re.search(r' ([A-Za-z]+)\.', name)
    if title_search:
        return title_search.group(1)
    return ""
df['title'] = df['name'].apply(extract_title)
# Group rare titles
print("\nTitle distribution:")
print(df['title'].value_counts())
# Replace rare titles with 'Other'
rare_titles = ['Lady', 'Countess','Capt', 'Col', 'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona']
df['title'] = df['title'].replace(rare_titles, 'Other')
# Check title distribution after grouping
print("\nTitle distribution after grouping:")
print(df['title'].value_counts())
# Drop unnecessary columns
df = df.drop(['name', 'ticket', 'boat', 'body', 'home.dest'], axis=1)
print("\nDataset shape after feature engineering:")
print(df.shape)
print("\nFinal columns:", df.columns.tolist())
7.2.4 Feature Encoding and Scaling
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
# Separate features and target
target = df['survived'].astype(int)
features = df.drop('survived', axis=1)
# Define categorical and numerical columns
categorical_cols = ['sex', 'embarked', 'age_group', 'title']
numerical_cols = ['pclass', 'age', 'sibsp', 'parch', 'fare', 'has_cabin', 'family_size', 'is_alone', 'fare_per_person']
# Create preprocessing pipeline
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_cols),
        ('cat', OneHotEncoder(drop='first'), categorical_cols)
    ])
# Fit and transform the data
X_preprocessed = preprocessor.fit_transform(features)
y_preprocessed = target.values
print(f"\nPreprocessed features shape: {X_preprocessed.shape}")
print(f"Target shape: {y_preprocessed.shape}")
# Split into train and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X_preprocessed, y_preprocessed, test_size=0.2, random_state=42, stratify=y_preprocessed
)
print(f"\nTrain size: {X_train.shape}")
print(f"Test size: {X_test.shape}")
print(f"Train class distribution: {np.bincount(y_train)}")
print(f"Test class distribution: {np.bincount(y_test)}")
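One caveat with the cell above: the preprocessor is fitted on the full dataset before splitting, so the scaler's mean and variance absorb test-set statistics (a mild form of leakage). A leakage-free variant splits first and fits only on the training rows, sketched here on a small synthetic stand-in rather than the Titanic frame:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical stand-in for the Titanic features (small, synthetic)
rng = np.random.default_rng(42)
features = pd.DataFrame({
    'fare': rng.uniform(5, 100, 50),
    'age': rng.uniform(1, 80, 50),
    'sex': rng.choice(['male', 'female'], 50),
})
target = rng.integers(0, 2, 50)

preprocessor = ColumnTransformer([
    ('num', StandardScaler(), ['fare', 'age']),
    ('cat', OneHotEncoder(drop='first'), ['sex']),
])

# Split the raw rows first, then fit the preprocessor on the training rows only,
# so test-set statistics never influence the scaler
X_train_raw, X_test_raw, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=42, stratify=target
)
X_train = preprocessor.fit_transform(X_train_raw)
X_test = preprocessor.transform(X_test_raw)
print(X_train.shape, X_test.shape)
```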
7.2.5 Model Training and Evaluation
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
# 1. Logistic Regression
print("\n1. Logistic Regression")
lr = LogisticRegression(random_state=42, max_iter=1000)
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)
accuracy_lr = accuracy_score(y_test, y_pred_lr)
print(f"Accuracy: {accuracy_lr:.2f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred_lr))
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred_lr))
# 2. Random Forest
print("\n2. Random Forest")
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
accuracy_rf = accuracy_score(y_test, y_pred_rf)
print(f"Accuracy: {accuracy_rf:.2f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred_rf))
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred_rf))
# Compare models
print("\nModel Comparison:")
print(f"Logistic Regression Accuracy: {accuracy_lr:.2f}")
print(f"Random Forest Accuracy: {accuracy_rf:.2f}")
# Feature importance from Random Forest
if hasattr(rf, 'feature_importances_'):
    # Get feature names after one-hot encoding
    cat_feature_names = preprocessor.named_transformers_['cat'].get_feature_names_out(categorical_cols)
    all_feature_names = numerical_cols + cat_feature_names.tolist()
    # Get feature importances
    feature_importances = pd.Series(rf.feature_importances_, index=all_feature_names)
    feature_importances = feature_importances.sort_values(ascending=False)
    print("\nTop 10 Important Features:")
    print(feature_importances.head(10))
    # Visualize feature importances
    plt.figure(figsize=(12, 8))
    feature_importances.head(15).plot(kind='barh')
    plt.title('Top 15 Important Features')
    plt.xlabel('Importance')
    plt.ylabel('Feature')
    plt.tight_layout()
    plt.show()
8. Interactive Exercises
Exercise 1: Data Cleaning
Using the following dataset, complete the data cleaning tasks:
import pandas as pd
import numpy as np
# Create sample data
np.random.seed(42)
data = pd.DataFrame({
    'id': range(1, 11),
    'name': ['Alice', 'Bob', 'Charlie', 'Alice', 'David', 'Eve', 'Frank', 'Grace', 'Heidi', 'Ivan'],
    'age': [25, np.nan, 30, 25, np.nan, 35, 40, 45, 50, 55],
    'salary': [50000, 60000, np.nan, 50000, 70000, 80000, 90000, 100000, 110000, 120000],
    'department': ['HR', 'IT', 'Finance', 'HR', 'IT', 'Finance', 'HR', 'IT', 'Finance', 'IT']
})
- Identify missing values in the dataset.
- Handle missing values in the 'age' column using median imputation.
- Handle missing values in the 'salary' column using mean imputation.
- Remove duplicate rows from the dataset.
- Verify that there are no more missing values after cleaning.
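One possible solution sketch (note that the two Alice rows differ in 'id', so duplicates must be checked on the non-id columns):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    'id': range(1, 11),
    'name': ['Alice', 'Bob', 'Charlie', 'Alice', 'David', 'Eve', 'Frank', 'Grace', 'Heidi', 'Ivan'],
    'age': [25, np.nan, 30, 25, np.nan, 35, 40, 45, 50, 55],
    'salary': [50000, 60000, np.nan, 50000, 70000, 80000, 90000, 100000, 110000, 120000],
    'department': ['HR', 'IT', 'Finance', 'HR', 'IT', 'Finance', 'HR', 'IT', 'Finance', 'IT']
})

print(data.isnull().sum())                                     # 1. identify missing values
data['age'] = data['age'].fillna(data['age'].median())         # 2. median imputation
data['salary'] = data['salary'].fillna(data['salary'].mean())  # 3. mean imputation
# 4. remove duplicates, ignoring the 'id' column
data = data.drop_duplicates(subset=['name', 'age', 'salary', 'department'])
print(data.isnull().sum().sum())                               # 5. should print 0
```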
Exercise 2: Feature Scaling
Using the following dataset, apply different feature scaling techniques:
import pandas as pd
import numpy as np
# Create sample data
np.random.seed(42)
data = pd.DataFrame({
    'feature1': np.random.normal(100, 10, 20),
    'feature2': np.random.uniform(0, 1, 20),
    'feature3': np.random.exponential(5, 20)
})
- Apply standardization to all features.
- Apply Min-Max scaling to all features.
- Compare the mean, standard deviation, minimum, and maximum values before and after scaling.
- Which scaling technique would you choose for this dataset and why?
Exercise 3: Feature Encoding
Using the following dataset, apply different feature encoding techniques:
import pandas as pd
# Create sample data
data = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'red', 'blue', 'green'],
    'size': ['S', 'M', 'L', 'M', 'L', 'XL'],
    'material': ['cotton', 'wool', 'silk', 'cotton', 'wool', 'silk'],
    'price': [10, 20, 30, 15, 25, 35]
})
- Apply one-hot encoding to the 'color' column.
- Apply ordinal encoding to the 'size' column (define the order as S < M < L < XL).
- Apply label encoding to the 'material' column.
- Combine the encoded features with the 'price' column to create a final dataset.
Exercise 4: Data Preprocessing Pipeline
Create a complete data preprocessing pipeline for the Iris dataset:
- Load the Iris dataset from scikit-learn.
- Split the dataset into training (80%) and testing (20%) sets.
- Create a preprocessing pipeline that includes:
- Standardization of numerical features
- No encoding needed for this dataset (all features are numerical)
- Train a Logistic Regression model on the preprocessed training data.
- Evaluate the model on the preprocessed test data using accuracy score.
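One possible solution sketch for this exercise (not the only valid approach):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Load and split (80/20, stratified to preserve class balance)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Standardization + model in one pipeline; the scaler is fit on training data only
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipeline.fit(X_train, y_train)
accuracy = accuracy_score(y_test, pipeline.predict(X_test))
print(f"Test accuracy: {accuracy:.2f}")
```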