Feature Engineering Overview
Feature engineering is a critical step in the machine learning workflow that involves creating, selecting, and transforming features to improve model performance and generalization. Good feature engineering can:
- Improve model prediction accuracy
- Reduce model training time
- Enhance model generalization and reduce overfitting
- Make models more stable and interpretable
Feature engineering mainly includes the following aspects:
- Feature creation: Creating new features from raw data
- Feature selection: Selecting features that are most helpful for prediction
- Feature extraction: Extracting low-dimensional representations from high-dimensional data
- Feature transformation: Transforming features through standardization, normalization, etc.
Feature Creation
1. Numeric Feature Creation
Numeric feature creation involves creating new features from raw numeric data. Common methods include:
- Polynomial features
- Interaction features
- Statistical features (mean, variance, quantiles, etc.)
- Time features (year, month, day, week, etc.)
import numpy as np
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures
# Create sample data
data = pd.DataFrame({
    'age': [25, 30, 35, 40, 45],
    'income': [50000, 60000, 70000, 80000, 90000],
    'expenses': [30000, 35000, 40000, 45000, 50000]
})
print("Original data:")
print(data)
print()
# Create new feature: savings rate
data['savings_rate'] = (data['income'] - data['expenses']) / data['income']
# Create new feature: age-income interaction
data['age_income'] = data['age'] * data['income']
# Create polynomial features (degree 2: age^2, income^2 and the age*income cross term)
poly = PolynomialFeatures(degree=2, include_bias=False)
poly_features = poly.fit_transform(data[['age', 'income']])
poly_feature_names = poly.get_feature_names_out(['age', 'income'])
poly_data = pd.DataFrame(poly_features, columns=poly_feature_names)
# Drop the first-degree columns ('age', 'income') so we don't duplicate existing columns
poly_data = poly_data.drop(columns=['age', 'income'])
data = pd.concat([data, poly_data], axis=1)
print("Data after adding new features:")
print(data)
print()
# Create statistical features
# (these global statistics are constant across rows; in practice they are usually
#  computed per group, e.g. with groupby(...)['income'].transform('mean'))
data['income_mean'] = data['income'].mean()
data['income_std'] = data['income'].std()
data['income_min'] = data['income'].min()
data['income_max'] = data['income'].max()
print("Data after adding statistical features:")
print(data)
2. Categorical Feature Creation
Categorical feature creation involves creating new features from raw categorical data. Common methods include:
- Frequency encoding
- Target encoding
- WOE (Weight of Evidence) encoding
- Combination features
import pandas as pd
import numpy as np
# Create sample data
data = pd.DataFrame({
    'gender': ['M', 'F', 'M', 'F', 'M'],
    'education': ['High School', 'College', 'College', 'Graduate', 'Graduate'],
    'occupation': ['Engineer', 'Teacher', 'Doctor', 'Engineer', 'Doctor'],
    'salary': [60000, 50000, 80000, 70000, 90000]
})
print("Original data:")
print(data)
print()
# Frequency encoding
gender_counts = data['gender'].value_counts()
data['gender_freq'] = data['gender'].map(gender_counts)
education_counts = data['education'].value_counts()
data['education_freq'] = data['education'].map(education_counts)
# Target encoding (based on salary mean)
gender_target = data.groupby('gender')['salary'].mean()
data['gender_target'] = data['gender'].map(gender_target)
education_target = data.groupby('education')['salary'].mean()
data['education_target'] = data['education'].map(education_target)
# Combination feature
data['education_occupation'] = data['education'] + '_' + data['occupation']
print("Data after adding new features:")
print(data)
print()
# WOE encoding (simplified version: log of the good/bad count ratio per category;
# the standard definition uses the ratio of class distributions across all categories)
# We use a binary classification target to demonstrate WOE
data['high_income'] = (data['salary'] > 65000).astype(int)
# Calculate WOE
for feature in ['gender', 'education', 'occupation']:
    woe_dict = {}
    for category in data[feature].unique():
        # Count high- and low-income samples in this category
        good = len(data[(data[feature] == category) & (data['high_income'] == 1)])
        bad = len(data[(data[feature] == category) & (data['high_income'] == 0)])
        # Avoid division by zero
        good = max(good, 0.5)
        bad = max(bad, 0.5)
        # Calculate WOE
        woe = np.log(good / bad)
        woe_dict[category] = woe
    # Add WOE feature
    data[f'{feature}_woe'] = data[feature].map(woe_dict)
print("Data after adding WOE features:")
print(data)
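One caveat: the target encoding above is computed on the full dataset, so each row's encoding already contains its own target value (target leakage). A common remedy is out-of-fold target encoding; the sketch below is a minimal illustration of the idea (the helper name and the 3-fold split are illustrative choices, not part of the original example).
from sklearn.model_selection import KFold
def target_encode_oof(df, feature, target, n_splits=3, seed=42):
    """Out-of-fold target encoding: each row is encoded with category means
    computed from the other folds only, so its own target value is never used."""
    encoded = pd.Series(index=df.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, valid_idx in kf.split(df):
        fold_means = df.iloc[train_idx].groupby(feature)[target].mean()
        encoded.iloc[valid_idx] = df[feature].iloc[valid_idx].map(fold_means).to_numpy()
    # Categories unseen in a training fold fall back to the global mean
    return encoded.fillna(df[target].mean())
data['gender_target_oof'] = target_encode_oof(data, 'gender', 'salary')
print(data[['gender', 'gender_target', 'gender_target_oof']])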
3. Time Feature Creation
Time feature creation involves extracting meaningful features from time data. Common methods include:
- Extracting year, month, day, hour, minute, second
- Extracting day of week, month name
- Extracting weekend/weekday, holiday indicators
- Calculating time differences
- Extracting seasonal features
import pandas as pd
import numpy as np
# Create sample time data
dates = pd.date_range('2023-01-01', '2023-01-10', freq='D')
data = pd.DataFrame({
    'date': dates,
    'sales': np.random.randint(100, 1000, size=len(dates))
})
print("Original data:")
print(data)
print()
# Extract time features
data['year'] = data['date'].dt.year
data['month'] = data['date'].dt.month
data['day'] = data['date'].dt.day
data['day_of_week'] = data['date'].dt.dayofweek
data['weekday_name'] = data['date'].dt.day_name()
data['is_weekend'] = data['day_of_week'].apply(lambda x: 1 if x >= 5 else 0)
data['is_month_start'] = data['date'].dt.is_month_start.astype(int)
data['is_month_end'] = data['date'].dt.is_month_end.astype(int)
# Calculate time differences
data['days_since_start'] = (data['date'] - data['date'].min()).dt.days
# Extract seasonal features (using sine and cosine transformations)
data['day_sin'] = np.sin(2 * np.pi * data['day'] / 31)
data['day_cos'] = np.cos(2 * np.pi * data['day'] / 31)
data['month_sin'] = np.sin(2 * np.pi * data['month'] / 12)
data['month_cos'] = np.cos(2 * np.pi * data['month'] / 12)
print("Data after adding time features:")
print(data)
print()
# Check data types
print("Data types:")
print(data.dtypes)
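The method list above also mentions holiday indicators, which the code does not create. A minimal sketch, assuming a hand-maintained list of holiday dates (the dates below are illustrative; in practice they would come from a calendar source):
# Illustrative holiday list covering the sample date range
holidays = pd.to_datetime(['2023-01-01', '2023-01-02'])
data['is_holiday'] = data['date'].isin(holidays).astype(int)
print(data[['date', 'is_weekend', 'is_holiday']])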
Feature Selection
1. Filter-Based Feature Selection
Filter-based feature selection selects features by calculating statistical relationships between features and the target variable. Common methods include:
- Pearson correlation coefficient
- Chi-square test
- Mutual information
- Variance threshold
import pandas as pd
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectKBest, f_regression, mutual_info_regression, VarianceThreshold
# Load dataset
diabetes = load_diabetes()
X = diabetes.data
y = diabetes.target
feature_names = diabetes.feature_names
print(f"Original number of features: {X.shape[1]}")
print(f"Feature names: {feature_names}")
print()
# 1. Variance threshold
# Note: load_diabetes() returns pre-scaled features (each column has variance of
# roughly 1/n_samples), so the threshold must be small or no feature would pass
print("1. Variance threshold selection:")
selector = VarianceThreshold(threshold=0.001)
X_var = selector.fit_transform(X)
selected_features_var = [feature_names[i] for i in range(len(feature_names)) if selector.get_support()[i]]
print(f"Number of selected features: {X_var.shape[1]}")
print(f"Selected features: {selected_features_var}")
print()
# 2. Univariate F-test (the F-statistic is derived from the Pearson correlation)
print("2. Univariate F-test (Pearson correlation) selection:")
selector = SelectKBest(score_func=f_regression, k=5)
X_f = selector.fit_transform(X, y)
scores = selector.scores_
selected_features_f = [feature_names[i] for i in range(len(feature_names)) if selector.get_support()[i]]
print(f"Number of selected features: {X_f.shape[1]}")
print(f"Selected features: {selected_features_f}")
print("Feature scores:")
for feature, score in zip(feature_names, scores):
    print(f"{feature}: {score:.4f}")
print()
# 3. Mutual information
print("3. Mutual information selection:")
selector = SelectKBest(score_func=mutual_info_regression, k=5)
X_mi = selector.fit_transform(X, y)
mi_scores = selector.scores_
selected_features_mi = [feature_names[i] for i in range(len(feature_names)) if selector.get_support()[i]]
print(f"Number of selected features: {X_mi.shape[1]}")
print(f"Selected features: {selected_features_mi}")
print("Feature scores:")
for feature, score in zip(feature_names, mi_scores):
    print(f"{feature}: {score:.4f}")
print()
# 4. Correlation matrix
print("4. Correlation matrix:")
df = pd.DataFrame(X, columns=feature_names)
df['target'] = y
corr_matrix = df.corr()
print("Correlation between features and target:")
print(corr_matrix['target'].sort_values(ascending=False))
print()
# 5. Remove highly correlated features
# Only feature-feature correlations are considered; the target column is excluded
# so that features strongly correlated with the target are not dropped by mistake
print("5. Removing highly correlated features:")
corr_threshold = 0.5
high_corr_features = set()
feature_corr = corr_matrix.loc[feature_names, feature_names]
for i in range(len(feature_corr.columns)):
    for j in range(i):
        if abs(feature_corr.iloc[i, j]) > corr_threshold:
            feature_i = feature_corr.columns[i]
            feature_j = feature_corr.columns[j]
            # Keep the feature with higher correlation to the target
            if abs(corr_matrix.loc[feature_i, 'target']) > abs(corr_matrix.loc[feature_j, 'target']):
                high_corr_features.add(feature_j)
            else:
                high_corr_features.add(feature_i)
print(f"Highly correlated features to remove: {high_corr_features}")
selected_features_corr = [feature for feature in feature_names if feature not in high_corr_features]
print(f"Selected features: {selected_features_corr}")
2. Wrapper-Based Feature Selection
Wrapper-based feature selection selects features by evaluating model performance on different feature subsets. Common methods include:
- Recursive Feature Elimination (RFE)
- Recursive Feature Elimination with Cross-Validation (RFECV)
- Sequential Feature Selection (SFS)
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import RFE, RFECV
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, cross_val_score
# Load dataset
diabetes = load_diabetes()
X = diabetes.data
y = diabetes.target
feature_names = diabetes.feature_names
print(f"Original number of features: {X.shape[1]}")
print(f"Feature names: {feature_names}")
print()
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# 1. Recursive Feature Elimination (RFE)
print("1. Recursive Feature Elimination (RFE):")
model = LinearRegression()
rfe = RFE(estimator=model, n_features_to_select=5, step=1)
rfe.fit(X_train, y_train)
selected_features_rfe = [feature_names[i] for i in range(len(feature_names)) if rfe.support_[i]]
print(f"Number of selected features: {rfe.n_features_}")
print(f"Selected features: {selected_features_rfe}")
print("Feature rankings:")
for feature, rank in zip(feature_names, rfe.ranking_):
    print(f"{feature}: {rank}")
print()
# 2. Recursive Feature Elimination with Cross-Validation (RFECV)
print("2. Recursive Feature Elimination with Cross-Validation (RFECV):")
model = RandomForestRegressor(n_estimators=100, random_state=42)
rfecv = RFECV(estimator=model, step=1, cv=5, scoring='r2')
rfecv.fit(X_train, y_train)
selected_features_rfecv = [feature_names[i] for i in range(len(feature_names)) if rfecv.support_[i]]
print(f"Optimal number of features: {rfecv.n_features_}")
print(f"Selected features: {selected_features_rfecv}")
print(f"Cross-validation score: {rfecv.grid_scores_.mean():.4f}")
print()
# 3. Manual implementation of Sequential Feature Selection (SFS)
print("3. Sequential Feature Selection (SFS):")
def sequential_feature_selection(X, y, model, n_features, direction='forward'):
    """Sequential feature selection.
    direction: 'forward' or 'backward'
    """
    n_features_total = X.shape[1]
    if direction == 'forward':
        # Forward selection: repeatedly add the feature that improves the score the most
        selected_features = []
        remaining_features = list(range(n_features_total))
        while len(selected_features) < n_features:
            best_score = -np.inf
            best_feature = None
            for feature in remaining_features:
                # Temporary feature set
                temp_features = selected_features + [feature]
                # Cross-validation
                score = cross_val_score(model, X[:, temp_features], y, cv=5, scoring='r2').mean()
                # Update best feature
                if score > best_score:
                    best_score = score
                    best_feature = feature
            if best_feature is not None:
                selected_features.append(best_feature)
                remaining_features.remove(best_feature)
                print(f"Adding feature {feature_names[best_feature]}, score: {best_score:.4f}")
    elif direction == 'backward':
        # Backward elimination: repeatedly drop the feature whose removal
        # hurts the cross-validated score the least
        selected_features = list(range(n_features_total))
        while len(selected_features) > n_features:
            best_score = -np.inf
            feature_to_remove = None
            for feature in selected_features:
                # Temporary feature set without this feature
                temp_features = [f for f in selected_features if f != feature]
                # Cross-validation
                score = cross_val_score(model, X[:, temp_features], y, cv=5, scoring='r2').mean()
                # Removing this feature keeps the score highest so far
                if score > best_score:
                    best_score = score
                    feature_to_remove = feature
            if feature_to_remove is not None:
                selected_features.remove(feature_to_remove)
                print(f"Removing feature {feature_names[feature_to_remove]}, score: {best_score:.4f}")
    return selected_features
# Use forward selection
print("Forward selection:")
model = LinearRegression()
selected_indices = sequential_feature_selection(X_train, y_train, model, n_features=5, direction='forward')
selected_features_sfs = [feature_names[i] for i in selected_indices]
print(f"Selected features: {selected_features_sfs}")
print()
# Use backward selection
print("Backward selection:")
selected_indices = sequential_feature_selection(X_train, y_train, model, n_features=5, direction='backward')
selected_features_sbs = [feature_names[i] for i in selected_indices]
print(f"Selected features: {selected_features_sbs}")
3. Embedded Feature Selection
Embedded feature selection automatically selects features during model training. Common methods include:
- L1 regularization (Lasso)
- Tree model feature importance
- Elastic Net
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso, ElasticNet
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Load dataset
diabetes = load_diabetes()
X = diabetes.data
y = diabetes.target
feature_names = diabetes.feature_names
print(f"Original number of features: {X.shape[1]}")
print(f"Feature names: {feature_names}")
print()
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Standardization
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# 1. L1 regularization (Lasso)
print("1. L1 regularization (Lasso):")
alphas = [0.001, 0.01, 0.1, 1, 10]
for alpha in alphas:
    lasso = Lasso(alpha=alpha, random_state=42)
    lasso.fit(X_train_scaled, y_train)
    coeffs = lasso.coef_
    non_zero_coeffs = np.sum(coeffs != 0)
    print(f"Alpha={alpha}: Number of non-zero coefficients={non_zero_coeffs}")
    print("Coefficients:")
    for feature, coeff in zip(feature_names, coeffs):
        print(f"{feature}: {coeff:.4f}")
    print()
# 2. Elastic Net
print("2. Elastic Net:")
elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5, random_state=42)
elastic_net.fit(X_train_scaled, y_train)
coeffs = elastic_net.coef_
non_zero_coeffs = np.sum(coeffs != 0)
print(f"Number of non-zero coefficients: {non_zero_coeffs}")
print("Coefficients:")
for feature, coeff in zip(feature_names, coeffs):
    print(f"{feature}: {coeff:.4f}")
print()
# 3. Random Forest feature importance
print("3. Random Forest feature importance:")
rforest = RandomForestRegressor(n_estimators=100, random_state=42)
rforest.fit(X_train, y_train)
importances = rforest.feature_importances_
indices = np.argsort(importances)[::-1]
print("Feature importance ranking:")
for f in range(X.shape[1]):
    print(f"{f+1}. {feature_names[indices[f]]}: {importances[indices[f]]:.4f}")
print()
# 4. Gradient Boosting feature importance
print("4. Gradient Boosting feature importance:")
gboost = GradientBoostingRegressor(n_estimators=100, random_state=42)
gboost.fit(X_train, y_train)
importances = gboost.feature_importances_
indices = np.argsort(importances)[::-1]
print("Feature importance ranking:")
for f in range(X.shape[1]):
    print(f"{f+1}. {feature_names[indices[f]]}: {importances[indices[f]]:.4f}")
print()
# 5. Visualize feature importance
print("5. Visualizing feature importance:")
plt.figure(figsize=(12, 6))
# Random Forest (recompute from the random forest model; at this point
# `importances`/`indices` still hold the gradient boosting values)
plt.subplot(1, 2, 1)
rf_importances = rforest.feature_importances_
rf_indices = np.argsort(rf_importances)[::-1]
sorted_importances = rf_importances[rf_indices]
sorted_features = [feature_names[i] for i in rf_indices]
plt.bar(range(X.shape[1]), sorted_importances)
plt.xticks(range(X.shape[1]), sorted_features, rotation=45)
plt.title('Random Forest Feature Importances')
plt.tight_layout()
# Lasso coefficients
plt.subplot(1, 2, 2)
lasso = Lasso(alpha=0.01, random_state=42)
lasso.fit(X_train_scaled, y_train)
coeffs = lasso.coef_
plt.bar(range(X.shape[1]), coeffs)
plt.xticks(range(X.shape[1]), feature_names, rotation=45)
plt.title('Lasso Coefficients')
plt.tight_layout()
plt.show()
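A convenient way to turn embedded methods into an actual selection step is SelectFromModel, which wraps any estimator exposing coef_ or feature_importances_; a minimal sketch, reusing the scaled training split from above (the alpha value is simply the one used earlier in this section):
from sklearn.feature_selection import SelectFromModel
# Keep features whose Lasso coefficients survive SelectFromModel's default threshold
sfm = SelectFromModel(Lasso(alpha=0.01, random_state=42))
sfm.fit(X_train_scaled, y_train)
selected_sfm = [name for name, keep in zip(feature_names, sfm.get_support()) if keep]
print(f"Features kept by SelectFromModel: {selected_sfm}")
X_train_selected = sfm.transform(X_train_scaled)
print(f"Training data shape after selection: {X_train_selected.shape}")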
Feature Extraction
1. Principal Component Analysis (PCA)
Principal Component Analysis is a linear dimensionality reduction technique that transforms the original feature space into a set of linearly uncorrelated principal components, preserving the most important variance in the data.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# Load dataset
digits = load_digits()
X = digits.data
y = digits.target
print(f"Original data shape: {X.shape}")
print(f"Number of classes: {len(np.unique(y))}")
print()
# Standardization
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Apply PCA
print("Applying PCA:")
pca = PCA()
pca.fit(X_scaled)
# Calculate cumulative explained variance ratio
cumulative_variance_ratio = np.cumsum(pca.explained_variance_ratio_)
# Visualize explained variance ratio
plt.figure(figsize=(12, 6))
plt.plot(range(1, len(cumulative_variance_ratio) + 1), cumulative_variance_ratio, 'o-')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance Ratio')
plt.title('PCA Cumulative Explained Variance')
plt.grid(True)
plt.axhline(y=0.95, color='r', linestyle='--')
plt.axhline(y=0.90, color='g', linestyle='--')
plt.show()
# Select number of components to retain 95% variance
n_components_95 = np.argmax(cumulative_variance_ratio >= 0.95) + 1
print(f"Number of components needed to retain 95% variance: {n_components_95}")
# Select number of components to retain 90% variance
n_components_90 = np.argmax(cumulative_variance_ratio >= 0.90) + 1
print(f"Number of components needed to retain 90% variance: {n_components_90}")
print()
# Use PCA for dimensionality reduction
pca = PCA(n_components=n_components_95)
X_pca = pca.fit_transform(X_scaled)
print(f"Data shape after dimensionality reduction: {X_pca.shape}")
print()
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_pca, y, test_size=0.2, random_state=42)
# Train model
model = LogisticRegression(max_iter=1000, random_state=42)
model.fit(X_train, y_train)
# Evaluate model
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Model accuracy: {accuracy:.4f}")
print()
# Visualize first two principal components
plt.figure(figsize=(12, 10))
scatter = plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='tab10', s=50, alpha=0.8)
plt.colorbar(scatter)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('Digits Dataset Visualization with PCA')
plt.grid(True)
plt.show()
2. t-SNE
t-SNE (t-Distributed Stochastic Neighbor Embedding) is a nonlinear dimensionality reduction technique particularly suitable for visualizing high-dimensional data. It preserves local similarities between data points to construct low-dimensional representations.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler
# Load dataset
digits = load_digits()
X = digits.data
y = digits.target
print(f"Original data shape: {X.shape}")
print(f"Number of classes: {len(np.unique(y))}")
print()
# Standardization
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Apply t-SNE
tSNE = TSNE(n_components=2, perplexity=30, learning_rate=200, random_state=42)
X_tsne = tSNE.fit_transform(X_scaled)
print(f"Data shape after t-SNE dimensionality reduction: {X_tsne.shape}")
print()
# Visualize t-SNE results
plt.figure(figsize=(12, 10))
scatter = plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap='tab10', s=50, alpha=0.8)
plt.colorbar(scatter)
plt.xlabel('t-SNE Component 1')
plt.ylabel('t-SNE Component 2')
plt.title('Digits Dataset Visualization with t-SNE')
plt.grid(True)
plt.show()
# Try different perplexity values
perplexities = [5, 10, 30, 50]
plt.figure(figsize=(20, 15))
for i, perplexity in enumerate(perplexities):
    plt.subplot(2, 2, i+1)
    tSNE = TSNE(n_components=2, perplexity=perplexity, learning_rate=200, random_state=42)
    X_tsne = tSNE.fit_transform(X_scaled)
    scatter = plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap='tab10', s=50, alpha=0.8)
    plt.colorbar(scatter)
    plt.xlabel('t-SNE Component 1')
    plt.ylabel('t-SNE Component 2')
    plt.title(f'Digits Dataset Visualization with t-SNE (perplexity={perplexity})')
    plt.grid(True)
plt.tight_layout()
plt.show()
3. LDA (Linear Discriminant Analysis)
LDA (Linear Discriminant Analysis) is a supervised dimensionality reduction technique that builds discriminants by maximizing between-class variance and minimizing within-class variance.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# Load dataset
digits = load_digits()
X = digits.data
y = digits.target
print(f"Original data shape: {X.shape}")
print(f"Number of classes: {len(np.unique(y))}")
print()
# Standardization
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Apply LDA
print("Applying LDA:")
lda = LinearDiscriminantAnalysis()
X_lda = lda.fit_transform(X_scaled, y)
print(f"Data shape after LDA dimensionality reduction: {X_lda.shape}")
print(f"Explained variance ratio: {lda.explained_variance_ratio_}")
print(f"Cumulative explained variance: {np.sum(lda.explained_variance_ratio_):.4f}")
print()
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_lda, y, test_size=0.2, random_state=42)
# Train model
model = LogisticRegression(max_iter=1000, random_state=42)
model.fit(X_train, y_train)
# Evaluate model
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Model accuracy: {accuracy:.4f}")
print()
# Visualize first two discriminant components
plt.figure(figsize=(12, 10))
scatter = plt.scatter(X_lda[:, 0], X_lda[:, 1], c=y, cmap='tab10', s=50, alpha=0.8)
plt.colorbar(scatter)
plt.xlabel('LD 1')
plt.ylabel('LD 2')
plt.title('Digits Dataset Visualization with LDA')
plt.grid(True)
plt.show()
Practical Case: Complete Feature Engineering Workflow
In this practical case, we'll use the Titanic dataset to demonstrate the complete feature engineering workflow, including feature creation, feature selection, and model training.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
# Load dataset
url = 'https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv'
df = pd.read_csv(url)
print("Original data shape:", df.shape)
print("Original data columns:", df.columns.tolist())
print()
# View first few rows
print("First few rows:")
print(df.head())
print()
# View data information
print("Data information:")
print(df.info())
print()
# View missing values
print("Missing value statistics:")
print(df.isnull().sum())
print()
# 1. Feature creation
print("1. Feature creation")
print("-" * 50)
# Extract title from name
df['Title'] = df['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
print("Title statistics:")
print(df['Title'].value_counts())
print()
# Combine rare titles
title_replacements = {
    'Mlle': 'Miss', 'Ms': 'Miss', 'Mme': 'Mrs',
    'Dr': 'Rare', 'Rev': 'Rare', 'Col': 'Rare',
    'Major': 'Rare', 'Sir': 'Rare', 'Lady': 'Rare',
    'Capt': 'Rare', 'Countess': 'Rare', 'Don': 'Rare',
    'Jonkheer': 'Rare', 'Dona': 'Rare'
}
df['Title'] = df['Title'].replace(title_replacements)
print("Title statistics after combining:")
print(df['Title'].value_counts())
print()
# Extract deck from cabin
df['Deck'] = df['Cabin'].str[0]
print("Deck statistics:")
print(df['Deck'].value_counts())
print()
# Calculate family size
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1
print("Family size statistics:")
print(df['FamilySize'].value_counts())
print()
# Create family category
df['FamilyCategory'] = pd.cut(df['FamilySize'], bins=[0, 1, 4, 11], labels=['Single', 'Small', 'Large'])
print("Family category statistics:")
print(df['FamilyCategory'].value_counts())
print()
# Create age category
df['AgeCategory'] = pd.cut(df['Age'], bins=[0, 12, 18, 60, 100], labels=['Child', 'Teen', 'Adult', 'Elder'])
print("Age category statistics:")
print(df['AgeCategory'].value_counts())
print()
# Create fare category
df['FareCategory'] = pd.qcut(df['Fare'], q=4, labels=['Low', 'Medium', 'High', 'VeryHigh'])
print("Fare category statistics:")
print(df['FareCategory'].value_counts())
print()
# 2. Feature selection
print("2. Feature selection")
print("-" * 50)
# Select features
features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked',
'Title', 'Deck', 'FamilySize', 'FamilyCategory', 'AgeCategory', 'FareCategory']
X = df[features]
y = df['Survived']
print(f"Number of features: {len(features)}")
print(f"Feature list: {features}")
print()
# 3. Data preprocessing and model training
print("3. Data preprocessing and model training")
print("-" * 50)
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"Training set shape: {X_train.shape}")
print(f"Test set shape: {X_test.shape}")
print()
# Define preprocessing steps
numeric_features = ['Age', 'SibSp', 'Parch', 'Fare', 'FamilySize']
categorical_features = ['Pclass', 'Sex', 'Embarked', 'Title', 'Deck', 'FamilyCategory', 'AgeCategory', 'FareCategory']
preprocessor = ColumnTransformer(
    transformers=[
        ('num', Pipeline([
            ('imputer', SimpleImputer(strategy='median')),
            ('scaler', StandardScaler())
        ]), numeric_features),
        ('cat', Pipeline([
            ('imputer', SimpleImputer(strategy='most_frequent')),
            ('encoder', OneHotEncoder(handle_unknown='ignore'))
        ]), categorical_features)
    ]
)
# Create complete pipeline
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('feature_selector', SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=42))),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])
# Train model
pipeline.fit(X_train, y_train)
# Evaluate model
y_pred = pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Model accuracy: {accuracy:.4f}")
print()
print("Classification report:")
print(classification_report(y_test, y_pred))
print()
print("Confusion matrix:")
print(confusion_matrix(y_test, y_pred))
print()
# Cross-validation
cv_scores = cross_val_score(pipeline, X, y, cv=5, scoring='accuracy')
print(f"Cross-validation accuracy: {cv_scores.mean():.4f} (±{cv_scores.std():.4f})")
print()
# Feature importance
# Get preprocessed feature names
onehot_features = pipeline.named_steps['preprocessor'].named_transformers_['cat'].named_steps['encoder'].get_feature_names_out(categorical_features)
all_features = numeric_features + list(onehot_features)
# Get feature selector
feature_selector = pipeline.named_steps['feature_selector']
selected_features_mask = feature_selector.get_support()
selected_features = [feature for feature, selected in zip(all_features, selected_features_mask) if selected]
# Get classifier
classifier = pipeline.named_steps['classifier']
importances = classifier.feature_importances_
# Sort feature importance
feature_importance_df = pd.DataFrame({'Feature': selected_features, 'Importance': importances})
feature_importance_df = feature_importance_df.sort_values('Importance', ascending=False)
print("Feature importance ranking:")
print(feature_importance_df)
print()
# Visualize feature importance
plt.figure(figsize=(12, 10))
plt.barh(feature_importance_df['Feature'], feature_importance_df['Importance'])
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.title('Feature Importance')
plt.grid(True)
plt.tight_layout()
plt.show()
Interactive Exercises
Exercise 1: Feature Creation
Use scikit-learn's california_housing dataset to create new features and evaluate their impact on model performance; a starter sketch follows the steps below.
- Load the california_housing dataset
- Create the following new features:
- Rooms per population ratio
- Bedrooms per room ratio
- Population density (population per household)
- Income category (using quantiles)
- Age category (using quantiles)
- Train a linear regression model, comparing performance with original features and with added features
- Evaluate model performance using cross-validation
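A hedged starter sketch for this exercise (the column names come from fetch_california_housing; the ratio definitions and new feature names are illustrative choices, not the only ones possible):
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
housing = fetch_california_housing(as_frame=True)
df = housing.frame
# New features (ratios and quantile categories are illustrative choices)
df['RoomsPerPerson'] = df['AveRooms'] / df['AveOccup']
df['BedrmsPerRoom'] = df['AveBedrms'] / df['AveRooms']
df['IncomeCategory'] = pd.qcut(df['MedInc'], q=4, labels=False)
df['AgeCategory'] = pd.qcut(df['HouseAge'], q=4, labels=False, duplicates='drop')
base_features = ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude']
new_features = base_features + ['RoomsPerPerson', 'BedrmsPerRoom', 'IncomeCategory', 'AgeCategory']
# Compare cross-validated R^2 with and without the engineered features
model = LinearRegression()
for name, cols in [('original', base_features), ('engineered', new_features)]:
    score = cross_val_score(model, df[cols], df['MedHouseVal'], cv=5, scoring='r2').mean()
    print(f"{name} features: mean R^2 = {score:.4f}")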
Exercise 2: Feature Selection
Use scikit-learn's breast_cancer dataset to apply different feature selection methods and compare results; a starter sketch follows the steps below.
- Load the breast_cancer dataset
- Apply the following feature selection methods:
- Variance threshold
- Pearson correlation coefficient
- Mutual information
- Recursive Feature Elimination (RFE)
- Random Forest feature importance
- For each method, select the top 10 most important features
- Train a logistic regression model and compare performance with different feature subsets
- Visualize feature importance rankings
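A hedged starter sketch for this exercise, comparing two of the listed filter methods; extending it to RFE and Random Forest importance follows the same pipeline pattern:
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
X, y = load_breast_cancer(return_X_y=True)
# Compare logistic regression on the top-10 features chosen by two filter methods
for name, score_func in [('F-test', f_classif), ('mutual information', mutual_info_classif)]:
    pipe = make_pipeline(StandardScaler(),
                         SelectKBest(score_func=score_func, k=10),
                         LogisticRegression(max_iter=1000))
    score = cross_val_score(pipe, X, y, cv=5, scoring='accuracy').mean()
    print(f"{name} (top 10 features): accuracy = {score:.4f}")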
Exercise 3: Feature Extraction
Use scikit-learn's digits dataset to apply different feature extraction methods and compare results; a starter sketch follows the steps below.
- Load the digits dataset
- Apply the following feature extraction methods:
- PCA (retaining 95% variance)
- t-SNE (n_components=2)
- LDA
- For each method, train a classification model and evaluate performance
- Visualize the dimensionality reduction results
- Compare the performance and visualization results of different methods
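A hedged starter sketch for this exercise, comparing PCA and LDA as preprocessing for a classifier (t-SNE is usually kept for visualization only, since it provides no transform for unseen data):
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
X, y = load_digits(return_X_y=True)
# Compare a classifier on PCA components (95% variance) vs. LDA components
for name, reducer in [('PCA (95% variance)', PCA(n_components=0.95)),
                      ('LDA', LinearDiscriminantAnalysis())]:
    pipe = make_pipeline(StandardScaler(), reducer, LogisticRegression(max_iter=1000))
    score = cross_val_score(pipe, X, y, cv=5, scoring='accuracy').mean()
    print(f"{name}: accuracy = {score:.4f}")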
Recommended Tutorials
Model Evaluation
Learn how to evaluate machine learning model performance, choose appropriate evaluation metrics, and avoid overfitting and underfitting
Model Deployment
Learn how to deploy trained machine learning models to production environments