Feature Engineering Overview

Feature engineering is a critical step in the machine learning workflow that involves creating, selecting, and transforming features to improve model performance and generalization. Good feature engineering can:

  • Improve model prediction accuracy
  • Reduce model training time
  • Enhance model generalization and reduce overfitting
  • Make models more stable and interpretable

Feature engineering mainly includes the following aspects:

  • Feature creation: Creating new features from raw data
  • Feature selection: Selecting features that are most helpful for prediction
  • Feature extraction: Extracting low-dimensional representations from high-dimensional data
  • Feature transformation: Transforming features through standardization, normalization, etc.
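
Feature transformation is the only item above that does not get its own code walkthrough later in this tutorial, so here is a minimal sketch of standardization and min-max normalization, assuming scikit-learn's StandardScaler and MinMaxScaler; the sample values are made up purely for illustration:

import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Toy data (made-up values, purely for illustration)
data = pd.DataFrame({
    'age': [25, 30, 35, 40, 45],
    'income': [50000, 60000, 70000, 80000, 90000]
})

# Standardization: rescale each column to zero mean and unit variance
scaler = StandardScaler()
standardized = pd.DataFrame(scaler.fit_transform(data), columns=data.columns)

# Normalization: rescale each column to the [0, 1] range
minmax = MinMaxScaler()
normalized = pd.DataFrame(minmax.fit_transform(data), columns=data.columns)

print("Standardized:")
print(standardized)
print("Normalized:")
print(normalized)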

Feature Creation

1. Numeric Feature Creation

Numeric feature creation involves creating new features from raw numeric data. Common methods include:

  • Polynomial features
  • Interaction features
  • Statistical features (mean, variance, quantiles, etc.)
  • Time features (year, month, day, week, etc.)

import numpy as np
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

# Create sample data
data = pd.DataFrame({
    'age': [25, 30, 35, 40, 45],
    'income': [50000, 60000, 70000, 80000, 90000],
    'expenses': [30000, 35000, 40000, 45000, 50000]
})

print("Original data:")
print(data)
print()

# Create new feature: savings rate
data['savings_rate'] = (data['income'] - data['expenses']) / data['income']

# Create new feature: age-income interaction
data['age_income'] = data['age'] * data['income']

# Create polynomial features from age and income
poly = PolynomialFeatures(degree=2, include_bias=False)
poly_features = poly.fit_transform(data[['age', 'income']])
poly_feature_names = poly.get_feature_names_out(['age', 'income'])
poly_data = pd.DataFrame(poly_features, columns=poly_feature_names)

# Drop the degree-1 terms, which would duplicate the original columns
poly_data = poly_data.drop(columns=['age', 'income'])
data = pd.concat([data, poly_data], axis=1)

print("Data after adding new features:")
print(data)
print()

# Create statistical features
data['income_mean'] = data['income'].mean()
data['income_std'] = data['income'].std()
data['income_min'] = data['income'].min()
data['income_max'] = data['income'].max()

print("Data after adding statistical features:")
print(data)

2. Categorical Feature Creation

Categorical feature creation involves creating new features from raw categorical data. Common methods include:

  • Frequency encoding
  • Target encoding
  • WOE (Weight of Evidence) encoding
  • Combination features

import pandas as pd
import numpy as np

# Create sample data
data = pd.DataFrame({
    'gender': ['M', 'F', 'M', 'F', 'M'],
    'education': ['High School', 'College', 'College', 'Graduate', 'Graduate'],
    'occupation': ['Engineer', 'Teacher', 'Doctor', 'Engineer', 'Doctor'],
    'salary': [60000, 50000, 80000, 70000, 90000]
})

print("Original data:")
print(data)
print()

# Frequency encoding
gender_counts = data['gender'].value_counts()
data['gender_freq'] = data['gender'].map(gender_counts)

education_counts = data['education'].value_counts()
data['education_freq'] = data['education'].map(education_counts)

# Target encoding (based on salary mean)
gender_target = data.groupby('gender')['salary'].mean()
data['gender_target'] = data['gender'].map(gender_target)

education_target = data.groupby('education')['salary'].mean()
data['education_target'] = data['education'].map(education_target)

# Combination feature
data['education_occupation'] = data['education'] + '_' + data['occupation']

print("Data after adding new features:")
print(data)
print()

# WOE encoding (simplified version)
# We use a binary classification example to demonstrate WOE
data['high_income'] = (data['salary'] > 65000).astype(int)

# Calculate WOE for each categorical feature
for feature in ['gender', 'education', 'occupation']:
    woe_dict = {}
    for category in data[feature].unique():
        # Count high- and low-income rows in this category
        good = len(data[(data[feature] == category) & (data['high_income'] == 1)])
        bad = len(data[(data[feature] == category) & (data['high_income'] == 0)])
        # Avoid division by zero
        good = max(good, 0.5)
        bad = max(bad, 0.5)
        # Calculate WOE
        woe = np.log(good / bad)
        woe_dict[category] = woe
    # Add WOE feature
    data[f'{feature}_woe'] = data[feature].map(woe_dict)

print("Data after adding WOE features:")
print(data)

3. Time Feature Creation

Time feature creation involves extracting meaningful features from time data. Common methods include:

  • Extracting year, month, day, hour, minute, second
  • Extracting day of week, month name
  • Extracting weekend/weekday, holiday indicators
  • Calculating time differences
  • Extracting seasonal features

import pandas as pd
import numpy as np

# Create sample time data
dates = pd.date_range('2023-01-01', '2023-01-10', freq='D')
data = pd.DataFrame({
    'date': dates,
    'sales': np.random.randint(100, 1000, size=len(dates))
})

print("Original data:")
print(data)
print()

# Extract time features
data['year'] = data['date'].dt.year
data['month'] = data['date'].dt.month
data['day'] = data['date'].dt.day
data['day_of_week'] = data['date'].dt.dayofweek
data['weekday_name'] = data['date'].dt.day_name()
data['is_weekend'] = data['day_of_week'].apply(lambda x: 1 if x >= 5 else 0)
data['is_month_start'] = data['date'].dt.is_month_start.astype(int)
data['is_month_end'] = data['date'].dt.is_month_end.astype(int)

# Calculate time differences
data['days_since_start'] = (data['date'] - data['date'].min()).dt.days

# Extract seasonal features (using sine and cosine transformations)
data['day_sin'] = np.sin(2 * np.pi * data['day'] / 31)
data['day_cos'] = np.cos(2 * np.pi * data['day'] / 31)
data['month_sin'] = np.sin(2 * np.pi * data['month'] / 12)
data['month_cos'] = np.cos(2 * np.pi * data['month'] / 12)

print("Data after adding time features:")
print(data)
print()

# Check data types
print("Data types:")
print(data.dtypes)

Feature Selection

1. Filter-Based Feature Selection

Filter-based feature selection selects features by calculating statistical relationships between features and the target variable. Common methods include:

  • Pearson correlation coefficient
  • Chi-square test
  • Mutual information
  • Variance threshold

import pandas as pd
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectKBest, f_regression, mutual_info_regression, VarianceThreshold

# Load dataset
diabetes = load_diabetes()
X = diabetes.data
y = diabetes.target
feature_names = diabetes.feature_names

print(f"Original number of features: {X.shape[1]}")
print(f"Feature names: {feature_names}")
print()

# 1. Variance threshold
print("1. Variance threshold selection:")
selector = VarianceThreshold(threshold=0.01)
X_var = selector.fit_transform(X)
selected_features_var = [feature_names[i] for i in range(len(feature_names)) if selector.get_support()[i]]
print(f"Number of selected features: {X_var.shape[1]}")
print(f"Selected features: {selected_features_var}")
print()

# 2. Pearson correlation coefficient (via the univariate F-test in f_regression)
print("2. Pearson correlation coefficient selection:")
selector = SelectKBest(score_func=f_regression, k=5)
X_f = selector.fit_transform(X, y)
scores = selector.scores_
selected_features_f = [feature_names[i] for i in range(len(feature_names)) if selector.get_support()[i]]
print(f"Number of selected features: {X_f.shape[1]}")
print(f"Selected features: {selected_features_f}")
print("Feature scores:")
for feature, score in zip(feature_names, scores):
    print(f"{feature}: {score:.4f}")
print()

# 3. Mutual information
print("3. Mutual information selection:")
selector = SelectKBest(score_func=mutual_info_regression, k=5)
X_mi = selector.fit_transform(X, y)
mi_scores = selector.scores_
selected_features_mi = [feature_names[i] for i in range(len(feature_names)) if selector.get_support()[i]]
print(f"Number of selected features: {X_mi.shape[1]}")
print(f"Selected features: {selected_features_mi}")
print("Feature scores:")
for feature, score in zip(feature_names, mi_scores):
    print(f"{feature}: {score:.4f}")
print()

# 4. Correlation matrix
print("4. Correlation matrix:")
df = pd.DataFrame(X, columns=feature_names)
df['target'] = y
corr_matrix = df.corr()
print("Correlation between features and target:")
print(corr_matrix['target'].sort_values(ascending=False))
print()

# 5. Remove highly correlated features
print("5. Removing highly correlated features:")
corr_threshold = 0.5
high_corr_features = set()
# Iterate over feature columns only (the appended target column is excluded)
for i in range(len(feature_names)):
    for j in range(i):
        if abs(corr_matrix.iloc[i, j]) > corr_threshold:
            feature_i = corr_matrix.columns[i]
            feature_j = corr_matrix.columns[j]
            # Keep the feature with the higher correlation to the target
            if abs(corr_matrix.loc[feature_i, 'target']) > abs(corr_matrix.loc[feature_j, 'target']):
                high_corr_features.add(feature_j)
            else:
                high_corr_features.add(feature_i)

print(f"Highly correlated features to remove: {high_corr_features}")
selected_features_corr = [feature for feature in feature_names if feature not in high_corr_features]
print(f"Selected features: {selected_features_corr}")

2. Wrapper-Based Feature Selection

Wrapper-based feature selection selects features by evaluating model performance on different feature subsets. Common methods include:

  • Recursive Feature Elimination (RFE)
  • Recursive Feature Elimination with Cross-Validation (RFECV)
  • Sequential Feature Selection (SFS)

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import RFE, RFECV
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, cross_val_score

# Load dataset
diabetes = load_diabetes()
X = diabetes.data
y = diabetes.target
feature_names = diabetes.feature_names

print(f"Original number of features: {X.shape[1]}")
print(f"Feature names: {feature_names}")
print()

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 1. Recursive Feature Elimination (RFE)
print("1. Recursive Feature Elimination (RFE):")
model = LinearRegression()
rfe = RFE(estimator=model, n_features_to_select=5, step=1)
rfe.fit(X_train, y_train)
selected_features_rfe = [feature_names[i] for i in range(len(feature_names)) if rfe.support_[i]]
print(f"Number of selected features: {rfe.n_features_}")
print(f"Selected features: {selected_features_rfe}")
print("Feature rankings:")
for feature, rank in zip(feature_names, rfe.ranking_):
    print(f"{feature}: {rank}")
print()

# 2. Recursive Feature Elimination with Cross-Validation (RFECV)
print("2. Recursive Feature Elimination with Cross-Validation (RFECV):")
model = RandomForestRegressor(n_estimators=100, random_state=42)
rfecv = RFECV(estimator=model, step=1, cv=5, scoring='r2')
rfecv.fit(X_train, y_train)
selected_features_rfecv = [feature_names[i] for i in range(len(feature_names)) if rfecv.support_[i]]
print(f"Optimal number of features: {rfecv.n_features_}")
print(f"Selected features: {selected_features_rfecv}")
# grid_scores_ was removed in recent scikit-learn versions; use cv_results_ instead
print(f"Mean cross-validation score: {rfecv.cv_results_['mean_test_score'].mean():.4f}")
print()

# 3. Manual implementation of Sequential Feature Selection (SFS)
print("3. Sequential Feature Selection (SFS):")

def sequential_feature_selection(X, y, model, n_features, direction='forward'):
    """Sequential feature selection.

    direction: 'forward' or 'backward'
    """
    n_features_total = X.shape[1]
    if direction == 'forward':
        # Forward selection: start empty and greedily add the feature that
        # improves the cross-validation score the most
        selected_features = []
        remaining_features = list(range(n_features_total))
        while len(selected_features) < n_features:
            best_score = -np.inf
            best_feature = None
            for feature in remaining_features:
                # Temporary feature set
                temp_features = selected_features + [feature]
                # Cross-validation
                score = cross_val_score(model, X[:, temp_features], y, cv=5, scoring='r2').mean()
                # Update best feature
                if score > best_score:
                    best_score = score
                    best_feature = feature
            if best_feature is not None:
                selected_features.append(best_feature)
                remaining_features.remove(best_feature)
                print(f"Adding feature {feature_names[best_feature]}, score: {best_score:.4f}")
    elif direction == 'backward':
        # Backward selection: start with all features and drop the feature
        # whose removal hurts the cross-validation score the least
        selected_features = list(range(n_features_total))
        while len(selected_features) > n_features:
            best_score = -np.inf
            worst_feature = None
            for feature in selected_features:
                # Temporary feature set without this feature
                temp_features = [f for f in selected_features if f != feature]
                # Cross-validation
                score = cross_val_score(model, X[:, temp_features], y, cv=5, scoring='r2').mean()
                # The feature whose removal leaves the best score is the least useful
                if score > best_score:
                    best_score = score
                    worst_feature = feature
            if worst_feature is not None:
                selected_features.remove(worst_feature)
                print(f"Removing feature {feature_names[worst_feature]}, score: {best_score:.4f}")
    return selected_features

# Use forward selection
print("Forward selection:")
model = LinearRegression()
selected_indices = sequential_feature_selection(X_train, y_train, model, n_features=5, direction='forward')
selected_features_sfs = [feature_names[i] for i in selected_indices]
print(f"Selected features: {selected_features_sfs}")
print()

# Use backward selection
print("Backward selection:")
selected_indices = sequential_feature_selection(X_train, y_train, model, n_features=5, direction='backward')
selected_features_sbs = [feature_names[i] for i in selected_indices]
print(f"Selected features: {selected_features_sbs}")

3. Embedded Feature Selection

Embedded feature selection automatically selects features during model training. Common methods include:

  • L1 regularization (Lasso)
  • Tree model feature importance
  • Elastic Net

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso, ElasticNet
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load dataset
diabetes = load_diabetes()
X = diabetes.data
y = diabetes.target
feature_names = diabetes.feature_names

print(f"Original number of features: {X.shape[1]}")
print(f"Feature names: {feature_names}")
print()

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardization
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 1. L1 regularization (Lasso)
print("1. L1 regularization (Lasso):")
alphas = [0.001, 0.01, 0.1, 1, 10]
for alpha in alphas:
    lasso = Lasso(alpha=alpha, random_state=42)
    lasso.fit(X_train_scaled, y_train)
    coeffs = lasso.coef_
    non_zero_coeffs = np.sum(coeffs != 0)
    print(f"Alpha={alpha}: Number of non-zero coefficients={non_zero_coeffs}")
    print("Coefficients:")
    for feature, coeff in zip(feature_names, coeffs):
        print(f"{feature}: {coeff:.4f}")
    print()

# 2. Elastic Net
print("2. Elastic Net:")
elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5, random_state=42)
elastic_net.fit(X_train_scaled, y_train)
coeffs = elastic_net.coef_
non_zero_coeffs = np.sum(coeffs != 0)
print(f"Number of non-zero coefficients: {non_zero_coeffs}")
print("Coefficients:")
for feature, coeff in zip(feature_names, coeffs):
    print(f"{feature}: {coeff:.4f}")
print()

# 3. Random Forest feature importance
print("3. Random Forest feature importance:")
rforest = RandomForestRegressor(n_estimators=100, random_state=42)
rforest.fit(X_train, y_train)
importances = rforest.feature_importances_
indices = np.argsort(importances)[::-1]
print("Feature importance ranking:")
for f in range(X.shape[1]):
    print(f"{f+1}. {feature_names[indices[f]]}: {importances[indices[f]]:.4f}")
print()

# 4. Gradient Boosting feature importance
print("4. Gradient Boosting feature importance:")
gboost = GradientBoostingRegressor(n_estimators=100, random_state=42)
gboost.fit(X_train, y_train)
importances = gboost.feature_importances_
indices = np.argsort(importances)[::-1]
print("Feature importance ranking:")
for f in range(X.shape[1]):
    print(f"{f+1}. {feature_names[indices[f]]}: {importances[indices[f]]:.4f}")
print()

# 5. Visualize feature importance
print("5. Visualizing feature importance:")
plt.figure(figsize=(12, 6))

# Random Forest (re-derive importances from the random forest, not the gradient boosting model)
plt.subplot(1, 2, 1)
rf_importances = rforest.feature_importances_
rf_indices = np.argsort(rf_importances)[::-1]
sorted_importances = rf_importances[rf_indices]
sorted_features = [feature_names[i] for i in rf_indices]
plt.bar(range(X.shape[1]), sorted_importances)
plt.xticks(range(X.shape[1]), sorted_features, rotation=45)
plt.title('Random Forest Feature Importances')
plt.tight_layout()

# Lasso coefficients
plt.subplot(1, 2, 2)
lasso = Lasso(alpha=0.01, random_state=42)
lasso.fit(X_train_scaled, y_train)
coeffs = lasso.coef_
plt.bar(range(X.shape[1]), coeffs)
plt.xticks(range(X.shape[1]), feature_names, rotation=45)
plt.title('Lasso Coefficients')
plt.tight_layout()
plt.show()

Feature Extraction

1. Principal Component Analysis (PCA)

Principal Component Analysis is a linear dimensionality reduction technique that transforms the original feature space into a set of linearly uncorrelated principal components, preserving the most important variance in the data.


import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load dataset
digits = load_digits()
X = digits.data
y = digits.target

print(f"Original data shape: {X.shape}")
print(f"Number of classes: {len(np.unique(y))}")
print()

# Standardization
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply PCA
print("Applying PCA:")
pca = PCA()
pca.fit(X_scaled)

# Calculate cumulative explained variance ratio
cumulative_variance_ratio = np.cumsum(pca.explained_variance_ratio_)

# Visualize explained variance ratio
plt.figure(figsize=(12, 6))
plt.plot(range(1, len(cumulative_variance_ratio) + 1), cumulative_variance_ratio, 'o-')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance Ratio')
plt.title('PCA Cumulative Explained Variance')
plt.grid(True)
plt.axhline(y=0.95, color='r', linestyle='--')
plt.axhline(y=0.90, color='g', linestyle='--')
plt.show()

# Select number of components to retain 95% variance
n_components_95 = np.argmax(cumulative_variance_ratio >= 0.95) + 1
print(f"Number of components needed to retain 95% variance: {n_components_95}")

# Select number of components to retain 90% variance
n_components_90 = np.argmax(cumulative_variance_ratio >= 0.90) + 1
print(f"Number of components needed to retain 90% variance: {n_components_90}")
print()

# Use PCA for dimensionality reduction
pca = PCA(n_components=n_components_95)
X_pca = pca.fit_transform(X_scaled)
print(f"Data shape after dimensionality reduction: {X_pca.shape}")
print()

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_pca, y, test_size=0.2, random_state=42)

# Train model
model = LogisticRegression(max_iter=1000, random_state=42)
model.fit(X_train, y_train)

# Evaluate model
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Model accuracy: {accuracy:.4f}")
print()

# Visualize first two principal components
plt.figure(figsize=(12, 10))
scatter = plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='tab10', s=50, alpha=0.8)
plt.colorbar(scatter)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('Digits Dataset Visualization with PCA')
plt.grid(True)
plt.show()

2. t-SNE

t-SNE (t-Distributed Stochastic Neighbor Embedding) is a nonlinear dimensionality reduction technique particularly suitable for visualizing high-dimensional data. It preserves local similarities between data points to construct low-dimensional representations.


import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Load dataset
digits = load_digits()
X = digits.data
y = digits.target

print(f"Original data shape: {X.shape}")
print(f"Number of classes: {len(np.unique(y))}")
print()

# Standardization
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply t-SNE
tsne = TSNE(n_components=2, perplexity=30, learning_rate=200, random_state=42)
X_tsne = tsne.fit_transform(X_scaled)

print(f"Data shape after t-SNE dimensionality reduction: {X_tsne.shape}")
print()

# Visualize t-SNE results
plt.figure(figsize=(12, 10))
scatter = plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap='tab10', s=50, alpha=0.8)
plt.colorbar(scatter)
plt.xlabel('t-SNE Component 1')
plt.ylabel('t-SNE Component 2')
plt.title('Digits Dataset Visualization with t-SNE')
plt.grid(True)
plt.show()

# Try different perplexity values
perplexities = [5, 10, 30, 50]
plt.figure(figsize=(20, 15))
for i, perplexity in enumerate(perplexities):
    plt.subplot(2, 2, i + 1)
    tsne = TSNE(n_components=2, perplexity=perplexity, learning_rate=200, random_state=42)
    X_tsne = tsne.fit_transform(X_scaled)
    scatter = plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap='tab10', s=50, alpha=0.8)
    plt.colorbar(scatter)
    plt.xlabel('t-SNE Component 1')
    plt.ylabel('t-SNE Component 2')
    plt.title(f'Digits Dataset Visualization with t-SNE (perplexity={perplexity})')
    plt.grid(True)
plt.tight_layout()
plt.show()

3. LDA (Linear Discriminant Analysis)

LDA (Linear Discriminant Analysis) is a supervised dimensionality reduction technique that projects the data onto directions (linear discriminants) that maximize between-class variance while minimizing within-class variance.


import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load dataset
digits = load_digits()
X = digits.data
y = digits.target

print(f"Original data shape: {X.shape}")
print(f"Number of classes: {len(np.unique(y))}")
print()

# Standardization
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply LDA (supervised, so the labels are passed to fit_transform;
# at most n_classes - 1 components are produced)
print("Applying LDA:")
lda = LinearDiscriminantAnalysis()
X_lda = lda.fit_transform(X_scaled, y)

print(f"Data shape after LDA dimensionality reduction: {X_lda.shape}")
print(f"Explained variance ratio: {lda.explained_variance_ratio_}")
print(f"Cumulative explained variance: {np.sum(lda.explained_variance_ratio_):.4f}")
print()

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_lda, y, test_size=0.2, random_state=42)

# Train model
model = LogisticRegression(max_iter=1000, random_state=42)
model.fit(X_train, y_train)

# Evaluate model
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Model accuracy: {accuracy:.4f}")
print()

# Visualize first two discriminant components
plt.figure(figsize=(12, 10))
scatter = plt.scatter(X_lda[:, 0], X_lda[:, 1], c=y, cmap='tab10', s=50, alpha=0.8)
plt.colorbar(scatter)
plt.xlabel('LD 1')
plt.ylabel('LD 2')
plt.title('Digits Dataset Visualization with LDA')
plt.grid(True)
plt.show()

Practical Case: Complete Feature Engineering Workflow

In this practical case, we'll use the Titanic dataset to demonstrate the complete feature engineering workflow, including feature creation, feature selection, and model training.


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Load dataset
url = 'https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv'
df = pd.read_csv(url)

print("Original data shape:", df.shape)
print("Original data columns:", df.columns.tolist())
print()

# View first few rows
print("First few rows:")
print(df.head())
print()

# View data information
print("Data information:")
print(df.info())
print()

# View missing values
print("Missing value statistics:")
print(df.isnull().sum())
print()

# 1. Feature creation
print("1. Feature creation")
print("-" * 50)

# Extract title from name
df['Title'] = df['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
print("Title statistics:")
print(df['Title'].value_counts())
print()

# Combine rare titles
title_replacements = {
    'Mlle': 'Miss', 'Ms': 'Miss', 'Mme': 'Mrs',
    'Dr': 'Rare', 'Rev': 'Rare', 'Col': 'Rare', 'Major': 'Rare',
    'Sir': 'Rare', 'Lady': 'Rare', 'Capt': 'Rare', 'Countess': 'Rare',
    'Don': 'Rare', 'Jonkheer': 'Rare', 'Dona': 'Rare'
}
df['Title'] = df['Title'].replace(title_replacements)
print("Title statistics after combining:")
print(df['Title'].value_counts())
print()

# Extract deck from cabin
df['Deck'] = df['Cabin'].str[0]
print("Deck statistics:")
print(df['Deck'].value_counts())
print()

# Calculate family size
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1
print("Family size statistics:")
print(df['FamilySize'].value_counts())
print()

# Create family category
df['FamilyCategory'] = pd.cut(df['FamilySize'], bins=[0, 1, 4, 11], labels=['Single', 'Small', 'Large'])
print("Family category statistics:")
print(df['FamilyCategory'].value_counts())
print()

# Create age category
df['AgeCategory'] = pd.cut(df['Age'], bins=[0, 12, 18, 60, 100], labels=['Child', 'Teen', 'Adult', 'Elder'])
print("Age category statistics:")
print(df['AgeCategory'].value_counts())
print()

# Create fare category
df['FareCategory'] = pd.qcut(df['Fare'], q=4, labels=['Low', 'Medium', 'High', 'VeryHigh'])
print("Fare category statistics:")
print(df['FareCategory'].value_counts())
print()

# 2. Feature selection
print("2. Feature selection")
print("-" * 50)

# Select features
features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked', 'Title',
            'Deck', 'FamilySize', 'FamilyCategory', 'AgeCategory', 'FareCategory']
X = df[features]
y = df['Survived']

print(f"Number of features: {len(features)}")
print(f"Feature list: {features}")
print()

# 3. Data preprocessing and model training
print("3. Data preprocessing and model training")
print("-" * 50)

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"Training set shape: {X_train.shape}")
print(f"Test set shape: {X_test.shape}")
print()

# Define preprocessing steps
numeric_features = ['Age', 'SibSp', 'Parch', 'Fare', 'FamilySize']
categorical_features = ['Pclass', 'Sex', 'Embarked', 'Title', 'Deck',
                        'FamilyCategory', 'AgeCategory', 'FareCategory']

preprocessor = ColumnTransformer(
    transformers=[
        ('num', Pipeline([
            ('imputer', SimpleImputer(strategy='median')),
            ('scaler', StandardScaler())
        ]), numeric_features),
        ('cat', Pipeline([
            ('imputer', SimpleImputer(strategy='most_frequent')),
            ('encoder', OneHotEncoder(handle_unknown='ignore'))
        ]), categorical_features)
    ]
)

# Create complete pipeline
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('feature_selector', SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=42))),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])

# Train model
pipeline.fit(X_train, y_train)

# Evaluate model
y_pred = pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Model accuracy: {accuracy:.4f}")
print()
print("Classification report:")
print(classification_report(y_test, y_pred))
print()
print("Confusion matrix:")
print(confusion_matrix(y_test, y_pred))
print()

# Cross-validation
cv_scores = cross_val_score(pipeline, X, y, cv=5, scoring='accuracy')
print(f"Cross-validation accuracy: {cv_scores.mean():.4f} (±{cv_scores.std():.4f})")
print()

# Feature importance
# Get preprocessed feature names
onehot_features = pipeline.named_steps['preprocessor'].named_transformers_['cat'] \
    .named_steps['encoder'].get_feature_names_out(categorical_features)
all_features = numeric_features + list(onehot_features)

# Get feature selector
feature_selector = pipeline.named_steps['feature_selector']
selected_features_mask = feature_selector.get_support()
selected_features = [feature for feature, selected in zip(all_features, selected_features_mask) if selected]

# Get classifier (trained on the selected features only)
classifier = pipeline.named_steps['classifier']
importances = classifier.feature_importances_

# Sort feature importance
feature_importance_df = pd.DataFrame({'Feature': selected_features, 'Importance': importances})
feature_importance_df = feature_importance_df.sort_values('Importance', ascending=False)
print("Feature importance ranking:")
print(feature_importance_df)
print()

# Visualize feature importance
plt.figure(figsize=(12, 10))
plt.barh(feature_importance_df['Feature'], feature_importance_df['Importance'])
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.title('Feature Importance')
plt.grid(True)
plt.tight_layout()
plt.show()

Interactive Exercises

Exercise 1: Feature Creation

Use scikit-learn's california_housing dataset to create new features and evaluate their impact on model performance.

  1. Load the california_housing dataset
  2. Create the following new features:
    • Rooms per population ratio
    • Bedrooms per room ratio
    • Population density (population per household)
    • Income category (using quantiles)
    • Age category (using quantiles)
  3. Train a linear regression model, comparing performance with original features and with added features
  4. Evaluate model performance using cross-validation
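
A possible starting point is sketched below; the specific ratio definitions and the use of pd.qcut for the income category are just one interpretation of the exercise, and fetch_california_housing is assumed to be available from scikit-learn:

import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Load the dataset as a DataFrame
housing = fetch_california_housing(as_frame=True)
df = housing.frame.copy()

# Example engineered features (one possible definition of each)
df['rooms_per_person'] = df['AveRooms'] / df['AveOccup']
df['bedrooms_per_room'] = df['AveBedrms'] / df['AveRooms']
df['income_category'] = pd.qcut(df['MedInc'], q=4, labels=False)

# Compare original vs. engineered feature sets with 5-fold cross-validation
baseline_cols = list(housing.feature_names)
engineered_cols = baseline_cols + ['rooms_per_person', 'bedrooms_per_room', 'income_category']

model = LinearRegression()
for name, cols in [('baseline', baseline_cols), ('engineered', engineered_cols)]:
    scores = cross_val_score(model, df[cols], df['MedHouseVal'], cv=5, scoring='r2')
    print(f"{name}: R^2 = {scores.mean():.4f} (+/- {scores.std():.4f})")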

Exercise 2: Feature Selection

Use scikit-learn's breast_cancer dataset to apply different feature selection methods and compare results.

  1. Load the breast_cancer dataset
  2. Apply the following feature selection methods:
    • Variance threshold
    • Pearson correlation coefficient
    • Mutual information
    • Recursive Feature Elimination (RFE)
    • Random Forest feature importance
  3. For each method, select the top 10 most important features
  4. Train a logistic regression model and compare performance with different feature subsets
  5. Visualize feature importance rankings
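
A sketch of how the comparison could be wired up; k=10, the logistic-regression settings, and the particular score functions are suggestions rather than the only valid choices:

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif, RFE
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

data = load_breast_cancer()
X, y, feature_names = data.data, data.target, data.feature_names
X_scaled = StandardScaler().fit_transform(X)

# Several ways to pick the 10 "best" features
selectors = {
    'f_classif': SelectKBest(f_classif, k=10),
    'mutual_info': SelectKBest(mutual_info_classif, k=10),
    'rfe': RFE(LogisticRegression(max_iter=5000), n_features_to_select=10),
}

for name, selector in selectors.items():
    X_sel = selector.fit_transform(X_scaled, y)
    clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))
    score = cross_val_score(clf, X_sel, y, cv=5, scoring='accuracy').mean()
    print(f"{name}: accuracy={score:.4f}")
    print(f"  selected: {list(feature_names[selector.get_support()])}")

# Random forest importance ranking as a fourth method
rf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X, y)
top10 = np.argsort(rf.feature_importances_)[::-1][:10]
print("random_forest top 10:", list(feature_names[top10]))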

Exercise 3: Feature Extraction

Use scikit-learn's digits dataset to apply different feature extraction methods and compare results.

  1. Load the digits dataset
  2. Apply the following feature extraction methods:
    • PCA (retaining 95% variance)
    • t-SNE (n_components=2)
    • LDA
  3. For each method, train a classification model and evaluate performance
  4. Visualize the dimensionality reduction results
  5. Compare the performance and visualization quality of the different methods
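
One way to structure the performance comparison is sketched below; note that t-SNE has no transform method for unseen data, so it is normally used only for the visualization step rather than for training a classifier:

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

X, y = load_digits(return_X_y=True)
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Reduce dimensionality with PCA (95% variance) and LDA, then compare a classifier
for name, reducer in [('PCA (95% variance)', PCA(n_components=0.95)),
                      ('LDA', LinearDiscriminantAnalysis())]:
    # LDA is supervised, so it needs the labels when fitting
    if isinstance(reducer, LinearDiscriminantAnalysis):
        Z_train = reducer.fit_transform(X_train, y_train)
    else:
        Z_train = reducer.fit_transform(X_train)
    Z_test = reducer.transform(X_test)
    clf = LogisticRegression(max_iter=1000).fit(Z_train, y_train)
    acc = accuracy_score(y_test, clf.predict(Z_test))
    print(f"{name}: {Z_train.shape[1]} components, accuracy={acc:.4f}")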

Recommended Tutorials

Model Evaluation

Learn how to evaluate machine learning model performance, choose appropriate evaluation metrics, and avoid overfitting and underfitting


Model Deployment

Learn how to deploy trained machine learning models to production environments


Time Series Analysis

Explore time series data analysis and forecasting methods
