Unsupervised Learning Overview

Unsupervised learning is an important branch of machine learning that deals with unlabeled datasets. Unlike supervised learning, unsupervised learning algorithms don't require predefined target labels, but instead discover patterns and relationships by analyzing the intrinsic structure of the data.

Unsupervised learning mainly includes the following types of algorithms:

  • Clustering: Grouping similar data points into the same cluster
  • Dimensionality Reduction: Reducing the feature dimensions of data while preserving important information
  • Anomaly Detection: Identifying anomalies or outliers in data
  • Association Rule Learning: Discovering association relationships between variables in data
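Of these four categories, association rule learning is the only one without a worked example below, so here is a minimal pure-Python sketch of its two core quantities, support and confidence, on a hypothetical basket dataset (libraries such as mlxtend implement the full Apriori algorithm on top of these):

```python
# Minimal association-rule quantities on hypothetical shopping baskets.
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"milk", "bread", "butter"},
]

def support(itemset, transactions):
    # Fraction of transactions that contain every item in the itemset.
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    # Of the transactions containing the antecedent, the fraction
    # that also contain the consequent.
    return (support(set(antecedent) | set(consequent), transactions)
            / support(antecedent, transactions))

print(support({"milk", "bread"}, transactions))       # 0.6
print(confidence({"milk"}, {"bread"}, transactions))  # ≈ 0.75
```

A rule like {milk} → {bread} is considered interesting when both its support and its confidence exceed chosen thresholds.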

Clustering Algorithms

1. K-means Clustering

K-means is a distance-based clustering algorithm that divides data points into K clusters, where each data point belongs to the cluster with the nearest cluster center.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
import numpy as np

# Load dataset
iris = load_iris()
X = iris.data[:, :2]  # Use only the first two features

# Try different K values
inertia = []
k_values = range(1, 11)
for k in k_values:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X)
    inertia.append(kmeans.inertia_)

# Plot elbow graph
plt.figure(figsize=(10, 6))
plt.plot(k_values, inertia, 'bo-')
plt.xlabel('Number of clusters (K)')
plt.ylabel('Inertia')
plt.title('Elbow Method for Optimal K')
plt.grid(True)
plt.show()

# Use optimal K value
optimal_k = 3
kmeans = KMeans(n_clusters=optimal_k, random_state=42)
y_pred = kmeans.fit_predict(X)

# Visualize clustering results
plt.figure(figsize=(10, 6))
colors = ['red', 'green', 'blue']
for i in range(optimal_k):
    plt.scatter(X[y_pred == i, 0], X[y_pred == i, 1],
                c=colors[i], label=f'Cluster {i+1}')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            s=200, c='yellow', label='Centroids')
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.title('K-means Clustering on Iris Dataset')
plt.legend()
plt.grid(True)
plt.show()
```
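The elbow method above is one way to pick K; the silhouette score is a complementary, quantitative check. It averages (b - a) / max(a, b) over all samples, where a is the mean distance to points in the same cluster and b is the mean distance to points in the nearest other cluster, giving a value in [-1, 1] (higher is better). A short sketch on the same first two iris features:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score

iris = load_iris()
X = iris.data[:, :2]

# Compare a few candidate K values by silhouette score.
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    print(f"K={k}: silhouette = {silhouette_score(X, labels):.3f}")
```

Note that the silhouette score is only defined for K >= 2, so it cannot replace the elbow plot's K = 1 baseline.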

2. DBSCAN Clustering

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm that can discover arbitrarily shaped clusters and identify noise points.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
import matplotlib.pyplot as plt

# Generate synthetic data
X, y = make_moons(n_samples=200, noise=0.05, random_state=42)

# Use DBSCAN
dbscan = DBSCAN(eps=0.3, min_samples=5)
y_pred = dbscan.fit_predict(X)

# Visualize clustering results (noise points are labeled -1)
plt.figure(figsize=(10, 6))
plt.scatter(X[:, 0], X[:, 1], c=y_pred, cmap='viridis')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('DBSCAN Clustering on Moons Dataset')
plt.grid(True)
plt.show()
```

3. Hierarchical Clustering

Hierarchical clustering is a clustering method that builds a nested hierarchy of clusters, which can be agglomerative (bottom-up) or divisive (top-down).

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# Load dataset
iris = load_iris()
X = iris.data[:, :2]  # Use only the first two features

# Generate linkage matrix for hierarchical clustering
linked = linkage(X, 'ward')

# Plot dendrogram
plt.figure(figsize=(15, 7))
dendrogram(linked, orientation='top', distance_sort='descending',
           show_leaf_counts=True)
plt.title('Dendrogram for Iris Dataset')
plt.xlabel('Sample Index')
plt.ylabel('Euclidean Distance')
plt.show()

# Use AgglomerativeClustering
n_clusters = 3
agg_clustering = AgglomerativeClustering(n_clusters=n_clusters, linkage='ward')
y_pred = agg_clustering.fit_predict(X)

# Visualize clustering results
plt.figure(figsize=(10, 6))
plt.scatter(X[:, 0], X[:, 1], c=y_pred, cmap='viridis')
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.title('Agglomerative Clustering on Iris Dataset')
plt.grid(True)
plt.show()
```

4. Gaussian Mixture Model

Gaussian Mixture Model (GMM) is a probabilistic clustering algorithm that assumes the data is generated from a mixture of several Gaussian distributions. Unlike K-means, it produces soft assignments: each point receives a probability of belonging to each component.

```python
from sklearn.mixture import GaussianMixture
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
import numpy as np

# Load dataset
iris = load_iris()
X = iris.data[:, :2]  # Use only the first two features

# Use GMM
n_components = 3
gmm = GaussianMixture(n_components=n_components, random_state=42)
y_pred = gmm.fit_predict(X)

# Visualize clustering results
plt.figure(figsize=(10, 6))
plt.scatter(X[:, 0], X[:, 1], c=y_pred, cmap='viridis')

# Plot Gaussian distribution contours (negative log-likelihood)
x = np.linspace(X[:, 0].min(), X[:, 0].max(), 100)
y = np.linspace(X[:, 1].min(), X[:, 1].max(), 100)
x, y = np.meshgrid(x, y)
XX = np.array([x.ravel(), y.ravel()]).T
Z = -gmm.score_samples(XX)
Z = Z.reshape(x.shape)
plt.contour(x, y, Z, levels=np.logspace(0, 2, 10), alpha=0.3)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.title('Gaussian Mixture Model on Iris Dataset')
plt.grid(True)
plt.show()
```

Dimensionality Reduction Algorithms

1. Principal Component Analysis (PCA)

Principal Component Analysis is a linear dimensionality reduction technique that projects the original feature space onto a set of orthogonal, uncorrelated principal components, ordered by the amount of variance they explain, so that the leading components preserve most of the variation in the data.

```python
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt

# Load dataset
iris = load_iris()
X = iris.data
y = iris.target

# Use PCA to reduce data to 2 dimensions
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

# Visualize dimensionality reduction results
plt.figure(figsize=(10, 6))
target_names = iris.target_names
colors = ['red', 'green', 'blue']
for i, target_name in enumerate(target_names):
    plt.scatter(X_pca[y == i, 0], X_pca[y == i, 1],
                c=colors[i], label=target_name)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA on Iris Dataset')
plt.legend()
plt.grid(True)
plt.show()

# View explained variance ratio
print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Cumulative explained variance:", sum(pca.explained_variance_ratio_))
```
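Rather than fixing the number of components in advance, scikit-learn's PCA also accepts a float between 0 and 1 as n_components, keeping just enough components to reach that fraction of explained variance. A short sketch on the digits data:

```python
from sklearn.decomposition import PCA
from sklearn.datasets import load_digits

X = load_digits().data  # 64 features

# A float in (0, 1) keeps the smallest number of components whose
# cumulative explained variance reaches that fraction.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
print(f"{X.shape[1]} features -> {pca.n_components_} components "
      f"({pca.explained_variance_ratio_.sum():.1%} variance retained)")
```

This is a convenient way to trade dimensionality against information loss without hand-tuning the component count.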

2. t-SNE

t-SNE (t-Distributed Stochastic Neighbor Embedding) is a nonlinear dimensionality reduction technique particularly suitable for visualizing high-dimensional data. It builds low-dimensional representations by preserving local similarities between data points.

```python
from sklearn.manifold import TSNE
from sklearn.datasets import load_digits
import matplotlib.pyplot as plt

# Load dataset
digits = load_digits()
X = digits.data
y = digits.target

# Use t-SNE to reduce data to 2 dimensions
tsne = TSNE(n_components=2, random_state=42)
X_tsne = tsne.fit_transform(X)

# Visualize dimensionality reduction results
plt.figure(figsize=(12, 10))
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap='tab10', s=50, alpha=0.8)
plt.colorbar()
plt.title('t-SNE Visualization of Digits Dataset')
plt.grid(True)
plt.show()
```

3. Linear Discriminant Analysis (LDA)

Linear Discriminant Analysis is a supervised dimensionality reduction technique: using class labels, it finds projection directions that maximize between-class variance while minimizing within-class variance. It is included here for contrast with the unsupervised methods above.

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt

# Load dataset
iris = load_iris()
X = iris.data
y = iris.target

# Use LDA to reduce data to 2 dimensions (requires class labels)
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)

# Visualize dimensionality reduction results
plt.figure(figsize=(10, 6))
target_names = iris.target_names
colors = ['red', 'green', 'blue']
for i, target_name in enumerate(target_names):
    plt.scatter(X_lda[y == i, 0], X_lda[y == i, 1],
                c=colors[i], label=target_name)
plt.xlabel('LD 1')
plt.ylabel('LD 2')
plt.title('LDA on Iris Dataset')
plt.legend()
plt.grid(True)
plt.show()

# View explained variance ratio
print("Explained variance ratio:", lda.explained_variance_ratio_)
print("Cumulative explained variance:", sum(lda.explained_variance_ratio_))
```

Anomaly Detection

1. Isolation Forest

Isolation Forest is a tree-based anomaly detection algorithm that isolates data points by randomly selecting features and split values. Because anomalies are few and different, they tend to be isolated in fewer splits; shorter average path lengths therefore translate into higher anomaly scores.

```python
from sklearn.ensemble import IsolationForest
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
import numpy as np

# Generate normal data
X_normal, _ = make_blobs(n_samples=200, centers=1, cluster_std=0.3,
                         random_state=42)

# Add anomalies
X_outliers = np.random.uniform(low=-4, high=4, size=(20, 2))
X = np.vstack([X_normal, X_outliers])
y = np.zeros(len(X), dtype=int)
y[-20:] = 1  # Mark anomalies

# Use Isolation Forest
iso_forest = IsolationForest(contamination=0.1, random_state=42)
y_pred = iso_forest.fit_predict(X)

# Convert prediction results to 0 (normal) and 1 (anomaly)
y_pred = (y_pred == -1).astype(int)

# Visualize results
plt.figure(figsize=(10, 6))
plt.scatter(X[y_pred == 0, 0], X[y_pred == 0, 1], c='blue', label='Normal')
plt.scatter(X[y_pred == 1, 0], X[y_pred == 1, 1], c='red', label='Anomaly')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Isolation Forest Anomaly Detection')
plt.legend()
plt.grid(True)
plt.show()
```

2. Local Outlier Factor (LOF)

Local Outlier Factor is a density-based anomaly detection algorithm. It compares the local density of each data point to the local densities of its neighbors; points whose density is substantially lower than that of their neighbors receive a high LOF score and are flagged as anomalies.

```python
from sklearn.neighbors import LocalOutlierFactor
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
import numpy as np

# Generate normal data
X_normal, _ = make_blobs(n_samples=200, centers=1, cluster_std=0.3,
                         random_state=42)

# Add anomalies
X_outliers = np.random.uniform(low=-4, high=4, size=(20, 2))
X = np.vstack([X_normal, X_outliers])
y = np.zeros(len(X), dtype=int)
y[-20:] = 1  # Mark anomalies

# Use LOF
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.1)
y_pred = lof.fit_predict(X)

# Convert prediction results to 0 (normal) and 1 (anomaly)
y_pred = (y_pred == -1).astype(int)

# Visualize results
plt.figure(figsize=(10, 6))
plt.scatter(X[y_pred == 0, 0], X[y_pred == 0, 1], c='blue', label='Normal')
plt.scatter(X[y_pred == 1, 0], X[y_pred == 1, 1], c='red', label='Anomaly')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Local Outlier Factor Anomaly Detection')
plt.legend()
plt.grid(True)
plt.show()
```

Practice Case: Customer Segmentation

In this practice case, we will use clustering algorithms to segment customers, helping businesses better understand customer groups and develop targeted marketing strategies.

```python
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
import numpy as np

# Generate synthetic customer data
np.random.seed(42)

# Generate 4 customer groups
X1, _ = make_blobs(n_samples=100,
                   centers=[[5, 5], [10, 10], [15, 5], [10, 15]],
                   cluster_std=1.5, random_state=42)

# Add some feature noise
noise = np.random.normal(0, 0.5, X1.shape)
X = X1 + noise

# Data standardization
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Use K-means
kmeans = KMeans(n_clusters=4, random_state=42)
y_kmeans = kmeans.fit_predict(X_scaled)

# Use hierarchical clustering
agg = AgglomerativeClustering(n_clusters=4, linkage='ward')
y_agg = agg.fit_predict(X_scaled)

# Use DBSCAN
dbscan = DBSCAN(eps=0.5, min_samples=5)
y_dbscan = dbscan.fit_predict(X_scaled)

# Visualize results side by side
fig, axes = plt.subplots(1, 3, figsize=(20, 6))
results = [(y_kmeans, 'K-means Clustering'),
           (y_agg, 'Agglomerative Clustering'),
           (y_dbscan, 'DBSCAN Clustering')]
for ax, (labels, title) in zip(axes, results):
    ax.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
    ax.set_title(title)
    ax.set_xlabel('Feature 1')
    ax.set_ylabel('Feature 2')
    ax.grid(True)
plt.tight_layout()
plt.show()
```

Interactive Exercises

Exercise 1: Clustering Algorithm Comparison

Use scikit-learn's make_classification function to generate a dataset and compare the performance of K-means, DBSCAN, and hierarchical clustering algorithms.

  1. Generate a dataset with 4 clusters
  2. Standardize the data
  3. Use the three clustering algorithms separately
  4. Visualize the clustering results
  5. Analyze the advantages and disadvantages of each algorithm

Exercise 2: Dimensionality Reduction Visualization

Use the digits dataset to compare the visualization effects of PCA and t-SNE dimensionality reduction techniques.

  1. Load the digits dataset
  2. Use PCA and t-SNE separately to reduce the data to 2 dimensions
  3. Visualize the dimensionality reduction results, using different colors to mark different digits
  4. Compare the visualization effects of the two methods
  5. View PCA's explained variance ratio

Exercise 3: Anomaly Detection

Use Isolation Forest and Local Outlier Factor algorithms to detect anomalies in credit card fraud data.

  1. Generate synthetic data containing normal transactions and fraudulent transactions
  2. Use the two anomaly detection algorithms
  3. Evaluate algorithm performance (precision, recall, F1 score)
  4. Visualize anomaly detection results
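As a starting point for step 3, the sketch below reuses the synthetic normal/anomaly setup from the Isolation Forest example above (not real credit card data, which is an assumption of this sketch) and scores the predictions with scikit-learn's metrics:

```python
from sklearn.ensemble import IsolationForest
from sklearn.datasets import make_blobs
from sklearn.metrics import precision_score, recall_score, f1_score
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in for transaction data: one dense normal cluster
# plus 20 uniformly scattered "fraudulent" points.
X_normal, _ = make_blobs(n_samples=200, centers=1, cluster_std=0.3,
                         random_state=42)
X_outliers = rng.uniform(low=-4, high=4, size=(20, 2))
X = np.vstack([X_normal, X_outliers])
y_true = np.r_[np.zeros(200, dtype=int), np.ones(20, dtype=int)]

# fit_predict returns -1 for anomalies; map to 1 (anomaly) / 0 (normal).
iso = IsolationForest(contamination=0.1, random_state=42)
y_pred = (iso.fit_predict(X) == -1).astype(int)

print("Precision:", precision_score(y_true, y_pred))
print("Recall:   ", recall_score(y_true, y_pred))
print("F1 score: ", f1_score(y_true, y_pred))
```

The same evaluation code works unchanged for LocalOutlierFactor, since its fit_predict also returns -1 for anomalies.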

Recommended Tutorials

Supervised Learning

Deep dive into classification and regression algorithms, and master core techniques for model training and evaluation


Deep Learning Basics

Explore neural networks and deep learning fundamental concepts and applications


Model Evaluation

Learn model evaluation metrics and model selection methods to ensure model reliability and generalization ability
