Unsupervised Learning Overview
Unsupervised learning is a major branch of machine learning that works with unlabeled data. Unlike supervised learning, it requires no predefined target labels; its algorithms instead discover patterns and relationships by analyzing the intrinsic structure of the data.
Unsupervised learning mainly includes the following types of algorithms (a minimal label-free fitting example follows the list):
- Clustering: Grouping similar data points into the same cluster
- Dimensionality Reduction: Reducing the feature dimensions of data while preserving important information
- Anomaly Detection: Identifying anomalies or outliers in data
- Association Rule Learning: Discovering association relationships between variables in data
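In scikit-learn this distinction shows up directly in the API: unsupervised estimators are fit on the feature matrix X alone, with no target vector y. A minimal sketch (the toy data below is made up purely for illustration):
from sklearn.cluster import KMeans
import numpy as np
# Unlabeled observations: only features, no targets
X = np.array([[1.0, 2.0], [1.5, 1.8], [8.0, 8.0], [8.5, 7.5]])
# fit_predict uses only X; no y is passed anywhere
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # cluster assignments discovered from the data's structure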
Clustering Algorithms
1. K-means Clustering
K-means is a distance-based clustering algorithm that partitions the data into K clusters, assigning each point to the cluster whose center (centroid) is nearest. A common way to choose K is the elbow method: plot the inertia (within-cluster sum of squared distances) for a range of K values and pick the K at which the curve bends.
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
import numpy as np
# Load dataset
iris = load_iris()
X = iris.data[:, :2] # Use only the first two features
# Try different K values
inertia = []
k_values = range(1, 11)
for k in k_values:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X)
    inertia.append(kmeans.inertia_)
# Plot elbow graph
plt.figure(figsize=(10, 6))
plt.plot(k_values, inertia, 'bo-')
plt.xlabel('Number of clusters (K)')
plt.ylabel('Inertia')
plt.title('Elbow Method for Optimal K')
plt.grid(True)
plt.show()
# Use the K suggested by the elbow plot (K=3, which also matches the three iris species)
optimal_k = 3
kmeans = KMeans(n_clusters=optimal_k, random_state=42)
y_pred = kmeans.fit_predict(X)
# Visualize clustering results
plt.figure(figsize=(10, 6))
colors = ['red', 'green', 'blue']
for i in range(optimal_k):
    plt.scatter(X[y_pred == i, 0], X[y_pred == i, 1], c=colors[i], label=f'Cluster {i+1}')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=200, c='yellow', label='Centroids')
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.title('K-means Clustering on Iris Dataset')
plt.legend()
plt.grid(True)
plt.show()
2. DBSCAN Clustering
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based algorithm that can discover clusters of arbitrary shape. It grows clusters from densely packed regions (controlled by the eps and min_samples parameters) and labels points in low-density regions as noise.
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
import matplotlib.pyplot as plt
# Generate synthetic data
X, y = make_moons(n_samples=200, noise=0.05, random_state=42)
# Use DBSCAN
dbscan = DBSCAN(eps=0.3, min_samples=5)
y_pred = dbscan.fit_predict(X)
# Visualize clustering results
plt.figure(figsize=(10, 6))
plt.scatter(X[:, 0], X[:, 1], c=y_pred, cmap='viridis')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('DBSCAN Clustering on Moons Dataset')
plt.grid(True)
plt.show()
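Points that DBSCAN cannot assign to any dense region receive the label -1 and are treated as noise. A quick check of the result above, assuming the variables from the previous block are still in scope:
import numpy as np
# -1 is DBSCAN's noise label; exclude it when counting clusters
n_clusters = len(set(y_pred)) - (1 if -1 in y_pred else 0)
n_noise = int(np.sum(y_pred == -1))
print(f"Estimated clusters: {n_clusters}, noise points: {n_noise}")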
3. Hierarchical Clustering
Hierarchical clustering is a clustering method that builds a nested hierarchy of clusters, which can be agglomerative (bottom-up) or divisive (top-down).
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
# Load dataset
iris = load_iris()
X = iris.data[:, :2] # Use only the first two features
# Generate linkage matrix for hierarchical clustering
linked = linkage(X, 'ward')
# Plot dendrogram
plt.figure(figsize=(15, 7))
dendrogram(linked, orientation='top', distance_sort='descending', show_leaf_counts=True)
plt.title('Dendrogram for Iris Dataset')
plt.xlabel('Sample Index')
plt.ylabel('Euclidean Distance')
plt.show()
# Use AgglomerativeClustering
n_clusters = 3
agg_clustering = AgglomerativeClustering(n_clusters=n_clusters, linkage='ward')
y_pred = agg_clustering.fit_predict(X)
# Visualize clustering results
plt.figure(figsize=(10, 6))
plt.scatter(X[:, 0], X[:, 1], c=y_pred, cmap='viridis')
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.title('Agglomerative Clustering on Iris Dataset')
plt.grid(True)
plt.show()
4. Gaussian Mixture Model
A Gaussian Mixture Model (GMM) is a probabilistic clustering algorithm that assumes the data is generated from a mixture of several Gaussian distributions. Each point is assigned a probability of belonging to each component (soft clustering), and the hard label is the most probable component.
from sklearn.mixture import GaussianMixture
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
import numpy as np
# Load dataset
iris = load_iris()
X = iris.data[:, :2] # Use only the first two features
# Use GMM
n_components = 3
gmm = GaussianMixture(n_components=n_components, random_state=42)
y_pred = gmm.fit_predict(X)
# Visualize clustering results
plt.figure(figsize=(10, 6))
plt.scatter(X[:, 0], X[:, 1], c=y_pred, cmap='viridis')
# Plot Gaussian distribution contours
x = np.linspace(X[:, 0].min(), X[:, 0].max(), 100)
y = np.linspace(X[:, 1].min(), X[:, 1].max(), 100)
x, y = np.meshgrid(x, y)
XX = np.array([x.ravel(), y.ravel()]).T
Z = -gmm.score_samples(XX)
Z = Z.reshape(x.shape)
plt.contour(x, y, Z, levels=np.logspace(0, 2, 10), alpha=0.3)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.title('Gaussian Mixture Model on Iris Dataset')
plt.grid(True)
plt.show()
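Because GMM clustering is probabilistic, the model can also report soft assignments rather than only the hard labels plotted above. A brief look, assuming gmm and X from the block above:
# One row per sample, one column per Gaussian component; rows sum to 1
probs = gmm.predict_proba(X)
print(probs[:5].round(3))  # the hard labels above are simply the argmax of each row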
Dimensionality Reduction Algorithms
1. Principal Component Analysis (PCA)
Principal Component Analysis is a linear dimensionality reduction technique that projects the original feature space onto a set of orthogonal, uncorrelated principal components ordered by the amount of variance they explain, preserving as much of the data's variation as possible.
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
# Load dataset
iris = load_iris()
X = iris.data
y = iris.target
# Use PCA to reduce data to 2 dimensions
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
# Visualize dimensionality reduction results
plt.figure(figsize=(10, 6))
target_names = iris.target_names
colors = ['red', 'green', 'blue']
for i, target_name in enumerate(target_names):
    plt.scatter(X_pca[y == i, 0], X_pca[y == i, 1], c=colors[i], label=target_name)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA on Iris Dataset')
plt.legend()
plt.grid(True)
plt.show()
# View explained variance ratio
print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Cumulative explained variance:", sum(pca.explained_variance_ratio_))
2. t-SNE
t-SNE (t-Distributed Stochastic Neighbor Embedding) is a nonlinear dimensionality reduction technique particularly suitable for visualizing high-dimensional data. It builds low-dimensional representations by preserving local similarities between data points.
from sklearn.manifold import TSNE
from sklearn.datasets import load_digits
import matplotlib.pyplot as plt
# Load dataset
digits = load_digits()
X = digits.data
y = digits.target
# Use t-SNE to reduce data to 2 dimensions
tsne = TSNE(n_components=2, random_state=42)
X_tsne = tsne.fit_transform(X)
# Visualize dimensionality reduction results
plt.figure(figsize=(12, 10))
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap='tab10', s=50, alpha=0.8)
plt.colorbar()
plt.title('t-SNE Visualization of Digits Dataset')
plt.grid(True)
plt.show()
3. Linear Discriminant Analysis (LDA)
Linear Discriminant Analysis is a supervised dimensionality reduction technique that finds projections maximizing between-class variance while minimizing within-class variance. Because it uses class labels, it is not strictly unsupervised; it is included here for comparison with PCA.
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
# Load dataset
iris = load_iris()
X = iris.data
y = iris.target
# Use LDA to reduce data to 2 dimensions
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)
# Visualize dimensionality reduction results
plt.figure(figsize=(10, 6))
target_names = iris.target_names
colors = ['red', 'green', 'blue']
for i, target_name in enumerate(target_names):
    plt.scatter(X_lda[y == i, 0], X_lda[y == i, 1], c=colors[i], label=target_name)
plt.xlabel('LD 1')
plt.ylabel('LD 2')
plt.title('LDA on Iris Dataset')
plt.legend()
plt.grid(True)
plt.show()
# View explained variance ratio
print("Explained variance ratio:", lda.explained_variance_ratio_)
print("Cumulative explained variance:", sum(lda.explained_variance_ratio_))
Anomaly Detection
1. Isolation Forest
Isolation Forest is a tree-based anomaly detection algorithm that isolates points by repeatedly choosing a random feature and a random split value. Because anomalies are few and different, they tend to be isolated after fewer splits than normal points, and this short average path length is what flags them.
from sklearn.ensemble import IsolationForest
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
import numpy as np
# Generate normal data
X_normal, _ = make_blobs(n_samples=200, centers=1, cluster_std=0.3, random_state=42)
# Add anomalies
X_outliers = np.random.uniform(low=-4, high=4, size=(20, 2))
X = np.vstack([X_normal, X_outliers])
y = np.zeros(len(X), dtype=int)
y[-20:] = 1 # Mark anomalies
# Use Isolation Forest
iso_forest = IsolationForest(contamination=0.1, random_state=42)
y_pred = iso_forest.fit_predict(X)
# Convert prediction results to 0 (normal) and 1 (anomaly)
y_pred = (y_pred == -1).astype(int)
# Visualize results
plt.figure(figsize=(10, 6))
plt.scatter(X[y_pred == 0, 0], X[y_pred == 0, 1], c='blue', label='Normal')
plt.scatter(X[y_pred == 1, 0], X[y_pred == 1, 1], c='red', label='Anomaly')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Isolation Forest Anomaly Detection')
plt.legend()
plt.grid(True)
plt.show()
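Since the ground-truth labels y were kept when the data was built, the detector can also be scored against them. A brief check, assuming y and y_pred from the block above:
from sklearn.metrics import classification_report
# y and y_pred are both coded as 0 = normal, 1 = anomaly at this point
print(classification_report(y, y_pred, target_names=['normal', 'anomaly']))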
2. Local Outlier Factor (LOF)
Local Outlier Factor is a density-based anomaly detection algorithm: it compares each point's local density to the local densities of its neighbors, and points whose density is substantially lower than that of their neighbors are flagged as outliers.
from sklearn.neighbors import LocalOutlierFactor
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
import numpy as np
# Generate normal data
X_normal, _ = make_blobs(n_samples=200, centers=1, cluster_std=0.3, random_state=42)
# Add anomalies
X_outliers = np.random.uniform(low=-4, high=4, size=(20, 2))
X = np.vstack([X_normal, X_outliers])
y = np.zeros(len(X), dtype=int)
y[-20:] = 1 # Mark anomalies
# Use LOF
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.1)
y_pred = lof.fit_predict(X)
# Convert prediction results to 0 (normal) and 1 (anomaly)
y_pred = (y_pred == -1).astype(int)
# Visualize results
plt.figure(figsize=(10, 6))
plt.scatter(X[y_pred == 0, 0], X[y_pred == 0, 1], c='blue', label='Normal')
plt.scatter(X[y_pred == 1, 0], X[y_pred == 1, 1], c='red', label='Anomaly')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Local Outlier Factor Anomaly Detection')
plt.legend()
plt.grid(True)
plt.show()
Practice Case: Customer Segmentation
In this practice case, we will use clustering algorithms to segment customers, helping businesses better understand customer groups and develop targeted marketing strategies.
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
import numpy as np
# Generate synthetic customer data
np.random.seed(42)
# Generate 4 customer groups
X1, _ = make_blobs(n_samples=100, centers=[[5, 5], [10, 10], [15, 5], [10, 15]], cluster_std=1.5, random_state=42)
# Add some feature noise
noise = np.random.normal(0, 0.5, X1.shape)
X = X1 + noise
# Data standardization
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Use K-means
kmeans = KMeans(n_clusters=4, random_state=42)
y_kmeans = kmeans.fit_predict(X_scaled)
# Use hierarchical clustering
agg = AgglomerativeClustering(n_clusters=4, linkage='ward')
y_agg = agg.fit_predict(X_scaled)
# Use DBSCAN
dbscan = DBSCAN(eps=0.5, min_samples=5)
y_dbscan = dbscan.fit_predict(X_scaled)
# Visualize results
fig, axes = plt.subplots(1, 3, figsize=(20, 6))
axes[0].scatter(X[:, 0], X[:, 1], c=y_kmeans, cmap='viridis')
axes[0].set_title('K-means Clustering')
axes[0].set_xlabel('Feature 1')
axes[0].set_ylabel('Feature 2')
axes[0].grid(True)
axes[1].scatter(X[:, 0], X[:, 1], c=y_agg, cmap='viridis')
axes[1].set_title('Agglomerative Clustering')
axes[1].set_xlabel('Feature 1')
axes[1].set_ylabel('Feature 2')
axes[1].grid(True)
axes[2].scatter(X[:, 0], X[:, 1], c=y_dbscan, cmap='viridis')
axes[2].set_title('DBSCAN Clustering')
axes[2].set_xlabel('Feature 1')
axes[2].set_ylabel('Feature 2')
axes[2].grid(True)
plt.tight_layout()
plt.show()
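Because the clustering was run on standardized features, the K-means centroids live in the scaled space. Mapping them back to the original units makes the segments easier to interpret; a short sketch assuming the scaler and kmeans objects from the block above:
# Transform the centroids back into the original (unscaled) feature units
centers_original = scaler.inverse_transform(kmeans.cluster_centers_)
print("Cluster centers in original feature units:")
print(centers_original.round(2))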
Interactive Exercises
Exercise 1: Clustering Algorithm Comparison
Use scikit-learn's make_classification function to generate a dataset and compare the behavior of K-means, DBSCAN, and hierarchical clustering. A minimal starting sketch follows the steps below.
- Generate a dataset with 4 clusters
- Standardize the data
- Use the three clustering algorithms separately
- Visualize the clustering results
- Analyze the advantages and disadvantages of each algorithm
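One possible starting point; all parameters below are illustrative rather than tuned, and two informative features are used so the result is easy to plot:
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
import matplotlib.pyplot as plt
# Four groups in two informative features (n_classes * n_clusters_per_class must be <= 2**n_informative)
X, y = make_classification(n_samples=400, n_features=2, n_informative=2, n_redundant=0,
                           n_classes=4, n_clusters_per_class=1, class_sep=2.0, random_state=42)
X_scaled = StandardScaler().fit_transform(X)
results = {
    'K-means': KMeans(n_clusters=4, random_state=42).fit_predict(X_scaled),
    'DBSCAN': DBSCAN(eps=0.3, min_samples=5).fit_predict(X_scaled),
    'Agglomerative': AgglomerativeClustering(n_clusters=4).fit_predict(X_scaled),
}
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
for ax, (name, labels) in zip(axes, results.items()):
    ax.scatter(X_scaled[:, 0], X_scaled[:, 1], c=labels, cmap='viridis', s=20)
    ax.set_title(name)
plt.show()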
Exercise 2: Dimensionality Reduction Visualization
Use the digits dataset to compare the visualizations produced by PCA and t-SNE. A starting sketch follows the steps below.
- Load the digits dataset
- Use PCA and t-SNE separately to reduce the data to 2 dimensions
- Visualize the dimensionality reduction results, using different colors to mark different digits
- Compare the visualization effects of the two methods
- View PCA's explained variance ratio
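A starting sketch; t-SNE on the full digits set can take a little while, and its parameters (e.g., perplexity) are left at scikit-learn defaults here:
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
digits = load_digits()
X, y = digits.data, digits.target
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
X_tsne = TSNE(n_components=2, random_state=42).fit_transform(X)
fig, axes = plt.subplots(1, 2, figsize=(16, 7))
for ax, (name, X_2d) in zip(axes, [('PCA', X_pca), ('t-SNE', X_tsne)]):
    sc = ax.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap='tab10', s=15)
    ax.set_title(name)
fig.colorbar(sc, ax=axes.ravel().tolist())
plt.show()
print("PCA explained variance ratio:", pca.explained_variance_ratio_)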
Exercise 3: Anomaly Detection
Use the Isolation Forest and Local Outlier Factor algorithms to detect anomalies in synthetic credit-card-fraud-style data. A starting sketch follows the steps below.
- Generate synthetic data containing normal transactions and fraudulent transactions
- Use the two anomaly detection algorithms
- Evaluate algorithm performance (precision, recall, F1 score; accuracy alone can be misleading on imbalanced data)
- Visualize anomaly detection results
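A starting sketch; the "fraudulent" points here are simply uniform noise scattered around one dense blob of "normal" transactions, standing in for real fraud data:
from sklearn.datasets import make_blobs
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.metrics import classification_report
import numpy as np
rng = np.random.RandomState(42)
X_normal, _ = make_blobs(n_samples=950, centers=1, cluster_std=1.0, random_state=42)
X_fraud = rng.uniform(low=-6, high=6, size=(50, 2))  # synthetic "fraudulent" transactions
X = np.vstack([X_normal, X_fraud])
y_true = np.r_[np.zeros(950, dtype=int), np.ones(50, dtype=int)]
detectors = {
    'Isolation Forest': IsolationForest(contamination=0.05, random_state=42),
    'Local Outlier Factor': LocalOutlierFactor(n_neighbors=20, contamination=0.05),
}
for name, det in detectors.items():
    y_pred = (det.fit_predict(X) == -1).astype(int)  # -1 means anomaly
    print(name)
    print(classification_report(y_true, y_pred, target_names=['normal', 'fraud'], digits=3))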
Recommended Tutorials
- Supervised Learning: Deep dive into classification and regression algorithms, and master core techniques for model training and evaluation.
- Deep Learning Basics: Explore fundamental concepts and applications of neural networks and deep learning.
- Model Evaluation: Learn evaluation metrics and model selection methods to ensure model reliability and generalization ability.