Comparison of Dimensionality Reduction Methods

This notebook compares three dimensionality reduction techniques, PCA, Factor Analysis, and a neural-network autoencoder, on the MNIST digits dataset, examining their 2D embeddings and reconstruction quality.

Notebook Contents

This notebook covers:

  • Linear dimensionality reduction with PCA
  • Latent factor modeling with Factor Analysis
  • Nonlinear compression with a PyTorch autoencoder

Use the buttons above to download the notebook or open it in your preferred environment.

šŸ““ Notebook Preview

Comparing PCA, Factor Analysis, and Autoencoders

This notebook demonstrates three popular dimensionality reduction techniques:

  • PCA (Principal Component Analysis): Linear transformation that maximizes variance
  • FA (Factor Analysis): Identifies latent factors that explain correlations
  • AE (Autoencoder): Neural network that learns nonlinear representations

We'll use the MNIST dataset for comparison.

In [ ]:
# Import libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA, FactorAnalysis
from sklearn.datasets import fetch_openml
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader
# Set random seeds for reproducibility
np.random.seed(42)
torch.manual_seed(42)
plt.style.use('seaborn-v0_8-darkgrid')
%matplotlib inline

1. Load and Prepare Data

In [ ]:
# Load MNIST dataset (using a subset for faster computation)
print("Loading MNIST dataset...")
mnist = fetch_openml('mnist_784', version=1, parser='auto')
X, y = mnist.data[:5000], mnist.target[:5000]
# Convert to numpy arrays and normalize
X = np.array(X, dtype=np.float32) / 255.0
y = np.array(y, dtype=int)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"Training set: {X_train.shape}")
print(f"Test set: {X_test.shape}")
In [ ]:
# Visualize some examples
fig, axes = plt.subplots(2, 5, figsize=(12, 5))
for i, ax in enumerate(axes.flat):
    ax.imshow(X_train[i].reshape(28, 28), cmap='gray')
    ax.set_title(f"Label: {y_train[i]}")
    ax.axis('off')
plt.suptitle('Sample MNIST Digits', fontsize=14)
plt.tight_layout()
plt.show()

2. PCA (Principal Component Analysis)

PCA finds orthogonal directions of maximum variance in the data.

In [ ]:
# Apply PCA with 2 components for visualization
n_components = 2
pca = PCA(n_components=n_components)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)
print(f"Explained variance ratio: {pca.explained_variance_ratio_}")
print(f"Total variance explained: {pca.explained_variance_ratio_.sum():.4f}")
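
Since the principal components are the leading eigenvectors of the data covariance matrix, the "maximum variance" interpretation can be checked directly; the quick sketch below compares the fitted pca model against a plain eigendecomposition of the sample covariance.

In [ ]:
# Illustrative sanity check: the explained variance ratios reported by PCA
# should equal the top covariance eigenvalues divided by the total variance.
cov = np.cov(X_train, rowvar=False)              # (784, 784) sample covariance
eigvals = np.linalg.eigvalsh(cov)[::-1]          # eigenvalues, largest first
manual_ratio = eigvals[:n_components] / eigvals.sum()
print(f"Eigenvalue-based variance ratio: {manual_ratio}")
print(f"sklearn PCA variance ratio:      {pca.explained_variance_ratio_}")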
In [ ]:
# Visualize PCA embeddings
plt.figure(figsize=(10, 8))
scatter = plt.scatter(X_train_pca[:, 0], X_train_pca[:, 1], 
                     c=y_train, cmap='tab10', alpha=0.6, s=10)
plt.colorbar(scatter, label='Digit')
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.title('PCA: 2D Projection of MNIST')
plt.grid(True, alpha=0.3)
plt.show()
In [ ]:
# Reconstruction with PCA (using more components for better reconstruction)
pca_reconstruct = PCA(n_components=50)
X_train_pca_50 = pca_reconstruct.fit_transform(X_train)
X_reconstructed_pca = pca_reconstruct.inverse_transform(X_train_pca_50)
# Calculate reconstruction error
pca_mse = np.mean((X_train - X_reconstructed_pca) ** 2)
print(f"PCA Reconstruction MSE (50 components): {pca_mse:.6f}")

3. Factor Analysis

FA models the data as a linear combination of latent factors plus noise.

In [ ]:
# Apply Factor Analysis with 2 components
fa = FactorAnalysis(n_components=n_components, random_state=42)
X_train_fa = fa.fit_transform(X_train)
X_test_fa = fa.transform(X_test)
print(f"Factor Analysis noise variance (first 10 features): {fa.noise_variance_[:10]}")
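
Under the FA model, the data covariance decomposes into a low-rank term from the factor loadings plus a diagonal noise term. As a small illustrative check, this should match what the fitted model's get_covariance() returns.

In [ ]:
# Illustrative check: the model-implied covariance is W^T W + diag(noise_variance),
# which is what get_covariance() computes from the fitted parameters.
W = fa.components_                                  # shape (n_components, n_features)
implied_cov = W.T @ W + np.diag(fa.noise_variance_)
print(f"Matches fa.get_covariance(): {np.allclose(implied_cov, fa.get_covariance())}")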
In [ ]:
# Visualize FA embeddings
plt.figure(figsize=(10, 8))
scatter = plt.scatter(X_train_fa[:, 0], X_train_fa[:, 1], 
                     c=y_train, cmap='tab10', alpha=0.6, s=10)
plt.colorbar(scatter, label='Digit')
plt.xlabel('First Factor')
plt.ylabel('Second Factor')
plt.title('Factor Analysis: 2D Projection of MNIST')
plt.grid(True, alpha=0.3)
plt.show()
In [ ]:
# Reconstruction with Factor Analysis
# (scikit-learn's FactorAnalysis has no inverse_transform, so reconstruct
#  manually from the factor loadings: x_hat = z @ W + mean, with W = components_)
fa_reconstruct = FactorAnalysis(n_components=50, random_state=42)
X_train_fa_50 = fa_reconstruct.fit_transform(X_train)
X_reconstructed_fa = X_train_fa_50 @ fa_reconstruct.components_ + fa_reconstruct.mean_
# Calculate reconstruction error
fa_mse = np.mean((X_train - X_reconstructed_fa) ** 2)
print(f"FA Reconstruction MSE (50 components): {fa_mse:.6f}")

4. Autoencoder

A neural network that learns a compressed representation through backpropagation.

In [ ]:
# Define the Autoencoder architecture
class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, encoding_dim=2):
        super(Autoencoder, self).__init__()
        # Encoder
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Linear(128, encoding_dim)
        )
        # Decoder
        self.decoder = nn.Sequential(
            nn.Linear(encoding_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 256),
            nn.ReLU(),
            nn.Linear(256, input_dim),
            nn.Sigmoid()  # Output in [0, 1] range
        )
    def forward(self, x):
        encoded = self.encoder(x)
        decoded = self.decoder(encoded)
        return decoded
    def encode(self, x):
        return self.encoder(x)
# Initialize model
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
autoencoder = Autoencoder(input_dim=784, encoding_dim=2).to(device)
print(autoencoder)
In [ ]:
# Prepare data loaders
X_train_tensor = torch.FloatTensor(X_train)
X_test_tensor = torch.FloatTensor(X_test)
train_dataset = TensorDataset(X_train_tensor, X_train_tensor)
train_loader = DataLoader(train_dataset, batch_size=128, shuffle=True)
# Training setup
criterion = nn.MSELoss()
optimizer = optim.Adam(autoencoder.parameters(), lr=0.001)
# Training loop
num_epochs = 20
train_losses = []
print("Training Autoencoder...")
for epoch in range(num_epochs):
    autoencoder.train()
    epoch_loss = 0
    for batch_x, _ in train_loader:
        batch_x = batch_x.to(device)
        # Forward pass
        outputs = autoencoder(batch_x)
        loss = criterion(outputs, batch_x)
        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
    avg_loss = epoch_loss / len(train_loader)
    train_losses.append(avg_loss)
    if (epoch + 1) % 5 == 0:
        print(f"Epoch [{epoch+1}/{num_epochs}], Loss: {avg_loss:.6f}")
print("Training complete!")
In [ ]:
# Plot training loss
plt.figure(figsize=(10, 5))
plt.plot(train_losses, linewidth=2)
plt.xlabel('Epoch')
plt.ylabel('Loss (MSE)')
plt.title('Autoencoder Training Loss')
plt.grid(True, alpha=0.3)
plt.show()
In [ ]:
# Encode the data
autoencoder.eval()
with torch.no_grad():
    X_train_ae = autoencoder.encode(X_train_tensor.to(device)).cpu().numpy()
    X_test_ae = autoencoder.encode(X_test_tensor.to(device)).cpu().numpy()
# Visualize AE embeddings
plt.figure(figsize=(10, 8))
scatter = plt.scatter(X_train_ae[:, 0], X_train_ae[:, 1], 
                     c=y_train, cmap='tab10', alpha=0.6, s=10)
plt.colorbar(scatter, label='Digit')
plt.xlabel('First Encoding Dimension')
plt.ylabel('Second Encoding Dimension')
plt.title('Autoencoder: 2D Projection of MNIST')
plt.grid(True, alpha=0.3)
plt.show()
In [ ]:
# Reconstruction with Autoencoder
with torch.no_grad():
    X_reconstructed_ae = autoencoder(X_train_tensor.to(device)).cpu().numpy()
# Calculate reconstruction error
ae_mse = np.mean((X_train - X_reconstructed_ae) ** 2)
print(f"Autoencoder Reconstruction MSE (2D encoding): {ae_mse:.6f}")
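
The error summary later in the notebook compares a 2D autoencoder against 50-component PCA and FA, which is not a like-for-like comparison. As an extra reference point, the short cell below reuses the 2-component pca fitted in Section 2 to get the PCA reconstruction error at the same bottleneck size as the autoencoder.

In [ ]:
# Reference point: PCA reconstruction error with the same 2D bottleneck as the
# autoencoder, reusing the 2-component `pca` fitted earlier.
X_reconstructed_pca_2d = pca.inverse_transform(X_train_pca)
pca_2d_mse = np.mean((X_train - X_reconstructed_pca_2d) ** 2)
print(f"PCA Reconstruction MSE (2 components): {pca_2d_mse:.6f}")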

5. Comparison of Reconstructions

In [ ]:
# Visualize reconstructions
n_samples = 5
row_labels = ['Original', 'PCA\n(50 comp)', 'FA\n(50 comp)', 'Autoencoder\n(2D)']
row_images = [X_train, X_reconstructed_pca, X_reconstructed_fa, X_reconstructed_ae]
fig, axes = plt.subplots(4, n_samples, figsize=(15, 10))
for row, (label, images) in enumerate(zip(row_labels, row_images)):
    for i in range(n_samples):
        ax = axes[row, i]
        ax.imshow(images[i].reshape(28, 28), cmap='gray')
        # Hide ticks and frame but keep the axes on, so the row label below
        # still renders (axis('off') would suppress set_ylabel entirely)
        ax.set_xticks([])
        ax.set_yticks([])
        ax.set_frame_on(False)
    axes[row, 0].set_ylabel(label, fontsize=12, rotation=0, ha='right', va='center')
plt.suptitle('Reconstruction Comparison', fontsize=16, y=0.98)
plt.tight_layout()
plt.show()

6. Summary and Key Differences

In [ ]:
# Summary of reconstruction errors
print("="*60)
print("RECONSTRUCTION ERROR COMPARISON")
print("="*60)
print(f"PCA (50 components):        MSE = {pca_mse:.6f}")
print(f"Factor Analysis (50 comp):  MSE = {fa_mse:.6f}")
print(f"Autoencoder (2D encoding):  MSE = {ae_mse:.6f}")
print("="*60)
print()
print("KEY DIFFERENCES:")
print("-" * 60)
print("PCA:")
print("  • Linear transformation")
print("  • Maximizes variance")
print("  • Orthogonal components")
print("  • Fast and deterministic")
print()
print("Factor Analysis:")
print("  • Assumes latent factors + noise")
print("  • Models correlations between features")
print("  • Factors need not be orthogonal")
print("  • Provides noise variance estimates")
print()
print("Autoencoder:")
print("  • Nonlinear transformation")
print("  • Learns hierarchical features")
print("  • More flexible but computationally expensive")
print("  • Requires training (non-deterministic)")
print("="*60)
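
Reconstruction error is only one lens; another rough comparison is how well each 2D embedding separates the digit classes. The sketch below fits a simple k-nearest-neighbors classifier on each embedding (k=5 is an arbitrary choice) and reports test accuracy.

In [ ]:
# Rough downstream comparison (illustrative): 5-NN classification accuracy on
# each 2D embedding. Higher accuracy suggests better class separation.
from sklearn.neighbors import KNeighborsClassifier
embeddings = {
    'PCA': (X_train_pca, X_test_pca),
    'Factor Analysis': (X_train_fa, X_test_fa),
    'Autoencoder': (X_train_ae, X_test_ae),
}
for name, (Z_train, Z_test) in embeddings.items():
    knn = KNeighborsClassifier(n_neighbors=5).fit(Z_train, y_train)
    print(f"{name:15s} 2D embedding: 5-NN test accuracy = {knn.score(Z_test, y_test):.3f}")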

7. When to Use Each Method

Use PCA when:

  • You need a fast, deterministic solution
  • The data has linear structure
  • You want to understand variance explained
  • Orthogonal components are desired

Use Factor Analysis when:

  • You want to model latent factors
  • Understanding noise structure is important
  • You need a probabilistic interpretation
  • Data has correlations you want to explain

Use Autoencoders when:

  • Data has complex, nonlinear structure
  • You have sufficient training data
  • Computational resources allow training
  • You need very low-dimensional representations
  • Flexibility in architecture is valuable