PCA (Principal Component Analysis)

Definition

PCA is a linear dimensionality reduction technique that projects data onto a new coordinate system whose axes (the principal components) are ordered by the amount of variance they explain. The first principal component captures the most variance; each subsequent component captures the most remaining variance while staying orthogonal to all previous ones.
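
The definition can be stated in symbols. The notation here is an assumption introduced for illustration, not taken from this note: X is the centered data matrix with n rows (samples), Σ is its sample covariance, and w_k is the k-th principal direction.

```latex
\Sigma = \frac{1}{n-1} X^\top X, \qquad
w_1 = \arg\max_{\|w\| = 1} \; w^\top \Sigma \, w, \qquad
w_k = \arg\max_{\substack{\|w\| = 1 \\ w \,\perp\, w_1, \dots, w_{k-1}}} \; w^\top \Sigma \, w
```

The maximizers are the eigenvectors of Σ, and the attained values are the corresponding eigenvalues λ_1 ≥ λ_2 ≥ …, which is why the eigendecomposition yields both the directions and the variance each one explains.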

How It Works

  1. Standardize: subtract the mean and divide by the standard deviation, per feature
  2. Covariance matrix: d×d matrix (d = number of features) summarizing pairwise feature covariances; on standardized data these equal the correlations
  3. Eigendecomposition: eigenvectors = principal component directions; eigenvalues = variance explained along each direction
  4. Sort: order the eigenvectors by eigenvalue, descending
  5. Project: multiply the standardized data by the matrix of the top-k eigenvectors → reduced representation
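
The five steps can be sketched in NumPy roughly as follows. This is a minimal illustration, not a reference implementation: the function names `pca_fit` and `pca_transform` and the synthetic data are made up for this example, and production code would more likely use an SVD-based library routine.

```python
import numpy as np

def pca_fit(X, k):
    """Fit PCA on X (n_samples x n_features); hypothetical helper for this sketch."""
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=1)
    std[std == 0] = 1.0                        # guard against constant features
    Z = (X - mean) / std                       # step 1: standardize
    cov = np.cov(Z, rowvar=False)              # step 2: d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # step 3: eigh suits symmetric matrices
    order = np.argsort(eigvals)[::-1]          # step 4: sort by eigenvalue, descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    return mean, std, eigvecs[:, :k], eigvals

def pca_transform(X, mean, std, components):
    """Step 5: project standardized data onto the top-k eigenvectors."""
    return ((X - mean) / std) @ components

# Synthetic correlated data (an assumption for the demo): 200 samples, 5 features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))

mean, std, W, eigvals = pca_fit(X, k=2)
Y = pca_transform(X, mean, std, W)             # reduced representation, shape (200, 2)
explained = eigvals[:2].sum() / eigvals.sum()  # fraction of variance kept by 2 components
print(Y.shape, round(float(explained), 3))
```

The variance of each column of `Y` equals the matching eigenvalue, which is a quick sanity check that the projection lines up with the sorted eigenvectors.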

Properties

Property            Value
Type                Linear
Preserves           Global variance structure
Speed               Fast (one-pass, deterministic)
Parametric          Yes (reusable on new data)
Inverse transform   Yes (lossy)
Suitable for ML     Yes
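
The "Parametric" and "Inverse transform" rows can be illustrated with a short NumPy sketch. The data generator, the choice of k = 3, and the use of centering without scaling are all assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumption): 3 latent factors mixed into 10 features, plus small noise
mixing = rng.normal(size=(3, 10))

def sample(n):
    return rng.normal(size=(n, 3)) @ mixing + 0.05 * rng.normal(size=(n, 10))

X_train, X_new = sample(500), sample(50)

# Fit once: the learned parameters are just the mean and the top-k directions
mean = X_train.mean(axis=0)
_, _, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
k = 3
W = Vt[:k].T                                   # shape (10, k)

# Parametric: the same mean and W are reused on data not seen during fitting
Y_new = (X_new - mean) @ W                     # shape (50, k)

# Inverse transform: map back to the original space; lossy because k < 10
X_rec = Y_new @ W.T + mean
rel_err = np.linalg.norm(X_new - X_rec) / np.linalg.norm(X_new)
print(Y_new.shape, rel_err)
```

The reconstruction error is small here only because the synthetic data is close to rank 3 by construction; on real data it depends on how much variance the discarded components carry.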

Use in Vector Search

  • Can reduce embedding dimensions (e.g., 768→256) before ANN indexing, cutting memory and speeding up search with modest quality loss
  • Most useful when many directions in the embedding space carry near-zero variance, since those can be dropped with little loss (e.g., PCA preprocessing before building an HNSW index)
  • Complementary to Vector Quantization: dimensionality reduction cuts the number of dimensions, quantization cuts the bits per dimension, and the two are often combined
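
A rough way to see the memory and quality trade-off is to reduce synthetic 768-dimensional vectors to 256 dimensions and compare nearest-neighbor results before and after. Everything in this sketch is an assumption for illustration: the decaying variance spectrum, the corpus size, and the use of exact brute-force search as a stand-in for an ANN index such as HNSW. Real embeddings will behave differently.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 2000, 768, 256

# Synthetic "embeddings" whose variance decays across directions (assumed spectrum)
scales = (1.0 + np.arange(d)) ** -0.75
basis = np.linalg.qr(rng.normal(size=(d, d)))[0]       # random orthogonal rotation

def make_vectors(count):
    return ((rng.normal(size=(count, d)) * scales) @ basis).astype(np.float32)

E, Q = make_vectors(n), make_vectors(50)               # corpus and queries

# Fit PCA on the corpus and keep the top-k directions
mean = E.mean(axis=0)
_, _, Vt = np.linalg.svd(E - mean, full_matrices=False)
W = Vt[:k].T
E_red, Q_red = (E - mean) @ W, (Q - mean) @ W

def top_neighbors(queries, corpus, top=10):
    """Exact nearest neighbors by squared Euclidean distance (stand-in for an ANN index)."""
    d2 = ((queries ** 2).sum(axis=1)[:, None]
          - 2.0 * queries @ corpus.T
          + (corpus ** 2).sum(axis=1)[None, :])
    return np.argsort(d2, axis=1)[:, :top]

full, reduced = top_neighbors(Q, E), top_neighbors(Q_red, E_red)

# Recall@10: share of the full-dimensional neighbors still found after reduction
recall = np.mean([len(set(a) & set(b)) / 10 for a, b in zip(full, reduced)])
print(E.nbytes, E_red.nbytes, recall)
```

Storage drops in proportion to the dimension count (a third of the original here), while recall shows how much neighbor quality was given up. Running the same comparison on a sample of real embeddings is a reasonable way to choose k.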

Limitations

  • Linear only — cannot capture non-linear manifolds in the data
  • Information loss is unavoidable unless every discarded component has a zero eigenvalue
  • Principal components are hard to interpret in original feature terms

Related

  • Dimensionality Reduction — parent concept; also covers t-SNE, UMAP, Matryoshka
  • t-SNE — non-linear alternative for visualization
  • UMAP — non-linear alternative with parametric option
  • Matryoshka Embeddings — training-time alternative; dimension-flexible without projection
  • Vector Quantization — complementary compression approach
  • HNSW — ANN index that benefits from reduced dimensionality

Articles