Steering Vectors

A direction in a model’s activation space that, when added to the hidden states at inference time, shifts the model’s behavior toward (or away from) a target attribute — tone, refusal, a topic, a style. Constructed by contrasting activations on examples that have the attribute against those that don’t (closely related to concept activation vectors), then adding a scaled copy of that difference vector during the forward pass. The basis of “activation steering” / representation engineering for LLMs.

Steering vectors rely on the linear representation view: meaningful concepts are directions in a point-vector space. This is the same point-vector paradigm that region representations like box embeddings extend — boxes add volume and containment to “meaning” instead of treating it purely as a direction.

Note: an LLM interpretability / control technique, related here thematically rather than from the box-embedding article — both are about imposing useful geometric structure on learned representations.

  • Concept Vectors — the concept directions steering vectors add
  • Embeddings — the representation space steering operates on
  • Box Embedding — region-based view of meaning vs. direction-based steering

Techniques (Representation Engineering)

In representation engineering, steering splits into two stages: representation reading (find the concept direction from contrast pairs, e.g. via LAT) and representation steering (inject it). Common steering methods:

  • Contrastive Activation Addition — add a scaled contrast-derived vector to the residual stream during the forward pass.
  • Spectral editing — more efficient targeted edits.
  • Mean-centering — yields more effective steering directions.

Reported uses: reducing sycophancy, improving truthfulness, preference alignment, and jailbreak detection.

Reference