How to Write a Spelling Corrector

Source: http://norvig.com/spell-correct.html Author: Peter Norvig (Google)

Summary

Classic ~21-line Python implementation of a probabilistic spelling corrector. The canonical reference for the noisy-channel model of spelling correction.

Algorithm

Bayes’ theorem to find the most probable correction:

argmax_c ∈ candidates P(c) * P(w|c)

Where:

  • P(c) = language model (probability of the word in a corpus)
  • P(w|c) = error model (probability of typing w when intending c)

Key Components

Candidate Generation

Generates candidates within edit distance 2 using four operations: deletion, transposition, replacement, insertion.

For a 9-character word: 442 candidates at edit distance 1, but after filtering to dictionary words → just a handful of valid corrections.

Language Model

Word probabilities from ~1.1M word corpus (Project Gutenberg + frequency lists). Priority: known words at edit distance 1 > edit distance 2 > unknown words.

Simplified Error Model

Prefers lower edit distance. Does not model keyboard proximity or phonetic similarity.

Results

  • Development set: 75% accuracy
  • Final test set: 68% accuracy
  • ~35-41 words/second

Limitations

  • No context awareness (surrounding words ignored)
  • Unknown words (5-11% of test cases)
  • Oversimplified error model (no keyboard layout, no phonetics)

Historical Significance

This paper launched a thousand spelling corrector implementations. Wolf Garbe built SymSpell specifically to address its O(n) computational cost at scale.