Character Filtering
Part of the Query Understanding series by Daniel Tunkelang.
Overview
Character filtering is the first normalization step applied to raw query text before any higher-level processing can take place. Users enter queries in inconsistent ways — varying capitalization, punctuation, accents, and special characters — and the search system must reconcile these variations into a consistent representation. The core challenge is deciding how aggressively to normalize: stripping accents and lowercasing improves matching across spelling variations, but some characters carry meaning and should be preserved. A term that includes punctuation as part of its identity means something very different if that punctuation is removed. The right approach depends heavily on the domain and the kinds of queries users actually submit.
Note: Article content behind Medium paywall — accessible at source URL with Medium account.