Language Identification
Part of the Query Understanding series by Daniel Tunkelang.
Overview
Before a search system can apply language-specific processing to a query — choosing the right stemmer, spell-checker, or stop-word list — it needs to know what language the query is written in. This step is straightforward for monolingual systems but becomes essential in any multilingual context. Short queries, which are the norm in search, are particularly difficult to classify with confidence because they contain little signal. Complicating matters further, users sometimes mix languages within a single query, and brand names or proper nouns can appear identical across languages. Getting language identification right is a prerequisite for the accuracy of almost every downstream processing step.
Note: Article content behind Medium paywall — accessible at source URL with Medium account.