Multilingual Search

Serving search across multiple languages — or mixed-language queries — requires solutions at every layer of the stack: tokenization, retrieval, ranking, and evaluation. A system that works well in English will degrade sharply when queries and documents don’t share the same language.

Why Multilingual Search Is Hard

Challenge	Root Cause
Vocabulary mismatch	Query in Language A, document in Language B: no shared terms, so BM25 scores are at or near zero
Script diversity	Latin, CJK, Arabic, Devanagari need different tokenizers
Mixed-language queries	“iPhone 13 price RM” mixes an English product name with the Malaysian ringgit abbreviation (RM)
Low-resource languages	Sparse training data; embeddings may not generalize
Transliteration	Same word spelled multiple ways across scripts
Evaluation gap	Judgment collection is expensive in each language

Retrieval Approaches

1. Machine Translation (MT) Pipeline

Translate queries into a single canonical language (usually English), then search a monolingual index.

Pros: Reuses existing monolingual infrastructure.
Cons: Translation latency; errors compound; loses rare-word precision.

2. Per-Language Indexes + Federated Search

Maintain separate indexes per language, fan out the query, merge results.

Pros: Each index is optimized for its language.
Cons: Infrastructure cost; merging scores across heterogeneous indexes is non-trivial.
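
One simple way to make scores from different indexes comparable is to min-max normalize each result list before merging. The following is a minimal sketch, assuming each per-language index returns (doc_id, score) pairs; the function names and sample scores are hypothetical.

```python
from typing import Dict, List, Tuple

Hit = Tuple[str, float]  # (doc_id, raw score from one index)


def min_max_normalize(hits: List[Hit]) -> List[Hit]:
    """Rescale one index's scores to [0, 1] so lists become comparable."""
    if not hits:
        return []
    scores = [s for _, s in hits]
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All scores equal: treat every hit as equally good.
        return [(doc_id, 1.0) for doc_id, _ in hits]
    return [(doc_id, (s - lo) / (hi - lo)) for doc_id, s in hits]


def federated_merge(
    results_by_lang: Dict[str, List[Hit]], k: int = 10
) -> List[Tuple[str, str, float]]:
    """Merge per-language result lists into one ranking of (lang, doc_id, score)."""
    merged = []
    for lang, hits in results_by_lang.items():
        for doc_id, score in min_max_normalize(hits):
            merged.append((lang, doc_id, score))
    # Highest normalized score first; ties broken by doc_id for determinism.
    merged.sort(key=lambda t: (-t[2], t[1]))
    return merged[:k]


# Made-up raw scores from two per-language indexes.
results = {
    "en": [("en-1", 12.0), ("en-2", 9.0), ("en-3", 6.0)],
    "ms": [("ms-1", 3.0), ("ms-2", 2.0)],
}
top = federated_merge(results, k=3)
```

Min-max normalization discards absolute relevance: the best hit from a weak index looks as good as the best hit from a strong one. That is part of why merging is non-trivial, and rank-based fusion or a learned merger may be a better fit.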

3. Multilingual Dense Retrieval

Encode queries and documents with a shared multilingual encoder (e.g., mBERT, XLM-R, E5-multilingual, Qwen3 multilingual). Queries and documents from any language land in the same vector space.

Pros: Single index; cross-lingual retrieval without translation; handles mixed-language queries naturally.
Cons: Embedding quality varies by language; encoding typically needs GPU infrastructure; ANN index memory grows with document count and embedding dimension.

Key models (2024–2025):

intfloat/multilingual-e5-large — strong general-purpose, 100+ languages
Alibaba-NLP/gte-Qwen2-1.5B-instruct / Qwen3 multilingual — state-of-the-art on MTEB multilingual
cohere/multilingual-22-12 / embed-multilingual-v3.0 — production-grade via API
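
The shared-vector-space idea can be illustrated with cosine similarity over toy vectors. This is a sketch only: the 3-dimensional vectors below are made up, and in practice they would come from one of the encoders listed above.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def dense_search(
    query_vec: List[float], doc_vecs: Dict[str, List[float]], k: int = 2
) -> List[Tuple[str, float]]:
    """Brute-force nearest-neighbour search; an ANN index replaces this at scale."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda t: t[1], reverse=True)
    return scored[:k]


# Hypothetical embeddings. A good multilingual encoder would place an English
# and a Korean listing for the same product close together, as mimicked here.
doc_vecs = {
    "en-shoes": [0.9, 0.1, 0.0],
    "ko-shoes": [0.8, 0.2, 0.1],
    "en-charger": [0.0, 0.1, 0.9],
}
query_vec = [0.85, 0.15, 0.05]  # stand-in for an encoded query in a third language
hits = dense_search(query_vec, doc_vecs, k=2)
```

Both shoe listings are retrieved regardless of their language because ranking depends only on vector proximity, not on shared tokens.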

4. Hybrid: BM25 + Multilingual Dense

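Run BM25 per language alongside a multilingual dense retriever, then fuse the result lists with Reciprocal Rank Fusion (RRF) or learned weights. BM25 keeps exact-match precision for names and rare terms, while the dense retriever adds cross-lingual recall.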
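
RRF fusion can be sketched in a few lines. The example below assumes each retriever returns an ordered list of document IDs; the IDs are hypothetical, and k = 60 is the constant commonly used with RRF rather than a tuned value.

```python
from collections import defaultdict
from typing import Dict, List


def rrf_fuse(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc_id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


bm25_hits = ["d1", "d2", "d3"]   # lexical ranking (hypothetical)
dense_hits = ["d3", "d1", "d4"]  # dense ranking (hypothetical)
fused = rrf_fuse([bm25_hits, dense_hits])
```

Because RRF uses only ranks, it sidesteps the problem of BM25 and cosine scores living on different scales.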

Tokenization Matters

BM25 and sparse models depend heavily on tokenizers:

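CJK (Chinese, Japanese, Korean): Chinese and Japanese have no whitespace word boundaries, so use character n-grams or a dictionary-based segmenter; Korean is space-delimited but benefits from morphological analysis.
Arabic/Hebrew: Right-to-left scripts; morphologically rich, so stemming or lemmatization is critical.
Agglutinative languages (Finnish, Turkish, Hungarian): Stemming or morphological analysis, plus compound splitting where compounds are common, is needed to avoid vocabulary explosion.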

Elasticsearch and OpenSearch ship language-specific analyzers. For neural models, subword tokenizers (SentencePiece, BPE) operate on raw text and cover multiple scripts without language-specific rules.
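
Character n-gram tokenization for unsegmented text can be sketched as follows. This is an illustrative helper (the name char_ngrams is hypothetical), not the implementation of any particular analyzer.

```python
from typing import List


def char_ngrams(text: str, n: int = 2) -> List[str]:
    """Split text into overlapping character n-grams, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    if len(chars) < n:
        # Too short for an n-gram: keep the text as a single token.
        return ["".join(chars)] if chars else []
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]


tokens = char_ngrams("東京都庁")  # "Tokyo Metropolitan Government Office"
# tokens == ["東京", "京都", "都庁"]
```

The middle bigram 京都 ("Kyoto") shows the trade-off: overlapping n-grams need no dictionary and give high recall, but they also produce tokens that match unrelated queries.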

Mixed-Language Queries

Common in multilingual markets (Southeast Asia, India, Switzerland, Belgium):

"iPhone 15 harga terbaik"        # English brand + Malay price query
"Nike running shoes 남성용"       # English brand + Korean "for men"

Solutions:

Language detection per token (not just per query)
Multilingual embeddings handle this naturally
At ranking time, score documents in the query’s dominant language higher
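
Per-token detection can start with something as simple as tagging each token by Unicode script. The sketch below uses only the standard library and is a rough proxy: script is not language, so it separates Korean from English but cannot tell Malay from English, which needs a per-token language identification model. The SCRIPTS list and function names are assumptions for illustration.

```python
import unicodedata
from collections import Counter
from typing import List, Tuple

# Script prefixes as they appear in Unicode character names (partial list).
SCRIPTS = (
    "LATIN", "HANGUL", "CJK", "HIRAGANA", "KATAKANA",
    "CYRILLIC", "ARABIC", "DEVANAGARI", "THAI",
)


def token_script(token: str) -> str:
    """Return the most common script among a token's letters, or 'OTHER'."""
    counts: Counter = Counter()
    for ch in token:
        if not ch.isalpha():
            continue  # skip digits and punctuation
        name = unicodedata.name(ch, "")
        for script in SCRIPTS:
            if name.startswith(script):
                counts[script] += 1
                break
    return counts.most_common(1)[0][0] if counts else "OTHER"


def tag_query(query: str) -> List[Tuple[str, str]]:
    """Tag each whitespace-separated token with its dominant script."""
    return [(tok, token_script(tok)) for tok in query.split()]


tags = tag_query("Nike running shoes 남성용")
# [("Nike", "LATIN"), ("running", "LATIN"), ("shoes", "LATIN"), ("남성용", "HANGUL")]
```

The per-token tags can then drive routing, for example sending the Hangul token to a Korean analyzer while treating the Latin tokens as brand or product terms.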

Carousell serves mixed-language queries across Southeast Asian markets (Bahasa Indonesia, Tagalog, Thai, English) with dense vector retrieval to bridge vocabulary gaps in sponsored search.

Evaluation

Multilingual evaluation is expensive — you need annotators who speak each language.

Practical strategies:

Use query logs + click data as a language-agnostic signal (CTR, reformulation rate)
Stratified sampling by language in your query test set (see Query Sampling)
Automated MT + back-translation to generate cross-lingual test pairs cheaply
MTEB multilingual benchmark for model selection offline

Watch for language imbalance: a model may score well on average while failing on low-resource languages that represent a real user population.
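
The imbalance effect is easy to see by comparing a traffic-weighted average with a per-language breakdown. A minimal sketch with made-up metric values (the metric could be nDCG@10 or any per-query score):

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def per_language_means(rows: List[Tuple[str, float]]) -> Dict[str, float]:
    """Average a per-query metric separately for each language."""
    buckets: Dict[str, List[float]] = defaultdict(list)
    for lang, score in rows:
        buckets[lang].append(score)
    return {lang: sum(v) / len(v) for lang, v in buckets.items()}


# (language, metric) per query; values are invented for illustration.
rows = [("en", 0.8)] * 8 + [("th", 0.2)] * 2

micro = sum(s for _, s in rows) / len(rows)   # about 0.68, dominated by English
by_lang = per_language_means(rows)            # en about 0.8, th about 0.2
macro = sum(by_lang.values()) / len(by_lang)  # about 0.5, each language counts equally
```

Reporting the per-language table and the macro average alongside the overall number makes a failing language visible instead of letting high-traffic languages mask it.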

Common Mistakes

English-first architecture: Hard to retrofit multilingual later; build multilingual retrieval from day one if you serve non-English markets.
Ignoring transliteration: “Москва” and “Moskva” are the same city — synonyms or normalization needed.
Single-language judgment pool: Offline eval that only covers English hides quality gaps in other languages.
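Token budget mismatch: Cross-encoder rerankers have fixed context limits. CJK text often consumes more tokens per character than Latin-script text, so long CJK documents get truncated if the budget is estimated from character counts.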
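
For the transliteration point, one option is to normalize both query and index terms to a common script before matching. The sketch below uses a deliberately partial Cyrillic-to-Latin table that covers only the letters in the example; a production system would more likely rely on a full transliteration standard or a library such as ICU.

```python
# Partial Cyrillic-to-Latin table (illustrative only, not a complete standard).
CYR_TO_LAT = {
    "а": "a", "в": "v", "к": "k", "м": "m", "о": "o", "с": "s",
}


def normalize(term: str) -> str:
    """Lowercase and map known Cyrillic letters to Latin; leave others unchanged."""
    return "".join(CYR_TO_LAT.get(ch, ch) for ch in term.lower())


same = normalize("Москва") == normalize("Moskva")  # both become "moskva"
```

Applying the same normalization at index time and query time lets either spelling match the other without maintaining a synonym list per term.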
