Multilingual Embedding Model: Hybrid Search Reranking

Author: Quynh Nguyen (Elastic)

Setup

The walkthrough uses Elastic’s pre-trained multilingual E5 model (.multilingual-e5-small_linux-x86_64_search) on a COCO captions dataset with captions in four languages: English, German, Italian, and Vietnamese.

Cross-Lingual Retrieval

Searching for “kitty” in English returns results in German, Italian, and Vietnamese, because the model represents meaning in a shared semantic space rather than matching on surface vocabulary.

Notably, searching in Korean (고양이) returns German and Vietnamese results even though no Korean documents are indexed. The model bridges vocabulary gaps across languages automatically.
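
A cross-lingual query of this kind can be sketched as a kNN search that embeds the query text at search time. This is a minimal sketch, not the author's exact request: the field names description_embedding, description, and language are hypothetical, and the k and num_candidates values are illustrative. Only the model ID comes from the setup above.

```python
# Deployment ID taken from the Setup section.
MODEL_ID = ".multilingual-e5-small_linux-x86_64_search"


def build_knn_query(query_text, k=10, num_candidates=100):
    """Build a search body that embeds query_text with the E5 model and runs kNN."""
    return {
        "knn": {
            "field": "description_embedding",  # hypothetical dense_vector field
            "k": k,
            "num_candidates": num_candidates,
            "query_vector_builder": {
                "text_embedding": {
                    "model_id": MODEL_ID,
                    "model_text": query_text,
                }
            },
        },
        # Hypothetical source fields: the caption text and its language tag.
        "_source": ["description", "language"],
    }


# The same builder works for any query language, including one with no indexed documents.
english_body = build_knn_query("kitty")
korean_body = build_knn_query("고양이")
```

The resulting dictionary would be sent as the body of a search request against the captions index. Because only the query text changes, the language of the query never has to be declared.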

Hybrid Search + Reranking Pipeline

When the top-1 result from pure vector search is not precise enough (e.g., “What color is the cat?” asked in Vietnamese), wrapping a reciprocal rank fusion (RRF) hybrid retriever in a Cohere reranking step improves precision:

"retriever": {
  "text_similarity_reranker": {
    "retriever": {
      "rrf": {
        "retrievers": [{"knn": {...}}],
        "rank_window_size": 100
      }
    },
    "field": "description",
    "inference_id": "cohere_rerank",
    "inference_text": "con mèo màu gì?"
  }
}
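
To illustrate what the rrf retriever does with its child retrievers, here is a minimal sketch of reciprocal rank fusion. It assumes two ranked lists (one lexical, one vector) with made-up document IDs, and a rank constant of 60, which is the Elasticsearch default; the source query elides its retriever details, so none of these values are taken from it.

```python
from collections import defaultdict


def rrf_fuse(rankings, rank_constant=60, rank_window_size=100):
    """Fuse ranked lists of document IDs with reciprocal rank fusion.

    Each document scores sum(1 / (rank_constant + rank)) over every list
    in which it appears within the first rank_window_size positions.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking[:rank_window_size], start=1):
            scores[doc_id] += 1.0 / (rank_constant + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical result lists from a lexical retriever and a kNN retriever.
lexical = ["doc_a", "doc_b", "doc_c"]
vector = ["doc_b", "doc_d", "doc_a"]

fused = rrf_fuse([lexical, vector])
fused_ids = [doc_id for doc_id, _ in fused]
print(fused_ids)  # ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

Documents that rank well in both lists (doc_b, doc_a) rise above documents that appear in only one. The fused list is what the outer text_similarity_reranker then rescores against the query text.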

Cohere rerank-v3.5 serves as the cross-encoder reranker, exposed through the cohere_rerank inference endpoint referenced in the query.
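
The cohere_rerank inference ID has to exist before the search runs. A minimal sketch of the endpoint definition follows, assuming it is created through the Elasticsearch inference API with the cohere service; the COHERE_API_KEY environment variable and the placeholder key are assumptions, and the source does not show how the author registered the endpoint.

```python
import os

# Path for a rerank-type inference endpoint named after the ID used in the query.
INFERENCE_PATH = "/_inference/rerank/cohere_rerank"


def cohere_rerank_endpoint(api_key):
    """Build the request body for a Cohere rerank inference endpoint."""
    return {
        "service": "cohere",
        "service_settings": {
            "api_key": api_key,
            "model_id": "rerank-v3.5",
        },
    }


# COHERE_API_KEY is a hypothetical variable name; the fallback is a placeholder.
endpoint_body = cohere_rerank_endpoint(
    os.environ.get("COHERE_API_KEY", "<your-api-key>")
)
```

Sending endpoint_body with a PUT request to INFERENCE_PATH would register the reranker once; subsequent searches then refer to it only by its inference ID.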

Unexpected Win

Vector search surfaced a “brown striped cat” document even though the English reference caption missed that detail. This suggests that cross-lingual vector search can compensate for omissions in dataset labels.