Classic ML to Cope with Dumb LLM Judges

Doug Turnbull shows how to combine many “dumb” single-attribute LLM judges into a smarter relevance prediction by treating their outputs as features for a classical ML decision tree classifier. Using the Wayfair WANDS open e-commerce dataset as ground truth, he evaluates a local LLM (run on a laptop) across four product attributes — name, description, category taxonomy, and classification — in multiple prompt variants, then trains a scikit-learn decision tree on the resulting feature table to predict human pairwise preference.


Core Problem

Individual LLM judges are noisy. Forcing a judgment on every pair yields ~75% precision; allowing “Neither” and double-checking (asking both orderings) raises precision to ~91% but cuts recall to ~12%. The tradeoff is fundamental: you can have confident decisions on few pairs, or uncertain decisions on all.
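The tradeoff above can be made concrete with a small scoring helper. This is a minimal sketch, assuming "recall" here means the share of pairs on which the judge commits to LHS or RHS (called coverage in the code) and that precision is measured only over those committed pairs. The function name and the toy data are hypothetical.

```python
def precision_and_coverage(predictions, labels):
    """Score a judge that may abstain with 'Neither'.

    predictions: list of 'LHS', 'RHS', or 'Neither'
    labels:      list of 'LHS' or 'RHS' (human preference)
    """
    # Keep only the pairs where the judge committed to an answer.
    decided = [(p, y) for p, y in zip(predictions, labels) if p != "Neither"]
    if not decided:
        return 0.0, 0.0
    correct = sum(1 for p, y in decided if p == y)
    precision = correct / len(decided)      # accuracy on committed pairs
    coverage = len(decided) / len(labels)   # share of pairs with a decision
    return precision, coverage


# Hypothetical toy data: the judge abstains on two of five pairs.
preds = ["LHS", "Neither", "RHS", "Neither", "LHS"]
human = ["LHS", "RHS", "LHS", "LHS", "LHS"]
precision, coverage = precision_and_coverage(preds, human)
```

Abstaining more often shrinks the committed set, which tends to raise precision and lower coverage.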

The ML Reframe

Each (query, LHS product, RHS product) triple evaluated across all attribute × variant combinations produces a feature row:

Query          | LHS                | RHS                | Name    | Name (dbl chk) | Desc | Category | Human Label
entrance table | aleah coffee table | marta coffee table | Neither | LHS            | LHS  | RHS      | LHS

This is a classification problem: the features are the per-attribute LLM predictions and the label is the human preference. A decision tree learns how to combine the individual judgments, including which ones to trust first.
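One way to set this up with scikit-learn is sketched below. The column names, the ordinal encoding (LHS = -1, Neither = 0, RHS = 1), and every row except the first are assumptions made for illustration. The source does not say how the features were encoded.

```python
from sklearn.tree import DecisionTreeClassifier

FEATURES = ["name", "name_dbl_chk", "desc", "category"]

# First row mirrors the example above; the rest are hypothetical toy rows.
rows = [
    {"name": "Neither", "name_dbl_chk": "LHS", "desc": "LHS", "category": "RHS", "human": "LHS"},
    {"name": "RHS", "name_dbl_chk": "RHS", "desc": "RHS", "category": "RHS", "human": "RHS"},
    {"name": "LHS", "name_dbl_chk": "LHS", "desc": "Neither", "category": "LHS", "human": "LHS"},
    {"name": "Neither", "name_dbl_chk": "Neither", "desc": "RHS", "category": "RHS", "human": "RHS"},
    {"name": "LHS", "name_dbl_chk": "LHS", "desc": "LHS", "category": "Neither", "human": "LHS"},
    {"name": "RHS", "name_dbl_chk": "RHS", "desc": "Neither", "category": "Neither", "human": "RHS"},
]

# Assumed encoding: judge outputs become numbers the tree can split on.
ENCODE = {"LHS": -1, "Neither": 0, "RHS": 1}

X = [[ENCODE[row[f]] for f in FEATURES] for row in rows]
y = [row["human"] for row in rows]

clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X, y)

# Predict the human preference for the example row.
example = [[ENCODE["Neither"], ENCODE["LHS"], ENCODE["LHS"], ENCODE["RHS"]]]
pred = clf.predict(example)
```

On real data the fit would be evaluated on held-out pairs rather than on the training rows used here.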

Results

  • Best single variant: 91.72% precision / 65.2% recall (uber prompt, force+double-check)
  • Decision tree top result: 96.7% precision on 40% of pairs (5-feature combination)
  • Extreme high-precision mode: 100% precision on 1.3% of pairs

Why Decision Trees?

Trees are interpretable — you can dump the learned structure and read off which attributes the classifier prioritizes (e.g., category preference evaluated before name preference). This doubles as an exploratory tool for understanding what drives relevance in your domain.
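Dumping the learned structure can be done with scikit-learn's `export_text`. The sketch below uses hypothetical toy data, built so that the category judgment is the most informative feature, and the same assumed encoding of LHS = -1, Neither = 0, RHS = 1.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

FEATURES = ["category", "name"]

# Hypothetical rows: [category judgment, name judgment] -> human label.
X = [[-1, -1], [-1, 1], [1, -1], [1, 1], [0, -1], [0, 1]]
y = ["LHS", "LHS", "RHS", "RHS", "LHS", "RHS"]

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X, y)

# Text rendering of the tree: each indented line is one split or leaf.
report = export_text(clf, feature_names=FEATURES)
print(report)
```

Features that appear nearer the root are the ones the tree consults first, which is the reading described above.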

Key Insight

Local LLMs can serve as ML feature generators for relevance: keep each LLM call dumb, simple, and cacheable; aggregate with fast classical ML at the end. This avoids one expensive “uber prompt” while matching or exceeding its precision.
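The "dumb, simple, and cacheable" idea might look like the sketch below. `fake_llm_judge` is a hypothetical stand-in that counts shared words instead of calling a model, and the in-memory `lru_cache` is only one possible caching choice. Neither is taken from the source.

```python
from functools import lru_cache

CALLS = {"count": 0}  # counts how often the (stand-in) model is invoked


def fake_llm_judge(query, lhs_value, rhs_value):
    """Hypothetical stand-in for a local LLM call on a single attribute.

    Prefers whichever side shares more words with the query.
    """
    CALLS["count"] += 1
    query_words = set(query.lower().split())
    lhs_overlap = len(query_words & set(lhs_value.lower().split()))
    rhs_overlap = len(query_words & set(rhs_value.lower().split()))
    if lhs_overlap == rhs_overlap:
        return "Neither"
    return "LHS" if lhs_overlap > rhs_overlap else "RHS"


@lru_cache(maxsize=None)
def judge_attribute(attribute, query, lhs_value, rhs_value):
    # The arguments form the cache key, so a repeated question costs nothing.
    return fake_llm_judge(query, lhs_value, rhs_value)


first = judge_attribute("name", "entrance table", "aleah coffee table", "marta coffee table")
second = judge_attribute("name", "entrance table", "aleah coffee table", "marta coffee table")
```

Because each judgment depends only on one attribute of one pair, cached answers can be reused across experiments with different feature combinations.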

Prompt Variants Tested

  • Force vs. Allow Neither
  • Single pass vs. Double-check (swap LHS/RHS, require consistent answer)
  • Four fields: product name, description, category hierarchy, product classification
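The double-check variant can be sketched as a wrapper around any single-attribute judge. Mapping a disagreement between the two orderings to "Neither" is an assumption here, and both toy judges are hypothetical.

```python
def double_check(judge, query, lhs, rhs):
    """Ask the judge in both orderings and keep only consistent answers."""
    forward = judge(query, lhs, rhs)
    backward = judge(query, rhs, lhs)
    # Translate the swapped-order answer back into the original orientation.
    flipped = {"LHS": "RHS", "RHS": "LHS", "Neither": "Neither"}[backward]
    return forward if forward == flipped else "Neither"


def position_biased_judge(query, a, b):
    # Hypothetical judge that always prefers whatever is shown first.
    return "LHS"


def consistent_judge(query, a, b):
    # Hypothetical judge that prefers the shorter product name.
    return "LHS" if len(a) < len(b) else "RHS"


biased = double_check(position_biased_judge, "desk", "oak desk", "large walnut dining table")
stable = double_check(consistent_judge, "desk", "oak desk", "large walnut dining table")
```

A judge that only reacts to position is filtered out, while one whose preference follows the product survives the swap.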

People

  • Doug Turnbull — author; OpenSource Connections; softwaredoug.com