LLM-as-judge bias in DPO pair selection harms voice fidelity evaluation and promotes distributional regressions

LLM-as-judge scoring (e.g. Claude evaluating voice fidelity) contaminates DPO pair selection when the judge's own stylistic preferences decide which sample is 'chosen' and which is 'rejected'. When the judge carries a high weight (60%) in the combined score, its bias dominates model promotion decisions. The judge also cannot detect distributional regressions such as em-dash overuse or mode collapse, because it scores each sample individually and never sees batch-level statistics.
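To illustrate the kind of signal a per-sample judge never sees, here is a minimal sketch of two batch-level statistics. The function names and the choice of statistics (em-dash rate per 1,000 characters, distinct n-gram ratio as a rough mode-collapse indicator) are assumptions made for illustration, not part of the original pipeline.

```python
def em_dash_rate(texts):
    """Em-dashes per 1,000 characters, pooled over the whole batch."""
    total_chars = sum(len(t) for t in texts)
    if total_chars == 0:
        return 0.0
    dashes = sum(t.count("\u2014") for t in texts)
    return 1000.0 * dashes / total_chars


def distinct_n(texts, n=2):
    """Fraction of unique word n-grams across the batch.

    Values near 0 mean the batch keeps repeating the same phrases,
    which is one rough symptom of mode collapse.
    """
    grams = []
    for t in texts:
        words = t.split()
        grams.extend(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return len(set(grams)) / len(grams) if grams else 0.0
```

Comparing these numbers between a generated batch and the reference corpus can flag a regression even when every individual sample looks acceptable to the judge.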

1 solution


✓ ACCEPTED

Add embedding-based distributional metrics as a judge-independent scoring component. Per-sample: compute the cosine similarity between each generated sample and the precomputed corpus embeddings, then take the mean of the top 5 matches (embeddings from all-MiniLM-L6-v2). Batch-level: compute the maximum mean discrepancy (MMD) with an RBF kernel between the generated-text embeddings and the corpus embeddings. Key design: _compute_combined() redistributes weights automatically. When distributional embeddings aren't available (sentence-transformers is not installed), their weight flows back to the other components. This keeps the metric optional in local dev, while in Modal GPU environments, where sentence-transformers is installed, it always contributes to the score.
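A minimal sketch of the three pieces, under stated assumptions. Embeddings are taken as plain lists of floats that were already computed elsewhere (for example with sentence-transformers' `SentenceTransformer("all-MiniLM-L6-v2").encode(texts)`). The names `topk_mean_cosine`, `mmd2_rbf`, and `combine_scores` are hypothetical; `combine_scores` is a stand-in for `_compute_combined()`, not the actual implementation, and proportional renormalization is only one plausible way to redistribute weight. The code is pure Python for portability; a real pipeline would likely vectorize it with NumPy.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def topk_mean_cosine(sample, corpus, k=5):
    """Per-sample score: mean of the k highest cosine similarities to the corpus."""
    if not corpus:
        raise ValueError("corpus embeddings are empty")
    sims = sorted((cosine(sample, c) for c in corpus), reverse=True)
    top = sims[:k]
    return sum(top) / len(top)


def rbf(a, b, gamma):
    """RBF kernel k(a, b) = exp(-gamma * ||a - b||^2)."""
    return math.exp(-gamma * sum((x - y) ** 2 for x, y in zip(a, b)))


def mmd2_rbf(xs, ys, gamma=1.0):
    """Batch-level score: biased estimate of squared MMD between two embedding sets.

    gamma is an assumed bandwidth; in practice it needs tuning
    (for example with a median-distance heuristic).
    """
    def mean_k(p, q):
        return sum(rbf(a, b, gamma) for a in p for b in q) / (len(p) * len(q))

    return mean_k(xs, xs) + mean_k(ys, ys) - 2.0 * mean_k(xs, ys)


def combine_scores(scores, weights):
    """Weighted mean over available components.

    A component whose score is None (e.g. embeddings unavailable) is dropped,
    and the remaining weights are renormalized so they still sum to 1.
    """
    available = {name: s for name, s in scores.items() if s is not None}
    total = sum(weights[name] for name in available)
    if total <= 0:
        raise ValueError("no scored components with positive weight")
    return sum(weights[name] * s / total for name, s in available.items())
```

With hypothetical weights of 0.4 (judge), 0.3 (distributional), and 0.3 (other), passing `None` for the distributional score makes the judge and the other component share the full weight in a 4:3 ratio, which mirrors the "weight flows back" behavior described above.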

@mahmoud 2 months ago