Python: Benchmark combined score weights don't correlate with discriminative power for voice fidelity evaluation

The benchmark's combined score weights were assigned without empirical calibration. Quantitative metrics (sentence length, vocabulary overlap, em-dash rate) received 30% of the weight but separated models by only 0.034 across versions. Stylometric scoring (Burrows' Delta authorship attribution) received only 10% despite separating models by 0.268. That is roughly 8 times the separation at one third of the weight. As a result, combined scores could not reliably distinguish model quality.
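To make the mismatch concrete, here is a small sketch that uses only the four figures quoted above. The dictionary layout and the `weighted_spread` helper are illustrative assumptions, not part of the original benchmark code.

```python
# Weights and separations as quoted in the problem statement.
# The data layout and helper name are assumptions made for illustration.
components = {
    "quantitative": {"weight": 0.30, "separation": 0.034},
    "stylometric": {"weight": 0.10, "separation": 0.268},
}


def weighted_spread(component):
    """How far this component can move the combined score across versions."""
    return component["weight"] * component["separation"]


quant = components["quantitative"]
stylo = components["stylometric"]

# How much more the stylometric score separates models.
separation_ratio = stylo["separation"] / quant["separation"]
# How much more weight the quantitative score received.
weight_ratio = quant["weight"] / stylo["weight"]

for name, component in components.items():
    print(
        f"{name}: weight={component['weight']:.2f} "
        f"separation={component['separation']:.3f} "
        f"weighted_spread={weighted_spread(component):.4f}"
    )
print(f"separation ratio (stylometric / quantitative): {separation_ratio:.1f}")
print(f"weight ratio (quantitative / stylometric): {weight_ratio:.1f}")
```

The weighted spread is the number that matters for the combined score: a component with a large weight and almost no separation mostly shifts every model's score by a similar amount.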

1 solution

✓ ACCEPTED

Calibrate component weights empirically by measuring three things: per-component separation (the range of scores across model versions), cross-component correlation (redundancy), and alignment with ground truth (here, an LLM judge). In our case the recalibrated weights were:

Judge: 0.45. Ground truth proxy with consistent rankings.
Stylometric: 0.25. Highest separation and forensic-grade, but its 0.45 correlation with the judge makes it partially redundant, so it sits below the judge.
Distributional: 0.20. Nearly uncorrelated with the other components, so it contributes independent information.
Quantitative: 0.10. Demoted to a hygiene check, with near-zero discrimination between decent models.

This improved combined score separation by 1.5x. Key principle: a component's weight should reflect its discriminative power and independence, not its intuitive importance.
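The measurements behind this calibration can be sketched as follows. This is a minimal illustration rather than the original benchmark code: the helper names are hypothetical, and the per-version scores are placeholder values invented to make the example run. Only the four weights come from the answer above.

```python
from math import sqrt

# Weights reported in the answer above (they sum to 1.0).
WEIGHTS = {
    "judge": 0.45,
    "stylometric": 0.25,
    "distributional": 0.20,
    "quantitative": 0.10,
}


def separation(values):
    """Range of scores across model versions."""
    return max(values) - min(values)


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def combined(model_scores, weights):
    """Weighted sum of one model's component scores."""
    return sum(weights[name] * model_scores[name] for name in weights)


# PLACEHOLDER scores for three hypothetical model versions (not real results).
scores = {
    "v1": {"judge": 0.55, "stylometric": 0.40, "distributional": 0.62, "quantitative": 0.81},
    "v2": {"judge": 0.68, "stylometric": 0.58, "distributional": 0.55, "quantitative": 0.83},
    "v3": {"judge": 0.74, "stylometric": 0.66, "distributional": 0.70, "quantitative": 0.82},
}

# One column of scores per component, ordered by model version.
columns = {name: [scores[v][name] for v in scores] for name in WEIGHTS}

for name in WEIGHTS:
    print(
        f"{name:15s} separation={separation(columns[name]):.3f} "
        f"corr_with_judge={pearson(columns[name], columns['judge']):+.2f}"
    )

totals = {version: combined(s, WEIGHTS) for version, s in scores.items()}
print("combined:", {version: round(t, 3) for version, t in totals.items()})
print("combined separation:", round(separation(list(totals.values())), 3))
```

With only a handful of model versions the correlations are noisy, so they are probably best read as a rough redundancy signal rather than a precise estimate.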

@mahmoud 2 months ago