Pattern for ML benchmark pipelines: embed skip-rate and call-count gates in results, fail-loud on save, refuse to declare winners when gates are degraded. Prevents acting on silently broken scores.
ML benchmark pipelines commonly have stages that can degrade silently: a scoring model returns a default value when its dependency is missing, an LLM judge returns None when the API key is revoked, a stylometric analysis falls back to 0.5 when a library isn't installed. The benchmark still produces a JSON file full of numbers; they just don't mean anything.
The failure mode: you compare two model checkpoints, pick a winner based on these degraded scores, deploy it to production, and only discover the scoring was broken when users complain about quality regression.
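To make the silent fallback concrete, here is a minimal sketch of a stage that still returns the 0.5 default but flags that it did so, so the skip can be counted afterwards. The function name `stylometric_stage`, the `analyzer` parameter, and the `skipped` key are illustrative assumptions, not names from the actual pipeline.

```python
def stylometric_stage(text, analyzer=None):
    """Hypothetical stage: score one sample and report whether it really ran."""
    if analyzer is None:
        # Dependency unavailable: return the fallback value, but flag it
        # instead of letting 0.5 pass as a genuine score.
        return {"score": 0.5, "skipped": True}
    return {"score": analyzer(text), "skipped": False}


samples = ["first sample", "second sample"]
details = [stylometric_stage(sample) for sample in samples]

# Aggregating the flags shows how much of the run was degraded.
skipped_count = sum(1 for d in details if d["skipped"])
# An empty run is treated as fully skipped rather than dividing by zero.
styl_skip_rate = skipped_count / len(samples) if samples else 1.0
```

Without the `skipped` flag, both samples would simply report 0.5 and nothing downstream could tell the stage never ran.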
Embed gate metrics directly into the benchmark result dataclass:

    from dataclasses import dataclass, field

    @dataclass
    class BenchmarkResult:
        # ... normal score fields ...
        quality_gates: dict = field(default_factory=dict)

Populate them in the summary computation from per-sample stage details:

    quality_gates = {
        'styl_skip_rate': skipped_count / sample_count,
        'judge_call_count': len(judge_scores),
        'judge_skip_rate': 1.0 - len(judge_scores) / sample_count,
        'format_pass_rate': stage_pass_rates['format_gate'],
    }

Then enforce at two points:
save_benchmark(strict=True) — raises RuntimeError if any gate is violated. Prevents silently saving a degraded result to disk.
compare_benchmarks() — checks all candidates' gates. If any is degraded, returns winner=None with a warning. Prevents promotion decisions based on bad data.
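The two enforcement points could look roughly like the sketch below. The threshold values, the helper `gate_violations`, the `name` and `score` fields, and the return shape of `compare_benchmarks` are all assumptions made for illustration; the original names only the two functions and their behavior.

```python
import json
import warnings
from dataclasses import asdict, dataclass, field
from pathlib import Path

# Hypothetical thresholds; the real limits are not given in the text.
MAX_ALLOWED = {"styl_skip_rate": 0.05, "judge_skip_rate": 0.05}
MIN_REQUIRED = {"judge_call_count": 1, "format_pass_rate": 0.90}


@dataclass
class BenchmarkResult:
    name: str
    score: float  # stand-in for the normal score fields
    quality_gates: dict = field(default_factory=dict)


def gate_violations(result):
    """Return one message per violated gate; a missing gate counts as violated."""
    gates = result.quality_gates
    problems = []
    for key, limit in MAX_ALLOWED.items():
        value = gates.get(key)
        if value is None or value > limit:
            problems.append(f"{key}={value} (max {limit})")
    for key, limit in MIN_REQUIRED.items():
        value = gates.get(key)
        if value is None or value < limit:
            problems.append(f"{key}={value} (min {limit})")
    return problems


def save_benchmark(result, path, strict=True):
    """Write the result as JSON; in strict mode, refuse degraded results."""
    problems = gate_violations(result)
    if problems and strict:
        raise RuntimeError(f"refusing to save {result.name}: " + "; ".join(problems))
    Path(path).write_text(json.dumps(asdict(result), indent=2))


def compare_benchmarks(candidates):
    """Pick the highest-scoring candidate, or no winner if any gate is degraded."""
    degraded = {}
    for candidate in candidates:
        problems = gate_violations(candidate)
        if problems:
            degraded[candidate.name] = problems
    if degraded:
        warnings.warn(f"no winner declared; degraded gates: {degraded}")
        return {"winner": None, "degraded": degraded}
    best = max(candidates, key=lambda c: c.score)
    return {"winner": best.name, "degraded": {}}
```

Note that in this sketch a single degraded candidate blocks the whole comparison, even if it would have lost anyway, which matches the "if any is degraded" rule described above.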
The gates are cheap to compute (they aggregate data already collected during scoring) but prevent the most expensive failure mode: acting on misleading benchmark numbers. A --no-strict escape hatch exists for research-only sweeps where you know a stage is intentionally disabled.
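One way the `--no-strict` escape hatch might be wired up is shown below. The text names the flag but not how it is parsed, so the use of `argparse` and the `build_parser` helper are assumptions.

```python
import argparse


def build_parser():
    """Hypothetical CLI for a benchmark sweep; only the strictness flag is shown."""
    parser = argparse.ArgumentParser(description="Run a benchmark sweep (sketch).")
    parser.add_argument(
        "--no-strict",
        dest="strict",
        action="store_false",  # args.strict stays True unless the flag is passed
        help="save results even when a quality gate is violated (research-only sweeps)",
    )
    return parser


default_args = build_parser().parse_args([])
research_args = build_parser().parse_args(["--no-strict"])
```

Making strict the default means a degraded result can only be saved by an explicit, visible opt-out on the command line.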
A separate preflight function (~30s, no GPU, ~$0) exercises the full scoring pipeline on a canned input before any expensive sweep begins. Catches missing deps, expired API keys, profile/corpus misalignment, and base-model config drift — all without burning GPU time.
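A preflight check along these lines could be sketched as follows. The signature, the `skipped` convention in the stage details, and the idea of passing the scoring function in as `score_fn` are assumptions; the profile/corpus alignment and base-model config checks mentioned above are not modeled here.

```python
import importlib.util
import os


def preflight(required_modules, required_env, score_fn, canned_input):
    """Return a list of problems; an empty list means the sweep may start."""
    problems = []

    # Missing dependencies: use top-level module names with find_spec.
    for module in required_modules:
        if importlib.util.find_spec(module) is None:
            problems.append(f"missing dependency: {module}")

    # An unset key is caught here; an expired key only shows up when the
    # canned input is actually scored below.
    for var in required_env:
        if not os.environ.get(var):
            problems.append(f"environment variable not set: {var}")

    if problems:
        return problems

    # Exercise the full scoring path once on a canned input.
    try:
        details = score_fn(canned_input)
    except Exception as exc:
        problems.append(f"scoring failed on canned input: {exc!r}")
        return problems

    # score_fn is assumed to return {stage_name: {"skipped": bool, ...}}.
    skipped = sorted(stage for stage, info in details.items() if info.get("skipped"))
    if skipped:
        problems.append(f"stages skipped on canned input: {skipped}")
    return problems
```

A sweep script would call this first and exit before allocating any GPU if the returned list is non-empty.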