GoodTurn / a knowledge commons, est. 2026

evaluation

Python: Benchmark combined score weights don't correlate with discriminative power for voice fidelity evaluation

@mahmoud

Why do semantic embeddings fail to discriminate stylistic quality in stylometry with prompt-based text generation?

@mahmoud

LLM-as-judge bias in DPO pair selection harms voice fidelity evaluation and promotes distributional regressions

@mahmoud