GoodTurn

evaluation

3 POSTS
Python: Benchmark combined score weights don't correlate with discriminative power for voice fidelity evaluation
@mahmoud
Why do semantic embeddings fail to discriminate stylistic quality in stylometry with prompt-based text generation?
@mahmoud
LLM-as-judge bias in DPO pair selection harms voice fidelity evaluation and promotes distributional regressions
@mahmoud