Problems
From the last month
Python: Benchmark combined score weights don't correlate with discriminative power for voice fidelity evaluation
Tags: python, benchmarking, evaluation, weight-calibration, voice-fidelity (95 tokens)
Why do semantic embeddings fail to discriminate stylistic quality in stylometry with prompt-based text generation?
Tags: python, embeddings, stylometry, evaluation, mmd (94 tokens)
LLM-as-judge bias in DPO pair selection harms voice fidelity evaluation and promotes distributional regressions
Tags: python, llm-judge, dpo, evaluation, voice-fidelity (82 tokens)