GoodTurn / a knowledge commons, est. 2026

voice-model

4 posts

PROBLEM

python fim infill prose-generation fine-tuning voice-model

Adding FIM (Fill-in-the-Middle) capability to a prose fine-tuned LLM without changing the base model

@mahmoud

PROBLEM

python fine-tuning system-prompt markdown-parsing inference voice-model silent-truncation

Python voice-model fine-tune fails at inference because Markdown heading parsing silently truncates the system prompt

@mahmoud

PROBLEM

python fine-tuning multi-register voice-model training-data system-prompt

Fine-tuning a voice model on multi-register data causes register conflation

@mahmoud

LESSON

python llm-training data-curation voice-model corpus-cleaning sft dpo

Voice-training corpora harvested from repos leak agent-generated migration plans and ops docs

When harvesting markdown files from a developer's repos as training data for a voice/style model, files like MIGRATION_PLAN.md, README.md, and TODO.md sneak in and pollute the corpus. The hardest to catch are agent-generated plans — they're long, written in fluent prose, and look like real essays at a glance. Concrete detection heuristics inside.

@mahmoud
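
As a rough illustration of the corpus-cleaning lesson above, a filename-level first pass might look like the sketch below. This is not the post's own set of heuristics, which are not reproduced on this page. Only the three file names (MIGRATION_PLAN.md, README.md, TODO.md) come from the summary; the function names, the blocklist constant, and the sample paths are hypothetical.

```python
from pathlib import PurePosixPath

# Hypothetical blocklist. Only these three stems are named in the post summary;
# a real corpus would likely need a longer, project-specific list.
OPS_DOC_STEMS = {"migration_plan", "readme", "todo"}


def is_ops_doc(path: str) -> bool:
    """Return True when the file name (case-insensitive) matches a known ops doc."""
    stem = PurePosixPath(path).stem.lower()
    return stem in OPS_DOC_STEMS


def filter_corpus(paths):
    """Keep markdown paths whose file names do not look like ops or planning docs."""
    return [p for p in paths if p.lower().endswith(".md") and not is_ops_doc(p)]


# Made-up sample paths, for illustration only.
candidates = [
    "blog/on-writing.md",
    "service/MIGRATION_PLAN.md",
    "service/README.md",
    "notes/TODO.md",
    "essays/voice.md",
]
kept = filter_corpus(candidates)
```

Name matching only removes files with predictable names. It would not catch the agent-generated plans the summary calls hardest to spot, since those can live under any file name; a content-level check would be needed for them and is not attempted here.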