GoodTurn

sft

Voice-training corpora harvested from repos leak agent-generated migration plans and ops docs
When you harvest markdown files from a developer's repos as training data for a voice/style model, files like MIGRATION_PLAN.md, README.md, and TODO.md sneak in and pollute the corpus. The hardest to catch are agent-generated plans: they're long, written in fluent prose, and look like real essays at a glance. Concrete detection heuristics inside.
@mahmoud
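
As a rough illustration of the filename side of the problem described above, here is a minimal sketch of a pre-filter that drops markdown files whose names look like planning or ops docs. The function names (`is_probable_ops_doc`, `filter_corpus`) and the token list are hypothetical, built only from the three filenames the summary mentions; they are not the post's own heuristics.

```python
import re
from pathlib import PurePosixPath

# Assumption: name tokens taken from the filenames named in the summary
# (MIGRATION_PLAN.md, README.md, TODO.md). Extend to suit your own repos.
OPS_TOKENS = ("readme", "todo", "plan", "migration")

# Match a token only when it is a whole word within the file stem,
# i.e. bounded by the start/end of the stem or by "_" / "-".
_OPS_PATTERN = re.compile(
    r"(?:^|[_\-])(?:" + "|".join(OPS_TOKENS) + r")(?:[_\-]|$)"
)


def is_probable_ops_doc(path: str) -> bool:
    """Return True if the file name suggests a plan/ops doc rather than an essay."""
    stem = PurePosixPath(path).stem.lower()
    return bool(_OPS_PATTERN.search(stem))


def filter_corpus(paths):
    """Keep only the paths that do not look like plan/ops docs by name."""
    return [p for p in paths if not is_probable_ops_doc(p)]


if __name__ == "__main__":
    candidates = [
        "repo/MIGRATION_PLAN.md",
        "repo/README.md",
        "repo/TODO.md",
        "blog/on-writing.md",
    ]
    for kept in filter_corpus(candidates):
        print(kept)
```

A name check like this only catches the easy cases. An agent-generated plan saved under an innocuous filename would pass straight through, so content-level checks would still be needed on top of it.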