When harvesting markdown files from a developer's repos as training data for a voice/style model, files like MIGRATION_PLAN.md, README.md, and TODO.md sneak in and pollute the corpus. The hardest to catch are agent-generated plans — they're long, written in fluent prose, and look like real essays at a glance. Concrete detection heuristics inside.
Context: training an author-voice LLM (Gemma 4 31B + LoRA, SFT → DPO → SDPO pipeline) for an open-source maintainer. The corpus is harvested from his blog plus a hand-enumerated list of markdown files from his GitHub repos (MIGRATION_PLAN.md, README.md, TODO.md, design.md, etc.).
While I was spot-checking the SDPO training data, one essay (39kch) titled "Wikimon: Single-Process Refactor and Parallel Deployment Plan" looked off. The author confirmed: "I don't know how this ended up in the corpus, i think it's a plan generated by an agent, not by me."
Auditing the rest of the corpus turned up:
- TODO.md checklists
- ACTIVE_PRS.md
- sedimental/pages (28 files): landing-page link descriptions, not essays. Every entry was a 200-500ch meta-paragraph about an external talk or post.

Roughly 20% of the 471-triple SDPO corpus was non-essay content the voice model would learn the wrong things from.
Naive markdown harvesters apply length + extension filters. That catches empty files and .pngs. It does not catch:
- Files that git blame attributes to the dev but that are not in the dev's voice.
- Stub entries under /pages/ or /projects/ directories, where each "post" is really a redirect description. They look like blog posts to a harvester.

```python
import re

# Matches list-like lines: "-", "*", "+", "1.", or a bare "[ ]" / "[x]" checkbox.
_BULLET_RX = re.compile(r'^\s*([-*+]|\d+\.|\[[ x]\])\s')


def drop_reason(record: dict) -> str | None:
    """Return a short reason to drop the record, or None to keep it."""
    src = record.get('source', '')
    fb = record.get('feedback', '') or ''  # the document text
    title = record.get('essay_title', '') or ''

    # Source-level: known stub directories or genres
    if src in ('sedimental/pages', 'twitter'):
        return 'source_stub'

    # Length floor: anything under 500ch is too short to teach voice
    if len(fb) < 500:
        return 'too_short'

    # Bullet ratio: lists are not prose
    lines = [line for line in fb.splitlines() if line.strip()]
    if lines:
        bullets = sum(1 for line in lines if _BULLET_RX.match(line))
        if bullets / len(lines) > 0.5:
            return 'list_content'

    # Title keywords
    if re.search(r'\b(TODO|Checklist)\b', title, re.I):
        return 'todo_title'

    # Agent/ops detector: multiple signals required (single-signal trips
    # false-positive on real essays that happen to mention "Phase 1")
    signals = 0
    if 'TO BE DONE MANUALLY' in fb:
        signals += 2
    if re.search(r'\bRollback\s+Plan\b', fb):
        signals += 1
    if re.search(r'^\s*Phase \d+:', fb, re.MULTILINE):
        signals += 1
    if fb.count('supervisorctl') >= 3 or fb.count('systemctl') >= 3:
        signals += 1
    if fb.count('|') > 200:  # heavy markdown tables
        signals += 1
    if fb.count('```') > 20:  # heavy code-fence density
        signals += 1
    if signals >= 2:
        return 'agent_genned'

    return None
```

Calibration on the real corpus:
- bullet_ratio > 0.5 cleanly separated three known TODO files (ratios 0.52 / 0.73 / 0.80) from the closest real essay, Architecture (0.44).
- The agent_genned ≥2-signal threshold caught both migration plans without false-positiving on long real essays (the longest real essay, Design, at 24kch and the Wikimon plan at 39kch were the differential test).
- The len < 500 floor caught page stubs without dropping legitimate micro-posts (real short blog posts came in at 511-924ch).

Audit the input list of a harvest pipeline by hand once. The harvest filenames (MIGRATION_PLAN.md, ACTIVE_PRS.md) often telegraph the contamination. We were blind to two clearly bad entries in a 19-file enumerated list because nobody re-read the list after adding entries over months.
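The one-time read-through of the input list can be partly mechanized. The following is a minimal sketch, assuming the enumerated list is available as plain path strings. The pattern list, the function name `flag_suspect_filenames`, and the sample paths are illustrative and not part of the original pipeline. A hit only means "read this file before keeping it".

```python
import re
from pathlib import PurePosixPath

# Hypothetical name patterns for files that are rarely essays; tune per repo.
_SUSPECT_NAME_RX = re.compile(r'(PLAN|TODO|CHECKLIST|ACTIVE_PRS)', re.IGNORECASE)


def flag_suspect_filenames(paths: list[str]) -> list[str]:
    """Return the paths whose file name (without extension) looks non-essay."""
    flagged = []
    for path in paths:
        stem = PurePosixPath(path).stem  # 'repo_b/TODO.md' -> 'TODO'
        if _SUSPECT_NAME_RX.search(stem):
            flagged.append(path)
    return flagged


# Made-up enumeration list; the repo names are placeholders.
enumerated = [
    'repo_a/MIGRATION_PLAN.md',
    'repo_a/ACTIVE_PRS.md',
    'repo_a/docs/design.md',
    'repo_b/TODO.md',
]
for path in flag_suspect_filenames(enumerated):
    print('review by hand:', path)
```

A name check like this is cheap enough to run every time the list changes, which covers the case where entries accumulate over months without a re-read.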
A single heuristic signal is too noisy; require ≥2 for "agent-generated" classification. Real essays often have one of these patterns (a Phase header, a couple of supervisorctl mentions). Compounded signals are what distinguish a plan document from an essay that mentions infrastructure.
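When calibrating the threshold, it helps to see which signals fired rather than only the total. The sketch below mirrors the weights used in drop_reason. The helper name `agent_signals`, the signal labels, and the two sample texts are made up for illustration.

```python
import re


def agent_signals(text: str) -> dict[str, int]:
    """Return the weight of each agent/ops signal that fired in `text`."""
    fence = '`' * 3  # a markdown code fence, built this way to keep the example readable
    checks = {
        'manual_marker': 2 if 'TO BE DONE MANUALLY' in text else 0,
        'rollback_plan': 1 if re.search(r'\bRollback\s+Plan\b', text) else 0,
        'phase_header': 1 if re.search(r'^\s*Phase \d+:', text, re.MULTILINE) else 0,
        'service_cmds': 1 if (text.count('supervisorctl') >= 3
                              or text.count('systemctl') >= 3) else 0,
        'heavy_tables': 1 if text.count('|') > 200 else 0,
        'heavy_fences': 1 if text.count(fence) > 20 else 0,
    }
    # Keep only the signals that actually fired.
    return {name: weight for name, weight in checks.items() if weight}


# Made-up samples: an essay that mentions a phase, and a plan-like document.
essay = "I split the rewrite in two.\nPhase 1: delete code.\nThen I shipped it."
plan = "Phase 1: Prepare hosts\nRun the checks.\n\nRollback Plan\nRestore the old unit."

for label, text in (('essay', essay), ('plan', plan)):
    fired = agent_signals(text)
    verdict = 'agent_genned' if sum(fired.values()) >= 2 else 'keep'
    print(label, fired, verdict)
```

The essay sample trips only the phase-header signal and is kept, while the plan sample trips two signals and is dropped, which is the separation the ≥2 rule is meant to give.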
Filter at multiple layers (defense in depth):
When auditing, check the corpus file the trainer sees, not the upstream pipeline. The bad entries were obvious in the harvested JSONL but invisible at the source-config level until I traced one bad essay back through its source field to the enumeration list.
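Tracing bad records back through the source field is easier with a per-source summary of the harvested file. This is a sketch, assuming one JSON object per line with the source, essay_title, and feedback fields that drop_reason reads. The function name `summarize_by_source` and the sample records are hypothetical.

```python
import io
import json
import statistics
from collections import defaultdict
from typing import Iterable


def summarize_by_source(lines: Iterable[str]) -> dict[str, dict]:
    """Group JSONL records by `source`: count, median text length, sample titles."""
    lengths = defaultdict(list)
    titles = defaultdict(list)
    for raw in lines:
        if not raw.strip():
            continue  # tolerate blank lines
        record = json.loads(raw)
        src = record.get('source', '') or '(missing)'
        lengths[src].append(len(record.get('feedback', '') or ''))
        titles[src].append(record.get('essay_title', '') or '')
    return {
        src: {
            'count': len(lens),
            'median_len': statistics.median(lens),
            'sample_titles': titles[src][:3],
        }
        for src, lens in lengths.items()
    }


# Made-up records standing in for the harvested JSONL file.
records = [
    {'source': 'blog', 'essay_title': 'Essay A', 'feedback': 'x' * 900},
    {'source': 'sedimental/pages', 'essay_title': 'Talk link', 'feedback': 'y' * 300},
    {'source': 'sedimental/pages', 'essay_title': 'Post link', 'feedback': 'y' * 400},
]
sample = io.StringIO('\n'.join(json.dumps(r) for r in records))
summary = summarize_by_source(sample)
for src, stats in summary.items():
    print(src, stats)
```

On a real run the function would be given an open file handle for the JSONL the trainer reads. A source whose median length sits in the 200-500ch range stands out immediately in this view.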
For voice/style models specifically, "is this prose the author wrote" is a stricter filter than "did this come from the author's repo". Coding-agent-generated docs in a human's repo are a growing contamination vector for any LLM training pipeline that uses GitHub as a source.