Voice-training corpora harvested from repos leak agent-generated migration plans and ops docs

Context

This came up while training an author-voice LLM (Gemma 4 31B + LoRA, SFT → DPO → SDPO pipeline) for an open-source maintainer. The corpus is harvested from his blog plus a hand-enumerated list of markdown files from his GitHub repos (MIGRATION_PLAN.md, README.md, TODO.md, design.md, etc.).

Symptom

While spot-checking the SDPO training data, one essay (39kch, i.e. about 39,000 characters) titled "Wikimon: Single-Process Refactor and Parallel Deployment Plan" looked off. The author confirmed: "I don't know how this ended up in the corpus, i think it's a plan generated by an agent, not by me."

Auditing the rest of the corpus turned up:

- 2 of 19 hatnote markdown files were full agent-generated migration plans (39kch + 16kch)
- 3 of 19 were TODO.md checklists
- 1 was a machine-aggregated PR report (ACTIVE_PRS.md)
- An entire source (sedimental/pages, 28 files) was landing-page link descriptions, not essays; every entry was a 200-500ch meta-paragraph about an external talk or post
- 2 entries were tweets

Roughly 20% of the 471-triple SDPO corpus was non-essay content that would teach the voice model the wrong things.
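A first-pass audit of this kind can be sketched as a per-source summary of the JSONL the trainer reads. This is a minimal sketch, not the script used here. It assumes each record carries a `source` field and keeps the document text under `feedback` (the layout the filter later in this note relies on); `load_jsonl` and `summarize_by_source` are hypothetical helper names.

```python
import json
from collections import defaultdict
from statistics import median


def load_jsonl(path):
    """Read one JSON object per non-empty line."""
    with open(path, encoding='utf-8') as fh:
        return [json.loads(line) for line in fh if line.strip()]


def summarize_by_source(records):
    """Group records by `source`; report count and text-length stats per group."""
    lengths_by_source = defaultdict(list)
    for rec in records:
        text = rec.get('feedback', '') or ''          # assumed text field
        src = rec.get('source', '') or '(missing)'    # assumed source field
        lengths_by_source[src].append(len(text))

    summary = {}
    for src, lengths in lengths_by_source.items():
        summary[src] = {
            'count': len(lengths),
            'min_len': min(lengths),
            'median_len': median(lengths),
            'max_len': max(lengths),
        }
    return summary
```

A source where every entry falls in a narrow, short band (200-500ch for sedimental/pages) stands out in this view before any single document has been read.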

Why it's non-obvious

Naive markdown harvesters filter on length and file extension. That catches empty files and .png files. It does not catch:

- Agent-generated docs in human-developer repos. Increasingly, devs let coding agents write migration plans, design docs, and post-mortems and commit them. Those files have the dev's name on them (via git blame) but not the dev's voice.
- Documentation prose vs essay prose. READMEs are real human writing, but they don't carry essay-voice patterns. Training on them teaches "explain a project" rather than "make an argument about a topic".
- Landing pages / link-out stubs. Personal sites often have /pages/ or /projects/ directories where each "post" is really a description of a link to somewhere else. They look like blog posts to a harvester.
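To make the blind spot concrete, here is a minimal sketch of a length + extension filter. The extension list and the 500-character floor are assumed values for illustration, not the harvester's real configuration, and `naive_keep` is a hypothetical name.

```python
from pathlib import Path

# Hypothetical settings for a typical naive harvester.
TEXT_EXTENSIONS = {'.md', '.markdown', '.txt'}
MIN_CHARS = 500


def naive_keep(filename: str, text: str) -> bool:
    """Keep a file if it has a text-like extension and is long enough."""
    has_text_ext = Path(filename).suffix.lower() in TEXT_EXTENSIONS
    return has_text_ext and len(text) >= MIN_CHARS


# Placeholder strings standing in for real documents.
essay = 'An argument about a topic. ' * 40
agent_plan = 'Phase 1: stop the old service.\n' * 40

print(naive_keep('logo.png', ''))                    # rejected: wrong extension
print(naive_keep('empty.md', ''))                    # rejected: too short
print(naive_keep('post.md', essay))                  # kept
print(naive_keep('MIGRATION_PLAN.md', agent_plan))   # also kept: nothing inspects content
```

Both the essay and the plan pass, because neither check looks at what the text says or who wrote it.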

Detection heuristics that worked

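# Keeps the `str | None` annotation from being evaluated, so this also runs on Python 3.7-3.9
from __future__ import annotations

import re

# Line starts with a list marker: -, *, +, "1.", or a bare [ ] / [x] checkbox
_BULLET_RX = re.compile(r'^\s*([-*+]|\d+\.|\[[ x]\])\s')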

def drop_reason(record: dict) -> str | None:
    src = record.get('source', '')
    fb = record.get('feedback', '') or ''       # the document text
    title = record.get('essay_title', '') or ''

    # Source-level: known stub directories or genres
    if src in ('sedimental/pages', 'twitter'):
        return 'source_stub'

    # Length floor: anything under 500ch is too short to teach voice
    if len(fb) < 500:
        return 'too_short'

    # Bullet ratio: lists are not prose
    lines = [line for line in fb.splitlines() if line.strip()]
    if lines:
        bullets = sum(1 for line in lines if _BULLET_RX.match(line))
        if bullets / len(lines) > 0.5:
            return 'list_content'

    # Title keywords
    if re.search(r'\b(TODO|Checklist)\b', title, re.I):
        return 'todo_title'

    # Agent/ops detector: weighted signals, total score of 2+ required.
    # A single weak signal false-positives on real essays that happen to
    # mention "Phase 1"; only the literal marker below counts double.
    signals = 0
    if 'TO BE DONE MANUALLY' in fb:
        signals += 2
    if re.search(r'\bRollback\s+Plan\b', fb):
        signals += 1
    if re.search(r'^\s*Phase \d+:', fb, re.MULTILINE):
        signals += 1
    if fb.count('supervisorctl') >= 3 or fb.count('systemctl') >= 3:
        signals += 1
    if fb.count('|') > 200:    # heavy markdown tables
        signals += 1
    if fb.count('```') > 20:   # heavy code-fence density
        signals += 1
    if signals >= 2:
        return 'agent_genned'

    return None

Calibration on the real corpus:

- bullet_ratio > 0.5 cleanly separated the three known TODO files (ratios 0.52 / 0.73 / 0.80) from the closest real essay, Architecture (0.44).
- The agent_genned threshold (score ≥2) caught both migration plans without flagging long real essays. The differential test was the longest real essay, Design (24kch), against the Wikimon plan (39kch).
- The len < 500 floor caught page stubs without dropping legitimate micro-posts (real short blog posts came in at 511-924ch).
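The bullet-ratio calibration can be reproduced with a small helper that measures how much room a threshold leaves on each side. This is a sketch under assumptions: `bullet_ratio` mirrors the regex and line handling in the filter above, while `separation_margin` is a hypothetical helper added for illustration. The ratios fed to it are the ones reported in this note.

```python
import re

# Same pattern as the filter: line starts with a list marker or bare checkbox.
BULLET_RX = re.compile(r'^\s*([-*+]|\d+\.|\[[ x]\])\s')


def bullet_ratio(text: str) -> float:
    """Fraction of non-blank lines that start with a list marker."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return 0.0
    return sum(1 for line in lines if BULLET_RX.match(line)) / len(lines)


def separation_margin(list_ratios, essay_ratios, threshold):
    """Return (lowest list ratio - threshold, threshold - highest essay ratio).

    Both values positive means the threshold separates the two groups.
    """
    return min(list_ratios) - threshold, threshold - max(essay_ratios)


todo_ratios = [0.52, 0.73, 0.80]   # known TODO files, from the calibration above
essay_ratios = [0.44]              # closest real essay (Architecture)
list_side, essay_side = separation_margin(todo_ratios, essay_ratios, 0.5)
print(round(list_side, 2), round(essay_side, 2))
```

On these numbers the margins are about 0.02 on the TODO side and 0.06 on the essay side. The separation is clean, but thin enough that the threshold is worth re-checking whenever new sources are added.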

Lessons

- Audit the input list of a harvest pipeline by hand at least once. The harvest filenames (MIGRATION_PLAN.md, ACTIVE_PRS.md) often telegraph the contamination. We missed two clearly bad entries in a 19-file enumerated list because nobody re-read the list as entries were added over several months.
- A single heuristic signal is too noisy; require ≥2 for "agent-generated" classification. Real essays often show one of these patterns (a Phase header, a couple of supervisorctl mentions). Compounded signals are what distinguish a plan document from an essay that mentions infrastructure.
- Filter at multiple layers (defense in depth):
  - Generation-time: skip Modal inference calls on records that would be filtered anyway (saved ~15% of compute on the run that surfaced this).
  - Training-time: filter in the dataset loader so even unfiltered files can't poison training.
  - Source-time (upstream): remove the bad entries from the harvest enumeration list for future regenerations.
- When auditing, check the corpus file the trainer sees, not the upstream pipeline config. The bad entries were obvious in the harvested JSONL but invisible at the source-config level until I traced one bad essay back through its source field to the enumeration list.
- For voice/style models specifically, "is this prose the author wrote" is a stricter filter than "did this come from the author's repo". Coding-agent-generated docs in a human's repo are a growing contamination vector for any LLM training pipeline that uses GitHub as a source.
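The first lesson, re-reading the enumeration list, can be partly mechanized by flagging filenames that look like plans, checklists, or reports. The sketch below is hypothetical: the keyword list is seeded from the filenames that turned out bad in this corpus, plus REPORT and CHECKLIST as guesses. It is meant to queue entries for a human look, not to drop anything automatically.

```python
import re
from pathlib import PurePosixPath

# Hypothetical keyword list; extend it as new bad filenames turn up.
SUSPECT_NAME_RX = re.compile(r'(PLAN|TODO|CHECKLIST|ACTIVE_PRS|REPORT)', re.I)


def flag_suspect_paths(paths):
    """Return the paths whose filename stem matches a suspect keyword."""
    return [p for p in paths if SUSPECT_NAME_RX.search(PurePosixPath(p).stem)]


enumerated = ['MIGRATION_PLAN.md', 'README.md', 'TODO.md', 'design.md', 'ACTIVE_PRS.md']
for path in flag_suspect_paths(enumerated):
    print('review by hand:', path)
```

A match is only a prompt to open the file. A real essay about planning would match PLAN too, which is why this belongs at the review step and not in the automatic filter.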