Python: fine-tuned voice model underperforms at inference because markdown heading parsing silently truncates the system prompt

A voice model was fine-tuned with the full system prompt (3,019 characters, including anti-pattern constraints, voice mechanics, and argument structure), but inference and benchmarks ran with a truncated 184-character stub. The profile parser splits the markdown on ^##\s+ headings, so the ## Inference System Prompt section ended at the next peer-level heading (## Argument Structure, followed by ## Voice Mechanics and ## Anti-patterns), and only its opening paragraph was captured. Nothing raised an error. The training data was built from a separate config constant that held the full prompt, so training and inference disagreed: the model learned constraints it was never given at inference time and produced 'AI-sounding' output even though the training mechanics were correct.

Changing those headings from ## to ### (making them subsections of the prompt) raised LLM judge scores by 0.19 to 0.25. Combined scores went from 0.46 to 0.55 and from 0.44 to 0.57.
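The truncation is easy to reproduce in isolation. The sketch below is a minimal stand-in for the profile parser, not the original code: the function name parse_sections, the sample document, and its prompt text are all hypothetical, and only the ^##\s+ split pattern and the heading names come from the description above.

```python
import re

# Hypothetical profile document; the real prompt text is not shown in the post.
PROFILE_MD = """\
# Voice Profile

## Inference System Prompt
You write in a direct, plain voice.

## Argument Structure
Lead with the claim, then the evidence.

## Voice Mechanics
Short sentences. Concrete nouns.

## Anti-patterns
No hedging filler. No listicles.
"""


def parse_sections(markdown: str) -> dict[str, str]:
    """Split a markdown document into {heading: body} on level-2 headings."""
    parts = re.split(r"^##\s+", markdown, flags=re.MULTILINE)
    sections = {}
    for part in parts[1:]:  # parts[0] is the text before the first ## heading
        title, _, body = part.partition("\n")
        sections[title.strip()] = body.strip()
    return sections


# With peer-level ## headings, the prompt section stops at the next heading.
broken = parse_sections(PROFILE_MD)["Inference System Prompt"]

# Demoting the three headings to ### keeps them inside the prompt section,
# because "###" does not match the pattern "##" followed by whitespace.
fixed_md = PROFILE_MD
for name in ("Argument Structure", "Voice Mechanics", "Anti-patterns"):
    fixed_md = fixed_md.replace(f"## {name}", f"### {name}")
fixed = parse_sections(fixed_md)["Inference System Prompt"]

print(len(broken), len(fixed))
```

In this sketch, broken holds only the opening paragraph, while fixed also carries the three subsections. The parser raises no error in either case, which is why the problem went unnoticed.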

1 solution

✓ ACCEPTED

Root cause: the markdown section parser splits on ^##\s+, so headings meant to be nested inside a parent section were treated as peer sections, cutting the parent's content off at the first of them. Fix: demote those subsection headings from ## to ### so they stay inside the parent section.

The broader lesson: when a markdown document serves as both human-readable documentation and machine-parsed config, a heading-level mismatch causes silent data loss. Always test the round trip: parse the document, then check that the extracted field has the expected length and contains the expected key phrases. The pipeline now has validation tests that assert len(inference_system_prompt) > 500 and check for the presence of key constraint phrases.
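A round-trip check along these lines might look like the sketch below. It is an illustration under assumptions rather than the actual pipeline tests: validate_system_prompt, REQUIRED_PHRASES, and the sample prompts are hypothetical names and data, and only the 500-character threshold and the idea of checking for key constraint phrases come from the answer.

```python
# Hypothetical phrases that the full prompt is expected to contain.
REQUIRED_PHRASES = ("Argument Structure", "Voice Mechanics", "Anti-patterns")


def validate_system_prompt(prompt, min_chars=500, required_phrases=REQUIRED_PHRASES):
    """Return a list of problems found in the parsed prompt; empty means OK."""
    problems = []
    if len(prompt) <= min_chars:
        problems.append(f"prompt is {len(prompt)} chars, expected more than {min_chars}")
    lowered = prompt.lower()
    for phrase in required_phrases:
        if phrase.lower() not in lowered:
            problems.append(f"missing key phrase: {phrase!r}")
    return problems


# A truncated stub, like the one the parser produced before the fix.
stub_prompt = "You write in a direct, plain voice."

# A stand-in for the full prompt with its subsections intact.
full_prompt = (
    "You write in a direct, plain voice.\n\n"
    "### Argument Structure\n" + "Lead with the claim. " * 10 + "\n\n"
    "### Voice Mechanics\n" + "Short sentences. " * 10 + "\n\n"
    "### Anti-patterns\n" + "No hedging filler. " * 10
)

for label, prompt in (("stub", stub_prompt), ("full", full_prompt)):
    print(label, validate_system_prompt(prompt))
```

Returning a list of problems rather than raising on the first one makes a failing test report both the short length and every missing phrase at once. In a test suite the call would typically be wrapped as assert validate_system_prompt(parsed_prompt) == [].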

@mahmoud 2 months ago