Adding FIM (Fill-in-the-Middle) capability to a prose fine-tuned LLM without changing the base model

FIM (Fill-in-the-Middle) is exclusively a code-model feature today: no general-purpose prose LLM ships with native FIM. CodeGemma, Codestral, StarCoder2, Qwen2.5-Coder, and DeepSeek-Coder all support it, but they are code-focused and too small or too specialized for prose voice models. How can I add infill capability to a prose fine-tune without switching base models?

1 solution

✓ ACCEPTED

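FIM is a training technique, not an architecture feature. Per the OpenAI 2022 paper (arxiv 2207.14255), any autoregressive model can learn FIM through a data transformation: split each document into [prefix][middle][suffix] and rearrange it into [prefix][suffix][middle], so that the middle is predicted last with both sides in context. Add 3 special tokens to the tokenizer to mark the three spans, and apply the transformation to 50-90% of the examples in the SFT data. The paper reports that models trained this way gain infill capability while preserving left-to-right quality (the 'FIM-for-free' property), and its experiments cover natural language as well as code. One caveat: that result is for FIM applied during pretraining, and the same paper found that adding FIM by fine-tuning an existing model takes considerably more compute to reach comparable infill quality, so budget for that.

Before investing in FIM training, test prompt-based infill first: provide both the prefix and the suffix as context in the user message and instruct the model to write the passage that bridges them. This is often good enough as a zero-shot baseline, and it establishes whether dedicated FIM training is worth the compute.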
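As a rough illustration of both steps, here is a minimal Python sketch. It is written under assumptions rather than taken from the paper: the sentinel strings, the `fim_rate` default, the character-level cut points, and the wording of the baseline prompt are all hypothetical choices for this example.

```python
import random

# Hypothetical sentinel strings; use whatever you actually add to your tokenizer.
FIM_PREFIX = "<|fim_prefix|>"
FIM_SUFFIX = "<|fim_suffix|>"
FIM_MIDDLE = "<|fim_middle|>"


def fim_transform(text: str, rng: random.Random, fim_rate: float = 0.5) -> str:
    """Return `text` unchanged, or rearranged into prefix-suffix-middle order."""
    # Leave a (1 - fim_rate) share of examples in plain left-to-right order.
    if len(text) < 2 or rng.random() >= fim_rate:
        return text
    # Pick two distinct cut points at character level, giving three spans.
    a, b = sorted(rng.sample(range(len(text) + 1), 2))
    prefix, middle, suffix = text[:a], text[a:b], text[b:]
    # The model sees prefix and suffix first, then learns to generate the middle.
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"


def build_infill_prompt(prefix: str, suffix: str) -> str:
    """Zero-shot baseline: ask an instruction-tuned model to bridge two passages."""
    # The instruction wording is an assumption; tune it for your model.
    return (
        "Write the passage that connects the BEFORE text to the AFTER text. "
        "Match the voice and tense, and return only the missing passage.\n\n"
        f"BEFORE:\n{prefix}\n\nAFTER:\n{suffix}"
    )


if __name__ == "__main__":
    rng = random.Random(0)
    sample = "The rain had stopped by noon. She walked to the harbor. The boats were gone."
    print(fim_transform(sample, rng, fim_rate=1.0))
    print(build_infill_prompt("The rain had stopped by noon.", "The boats were gone."))
```

If you train with Hugging Face Transformers (an assumption about your stack), the sentinels would typically be registered with `tokenizer.add_special_tokens({"additional_special_tokens": [...]})`, followed by `model.resize_token_embeddings(len(tokenizer))`. At inference time you send the prefix and suffix with their sentinels, end the prompt with the middle sentinel, and let the model generate the middle.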

@mahmoud 2 months ago