Modal's @modal.concurrent(max_inputs=N) decorator on an @app.cls serving an Unsloth-loaded Gemma 4 model causes a ~60% failure rate under client-side parallel load, even though Modal scales containers correctly. Two distinct error modes occur, depending on which concurrent call reaches the model first:
AttributeError: 'StaticSlidingWindowLayer' object has no attribute 'max_batch_size'. Cause: Gemma 4's static sliding-window attention cache is allocated once per model instance and assumes serial access; a second concurrent generate() trips over half-initialized cache state.
Detected that you are using FX to symbolically trace a dynamo-optimized function. Cause: a torch.compile / TorchDynamo retracing race when two generate() calls re-enter the dynamo-traced forward path simultaneously.
Both fire in the same Python process when @modal.concurrent(max_inputs=5) lets Modal run up to 5 requests at once in one container. The container scale-out behavior is fine; the bug is intra-container.
Successes from the racy batch may also be silently corrupted (KV-cache interleaving doesn't always raise) — don't trust outputs from the broken run.
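To reproduce the failure rate and see which error mode dominates, a small client-side harness can fire calls in parallel and tally exceptions by type. This is a minimal sketch, not the harness used for the numbers above: `call` stands in for whatever invokes the remote method (the method name on Inference is not given here), and `flaky_call` is a hypothetical stub that only makes the example runnable without Modal.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def tally_outcomes(call, prompts, workers=5):
    """Run call(prompt) for every prompt in parallel and count outcomes.

    Returns a Counter keyed by "ok" or by the exception class name.
    """
    def attempt(prompt):
        try:
            call(prompt)
            return "ok"
        except Exception as exc:  # tally every failure mode instead of aborting
            return type(exc).__name__

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return Counter(pool.map(attempt, prompts))


def failure_rate(outcomes):
    """Fraction of calls that raised, given a Counter from tally_outcomes."""
    total = sum(outcomes.values())
    return 0.0 if total == 0 else 1.0 - outcomes["ok"] / total


# Hypothetical stub standing in for the remote call; replace with the real one.
def flaky_call(prompt):
    if "bad" in prompt:
        raise AttributeError("simulated cache error")
    return prompt.upper()


if __name__ == "__main__":
    outcomes = tally_outcomes(flaky_call, ["ok-1", "bad-1", "ok-2", "bad-2"])
    print(dict(outcomes), failure_rate(outcomes))
```

Because successful calls from a racy run may still be corrupted, the "ok" count here is an upper bound on healthy responses; comparing outputs against a serial baseline is the safer check.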
Pin per-container concurrency to 1 so Unsloth + Gemma 4 see only serial generate() calls. Parallelism comes from Modal spawning more containers, not intra-container batching:
@app.cls(image=infer_image, gpu='L40S', volumes={'/models': model_vol}, ...)
@modal.concurrent(max_inputs=1) # was max_inputs=5 — broke under client-side parallel load
class Inference:
    ...
Requires a redeploy (modal deploy your_inference_app.py); the change is local until the new container config is uploaded.
Why this works:
max_inputs=5 was attempting.
This is not a Modal bug or an Unsloth bug; it is an emergent incompatibility. The same pattern can bite Llama / Qwen / Mistral on Unsloth, and any model loaded via transformers with a torch.compile'd forward pass.
The general rule: for model-serving classes on Modal, default to max_inputs=1 unless you have explicitly verified that your model framework supports concurrent forward passes in the same Python process. Transformers/Unsloth do not. vLLM and SGLang do (they have their own batched-decoding schedulers), so with those engines batching is managed internally rather than by concurrent generate() calls.
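If there is a risk that max_inputs gets raised later, one defensive option is to serialize generate() behind a process-wide lock, so extra inputs queue instead of racing. This is a sketch under assumptions, not part of the original fix: it assumes concurrent inputs to a synchronous method run on separate threads in the same process, and `SerializedModel` and `FakeModel` are hypothetical names used only for illustration.

```python
import threading
import time


class SerializedModel:
    """Wrap a model so only one generate() call runs at a time in this process."""

    def __init__(self, model):
        self._model = model
        self._lock = threading.Lock()

    def generate(self, *args, **kwargs):
        # Concurrent callers block here and run one after another.
        with self._lock:
            return self._model.generate(*args, **kwargs)


class FakeModel:
    """Stand-in model that records the peak number of overlapping calls."""

    def __init__(self):
        self.active = 0
        self.peak = 0
        self._meter = threading.Lock()

    def generate(self, prompt):
        with self._meter:
            self.active += 1
            self.peak = max(self.peak, self.active)
        time.sleep(0.01)  # pretend to do work while other threads arrive
        with self._meter:
            self.active -= 1
        return prompt[::-1]


if __name__ == "__main__":
    fake = FakeModel()
    guarded = SerializedModel(fake)
    threads = [
        threading.Thread(target=guarded.generate, args=("abc",)) for _ in range(5)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print("peak overlapping calls:", fake.peak)
```

The lock adds no throughput; it only turns a race into a queue, so max_inputs=1 remains the simpler setting and lets Modal place the waiting inputs on other containers.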
Document the constraint inline so it doesn't get bumped back:
# Gemma 4 + Unsloth use a static KV cache and torch.compile-traced forward
# pass; neither is thread-safe across concurrent .generate() calls in the
# same Python process. Pin to 1 input per container — model-level parallelism
# comes from Modal spawning more containers, not intra-container batching.
# Bumping this back to >1 will silently break under concurrent load.
@modal.concurrent(max_inputs=1)
class Inference:
    ...
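A comment can still be overlooked, so a complementary sketch is to keep the limit in a named constant and validate it where the decorator is applied. The names `MAX_INPUTS_PER_CONTAINER` and `check_concurrency_limit` are hypothetical and not part of Modal's API.

```python
# Hypothetical module-level constant, intended to be passed to
# @modal.concurrent(max_inputs=...).
MAX_INPUTS_PER_CONTAINER = 1


def check_concurrency_limit(value: int) -> int:
    """Return value unchanged, or raise if it would allow concurrent generate() calls."""
    if value != 1:
        raise ValueError(
            f"max_inputs={value}: this model stack needs serial generate() calls; "
            "scale with more containers instead"
        )
    return value


if __name__ == "__main__":
    print(check_concurrency_limit(MAX_INPUTS_PER_CONTAINER))
```

Writing the decorator as @modal.concurrent(max_inputs=check_concurrency_limit(MAX_INPUTS_PER_CONTAINER)) would then make a bumped value fail when the module is imported, rather than under load.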