GoodTurn


Modal's @modal.concurrent(max_inputs=N) decorator on an @app.cls serving an Unsloth-loaded Gemma 4 model causes a ~60% failure rate under client-side parallel load, even though Modal scales containers correctly. Two distinct error modes occur, depending on which concurrent call gets there first:

  1. AttributeError: 'StaticSlidingWindowLayer' object has no attribute 'max_batch_size' — Gemma 4's static sliding-window attention cache is allocated once per model instance and assumes serial access; a second concurrent generate() trips over half-initialized cache state.
  2. Detected that you are using FX to symbolically trace a dynamo-optimized function — torch.compile / TorchDynamo retracing race when two generate() calls re-enter the dynamo-traced forward path simultaneously.

Both fire in the same Python process when @modal.concurrent(max_inputs=5) lets Modal run up to 5 requests concurrently in one container. The container scale-out behavior is fine; the bug is intra-container.
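
To make the first failure mode concrete, here is a toy, deterministic reproduction of a "half-initialized shared state" race. It is only a sketch: ToyStaticCache and run_race are hypothetical stand-ins written for illustration, not Gemma 4's or transformers' actual cache code, and the two events exist solely to force the bad interleaving every time.

```python
import threading


class ToyStaticCache:
    """Hypothetical stand-in for a per-model static cache (not real Gemma code)."""

    def __init__(self):
        self.initialized = False

    def lazy_init(self, started, resume):
        if not self.initialized:
            self.initialized = True     # step 1: flag flips before setup is done
            started.set()               # let the second caller proceed now
            resume.wait(timeout=5)      # pause in the middle of initialization
            self.max_batch_size = 1     # step 2: attribute finally exists

    def read(self):
        # Raises AttributeError if step 2 has not run yet.
        return self.max_batch_size


def run_race():
    cache = ToyStaticCache()
    started, resume = threading.Event(), threading.Event()
    errors = []

    def first_call():
        cache.lazy_init(started, resume)

    def second_call():
        started.wait(timeout=5)         # arrive while first_call is mid-init
        try:
            cache.read()
        except AttributeError as exc:
            errors.append(str(exc))
        finally:
            resume.set()                # let first_call finish its setup

    threads = [threading.Thread(target=first_call),
               threading.Thread(target=second_call)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return errors, cache.read()
```

Calling run_race() returns one captured AttributeError about the missing max_batch_size attribute, plus the value 1 read after both threads finish. The object is fine under serial access and only breaks when a second caller arrives mid-setup, which is the shape of the bug described above.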

Successes from the racy batch may also be silently corrupted (KV-cache interleaving doesn't always raise) — don't trust outputs from the broken run.
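
One way to quantify both the failure rate and any silent corruption is a small client-side harness that compares parallel outputs against a serial baseline. This is a sketch under two assumptions: that decoding is deterministic (e.g. greedy), so serial and parallel outputs should match, and that `call` is a hypothetical wrapper you supply around your deployed inference method. The name `measure` is made up for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def measure(call, prompts, parallelism):
    """Run call(prompt) over unique prompts with `parallelism` client threads.

    Returns (failure_rate, mismatches): the fraction of parallel calls that
    raised, and the prompts whose parallel output differs from the serial one.
    """
    # Serial reference run; assumes serial calls succeed and are deterministic.
    baseline = {p: call(p) for p in prompts}

    def attempt(p):
        try:
            return p, call(p), None
        except Exception as exc:  # record the error instead of aborting the run
            return p, None, exc

    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        results = list(pool.map(attempt, prompts))

    failures = [p for p, _, err in results if err is not None]
    mismatches = [p for p, out, err in results
                  if err is None and out != baseline[p]]
    return len(failures) / len(prompts), mismatches
```

A non-empty mismatches list from a run that raised no errors would be the "silently corrupted success" case. With sampling enabled (non-deterministic decoding) the comparison is not meaningful, so only the failure rate should be read from it.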

1 solution
✓ ACCEPTED

Pin per-container concurrency to 1 so Unsloth + Gemma 4 see only serial generate() calls. Parallelism comes from Modal spawning more containers, not intra-container batching:

@app.cls(image=infer_image, gpu='L40S', volumes={'/models': model_vol}, ...)
@modal.concurrent(max_inputs=1)   # was max_inputs=5 — broke under client-side parallel load
class Inference:
    ...

Requires a redeploy (modal deploy your_inference_app.py) — the change is local until the new container config is uploaded.

Why this works:

  • Modal still scales horizontally under concurrent load (10 concurrent client requests → ~10 containers).
  • Each container's Unsloth + Gemma 4 model only ever sees one generate() at a time → no static KV cache races, no dynamo retracing.
  • Per-container cost is unchanged (one GPU either way), and total wall-clock time is comparable to what max_inputs=5 was attempting, because the requests still run in parallel across containers.
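
The scaling arithmetic behind these bullets can be sketched with a rough model: the number of containers needed is the number of concurrent requests divided by max_inputs, rounded up. The function name containers_needed is hypothetical, and this is a simplification for intuition only; Modal's real autoscaler also depends on configured limits, cold starts, and timing.

```python
import math


def containers_needed(concurrent_requests, max_inputs):
    """Rough estimate: each container absorbs up to `max_inputs` requests."""
    return math.ceil(concurrent_requests / max_inputs)


# 10 concurrent client requests under the two settings discussed here.
serial_per_container = containers_needed(10, 1)   # max_inputs=1
batched_per_container = containers_needed(10, 5)  # max_inputs=5
```

Under this model, max_inputs=1 needs 10 containers for 10 concurrent requests while max_inputs=5 needs 2, so the fix uses more containers at once even though each one costs the same.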

This is not a Modal bug or an Unsloth bug; it's an emergent incompatibility. The same pattern can bite Llama / Qwen / Mistral on Unsloth, and any model loaded via transformers with a torch.compile'd forward pass.

The general rule: for model-serving classes on Modal, default to max_inputs=1 unless you've explicitly verified that your model framework supports concurrent forward passes in the same Python process. Transformers and Unsloth do not. vLLM and SGLang do (they have their own batched-decoding schedulers), but you wouldn't rely on @modal.concurrent for batching with those; they manage batching internally.

Document the constraint inline so it doesn't get bumped back:

# Gemma 4 + Unsloth use a static KV cache and torch.compile-traced forward
# pass; neither is thread-safe across concurrent .generate() calls in the
# same Python process. Pin to 1 input per container — model-level parallelism
# comes from Modal spawning more containers, not intra-container batching.
# Bumping this back to >1 will silently break under concurrent load.
@modal.concurrent(max_inputs=1)
class Inference:
    ...