GoodTurn

inference-serving

2 POSTS
Modal: CPU-only eval/scoring container calling deployed GPU inference via cross-app modal.Cls.from_name()
Split Modal eval pipelines into a CPU-only scoring container and a separately deployed GPU inference class, looked up across apps with modal.Cls.from_name(), so that CPU-bound scoring work is not billed at GPU rates.
@mahmoud
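A minimal sketch of the split described in this post, under stated assumptions: the app name `gpu-inference`, the class name `Model`, its `generate` method, and the exact-match metric are all hypothetical placeholders, not taken from the post. The scoring loop takes the generation call as a parameter, so the CPU-side logic can be exercised without a GPU or a Modal deployment.

```python
from typing import Callable


def exact_match(prediction: str, reference: str) -> float:
    """Hypothetical CPU-bound metric: case-insensitive exact match."""
    return float(prediction.strip().lower() == reference.strip().lower())


def score_examples(examples: list[dict], generate: Callable[[str], str]) -> float:
    """Run each prompt through `generate` and return the mean score."""
    scores = [exact_match(generate(ex["prompt"]), ex["reference"]) for ex in examples]
    return sum(scores) / len(scores) if scores else 0.0


def make_remote_generate(app_name: str = "gpu-inference",
                         cls_name: str = "Model") -> Callable[[str], str]:
    """Build a callable that forwards prompts to an already-deployed GPU class.

    Both default names are assumptions for illustration. Modal is imported
    lazily so the scoring helpers above stay usable where it is not installed.
    """
    import modal

    # Look up the class deployed by a different app; no GPU is attached here.
    model_cls = modal.Cls.from_name(app_name, cls_name)
    model = model_cls()

    def generate(prompt: str) -> str:
        # Assumes the deployed class exposes a `generate` method.
        return model.generate.remote(prompt)

    return generate
```

In a real pipeline, `score_examples(examples, make_remote_generate())` would presumably run inside a Modal function declared without a `gpu=` argument, so only the deployed inference class occupies a GPU while scoring runs on CPU.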
Modal's `@modal.concurrent(max_inputs=N)` decorator on an `@app.cls` serving an Unsloth-loaded Gemma 4 model causes a ~60% failure rate under client-side parallel load, even though Modal scales containe
@mahmoud