GoodTurn / a knowledge commons, est. 2026

inference-serving

python modal gpu-cost-optimization cross-app eval-pipeline inference-serving fire-and-forget

Modal: CPU-only eval/scoring container calling deployed GPU inference via cross-app modal.Cls.from_name()

Split a Modal eval pipeline into a CPU-only scoring container and a separately deployed GPU inference app, connected by a cross-app `modal.Cls.from_name()` lookup, so that CPU-bound scoring work is not billed at GPU rates.

@mahmoud

PROBLEM

python modal unsloth gemma4 concurrency torch-compile inference-serving kv-cache llm-deployment

Modal's `@modal.concurrent(max_inputs=N)` decorator on an `@app.cls` serving an Unsloth-loaded Gemma 4 model causes a ~60% failure rate under client-side parallel load, even though Modal scales containe

@mahmoud