Split Modal eval pipelines into CPU scoring container + deployed GPU inference via cross-app modal.Cls.from_name() to avoid paying GPU rates for CPU-bound scoring work.
When building eval/benchmark pipelines on Modal, the default approach is a single GPU container that loads the model, generates text, and scores it. But if you already have inference deployed as a persistent app, the eval job doesn't need its own GPU — it can call the deployed endpoint cross-app and do all scoring on CPU.
# modal_eval.py: CPU-only, no GPU, no torch
import json
from pathlib import Path

import modal

app = modal.App('my-eval')
data_vol = modal.Volume.from_name('my-data', create_if_missing=True)

# Minimal image: no torch/triton/xformers/unsloth
eval_image = (
    modal.Image.debian_slim(python_version='3.11')
    .pip_install('numpy', 'scipy', 'scikit-learn', 'anthropic')
    .add_local_python_source('mypackage')
    .add_local_dir('data/corpus', '/corpus')
)


def score_locally(text: str) -> float:
    # Placeholder for the real CPU-bound scoring (metrics, stylometry, etc.)
    return float(len(text.split()))


@app.function(
    image=eval_image,
    volumes={'/data': data_vol},
    timeout=2 * 60 * 60,
    # No gpu= parameter, so this runs on CPU
)
def run_eval_job(count: int = 10) -> dict:
    # Cross-app call to the ALREADY-DEPLOYED inference endpoint
    Inference = modal.Cls.from_name('my-inference-app', 'Inference')
    inference = Inference()  # one handle, reused for every prompt

    # Placeholder prompts; load the real ones from /corpus or /data
    prompts = [f'Eval prompt {i}' for i in range(count)]

    results = []
    for prompt in prompts:
        text = inference.generate.remote(prompt, temperature=0.7)
        score = score_locally(text)  # CPU-bound: metrics, stylometry, etc.
        results.append({'text': text, 'score': score})

    # Write results onto the mounted volume
    out_dir = Path('/data/evals')
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / 'results.json').write_text(json.dumps(results, indent=2))
    data_vol.commit()  # critical: makes writes visible to `modal volume get`
    return {'status': 'done'}

modal.Cls.from_name('app-name', 'ClassName') returns a handle to a class in a different deployed Modal app. Calls on that handle go through Modal's RPC layer: the eval container sends the request, and the inference container (with GPU) handles generation and returns the result. The eval container never loads a model.
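Because the eval loop depends only on the Inference().generate.remote(...) call shape, it can be exercised locally without Modal or a GPU. The sketch below is one way to do that, not part of the original pipeline: FakeInference, _FakeMethod, and evaluate are hypothetical names introduced for illustration, and the echo "model" stands in for real generation.

```python
from typing import Callable


class _FakeMethod:
    """Hypothetical helper: mimics the .remote() call on a Modal method handle."""

    def __init__(self, fn: Callable[..., str]) -> None:
        self._fn = fn

    def remote(self, *args, **kwargs) -> str:
        # Runs locally; the real handle would dispatch to the deployed app.
        return self._fn(*args, **kwargs)


class FakeInference:
    """Hypothetical stand-in exposing generate.remote() like the class handle."""

    def __init__(self) -> None:
        self.generate = _FakeMethod(
            lambda prompt, temperature=0.7: f'echo: {prompt}'
        )


def evaluate(inference_cls, prompts, score):
    """Generate text for each prompt and score it on the caller's CPU."""
    inference = inference_cls()  # instantiate once, reuse for all prompts
    results = []
    for prompt in prompts:
        text = inference.generate.remote(prompt, temperature=0.7)
        results.append({'text': text, 'score': score(text)})
    return results


# Local dry run with a word-count scorer.
results = evaluate(FakeInference, ['a b', 'c'], score=lambda t: len(t.split()))
```

Inside the Modal function, the same evaluate() could presumably be called with modal.Cls.from_name('my-inference-app', 'Inference') in place of FakeInference, which keeps the scoring logic testable on a laptop.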
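For a 30-minute eval with 10 prompts (each taking ~30s to generate), generation accounts for 10 × 30s = 5 min, so you pay for ~5 min of GPU time instead of 30 min. The remaining ~25 min of scoring is billed at CPU rates.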
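The arithmetic behind that estimate can be written out as a small sketch. The function name and the cold_start_seconds parameter are illustrative assumptions, and the calculation ignores any idle time the inference container may stay warm (and billed) before it scales down.

```python
def gpu_minutes_split(n_prompts: int, gen_seconds: float,
                      cold_start_seconds: float = 0.0) -> float:
    """GPU minutes consumed when only generation runs on the GPU."""
    return (n_prompts * gen_seconds + cold_start_seconds) / 60.0


# Single-container design: the GPU is held for the whole 30-minute job.
single_container = 30.0

# Split design: 10 prompts at ~30 s each.
split = gpu_minutes_split(10, 30)

# Same job, assuming one 60 s cold start on the first call.
split_with_cold_start = gpu_minutes_split(10, 30, cold_start_seconds=60)

# Fraction of GPU time avoided by moving scoring to a CPU container.
saved_fraction = 1 - split / single_container
```

Under these assumptions the split design uses 5 GPU-minutes (6 with a 60s cold start), roughly five sixths less than the single-container job.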
If the inference app is not deployed, the modal.Cls.from_name() lookup fails immediately; the health check catches this within 60s.

The first .remote() call triggers a cold start (~30-60s for model loading). Subsequent calls hit warm containers.

Call data_vol.commit() after writing results. Without it, modal volume get after job completion may miss the just-written files.
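One way to surface both a missing deployment and the cold-start delay before the main loop is a single probe call at the start of the job. This is a minimal sketch, not the health check mentioned above: warm_up and the 'ping' probe prompt are hypothetical, and the generate callable is injected so the helper can be tried without Modal.

```python
import time
from typing import Callable


def warm_up(generate: Callable[[str], str], probe: str = 'ping') -> float:
    """Send one probe request; return elapsed seconds or raise if it fails."""
    start = time.monotonic()
    try:
        generate(probe)
    except Exception as exc:
        # Fail fast with a clear message instead of partway through the eval.
        raise RuntimeError(f'inference endpoint unavailable: {exc}') from exc
    return time.monotonic() - start


# Local demonstration with a stand-in for the remote call.
elapsed = warm_up(lambda prompt: 'ok')


def _broken(prompt: str) -> str:
    raise ConnectionError('app not deployed')


try:
    warm_up(_broken)
    failed_fast = False
except RuntimeError:
    failed_fast = True
```

In the eval job, the callable would presumably wrap the real handle, for example warm_up(lambda p: inference.generate.remote(p)), so the first (slow) call happens once and its latency can be logged separately from per-prompt timings.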