Modal: CPU-only eval/scoring container calling deployed GPU inference via cross-app modal.Cls.from_name()

Pattern

When building eval/benchmark pipelines on Modal, the default approach is a single GPU container that loads the model, generates text, and scores it. But if you already have inference deployed as a persistent app, the eval job doesn't need its own GPU — it can call the deployed endpoint cross-app and do all scoring on CPU.

Architecture

# modal_eval.py — CPU-only, no GPU, no torch
import modal

app = modal.App('my-eval')
data_vol = modal.Volume.from_name('my-data', create_if_missing=True)

# Minimal image — no torch/triton/xformers/unsloth
eval_image = (
    modal.Image.debian_slim(python_version='3.11')
    .pip_install('numpy', 'scipy', 'scikit-learn', 'anthropic')
    .add_local_python_source('mypackage')
    .add_local_dir('data/corpus', '/corpus')
)

@app.function(
    image=eval_image,
    volumes={'/data': data_vol},
    timeout=2 * 60 * 60,
    # No gpu= parameter — runs on CPU
)
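def run_eval_job(count: int = 10) -> dict:
    # Hypothetical helpers assumed to live in mypackage; swap in your own.
    from mypackage import load_prompts, score_locally, save_results

    # Cross-app handle to the ALREADY-DEPLOYED inference class
    Inference = modal.Cls.from_name('my-inference-app', 'Inference')
    inference = Inference()  # instantiate once and reuse for every call

    # /corpus is the directory baked into the image above
    prompts = load_prompts('/corpus', limit=count)

    results = []
    for prompt in prompts:
        # Generation runs on the remote GPU container; only the text comes back
        text = inference.generate.remote(prompt, temperature=0.7)
        score = score_locally(text)  # CPU-bound: metrics, stylometry, etc.
        results.append({'text': text, 'score': score})

    save_results(results, '/data/evals/')
    data_vol.commit()  # persist writes now so `modal volume get` sees them
    return {'status': 'done'}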

Why this works

modal.Cls.from_name('app-name', 'ClassName') returns a handle to a class defined in a different deployed Modal app. Instantiating that handle and calling .generate.remote(...) goes through Modal's RPC layer: the eval container sends the request, an inference container (with GPU) runs the generation and returns the result. The eval container never loads a model.
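
Because the eval loop only needs an object that exposes generate.remote(...), the scoring logic can be exercised locally with a stand-in before any GPU time is spent. The following is a minimal sketch under that assumption; StubInference, _StubMethod, evaluate, and the length-based scorer are hypothetical names for illustration and are not part of Modal or of the original job.

```python
from typing import Callable


class _StubMethod:
    """Mimics the `.remote(...)` call surface of a remote method handle."""

    def __init__(self, fn: Callable[..., str]) -> None:
        self._fn = fn

    def remote(self, *args, **kwargs) -> str:
        return self._fn(*args, **kwargs)


class StubInference:
    """Hypothetical local stand-in for the deployed Inference class."""

    def __init__(self) -> None:
        self.generate = _StubMethod(self._generate)

    @staticmethod
    def _generate(prompt: str, temperature: float = 0.7) -> str:
        # Deterministic fake generation so scoring can be checked by hand.
        return f"echo: {prompt}"


def evaluate(inference, prompts: list, score: Callable[[str], float]) -> list:
    """Same loop shape as run_eval_job: generate remotely, score locally."""
    results = []
    for prompt in prompts:
        text = inference.generate.remote(prompt, temperature=0.7)
        results.append({"text": text, "score": score(text)})
    return results


# Local dry run with a toy scorer (text length).
local_results = evaluate(StubInference(), ["a", "bb"], score=lambda t: float(len(t)))
```

Inside the Modal function, the same evaluate() would receive an instance obtained from modal.Cls.from_name('my-inference-app', 'Inference') in place of the stub.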

Cost impact

GPU eval container (L40S): ~$2/hr for the entire eval, even though generation is <20% of wall-clock time
CPU eval container calling deployed inference: the eval container is billed at CPU-only rates, a small fraction of the GPU price; inference containers are billed only while they are running (generation, plus any cold start and idle time before scale-down) and may already be warm from production traffic

For a 30-minute eval with 10 prompts (each taking ~30s to generate), you pay for ~5 min of GPU time (10 × 30s) instead of 30 min, plus whatever cold-start or idle time the inference containers accrue.
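
The arithmetic behind that comparison can be sketched as follows. The GPU rate is the approximate figure quoted above; the CPU rate is a hypothetical placeholder, not a published price, and cold-start and idle billing are ignored for simplicity.

```python
GPU_RATE_PER_HR = 2.00  # approximate L40S rate quoted above
CPU_RATE_PER_HR = 0.05  # hypothetical placeholder; check current pricing


def cost(minutes: float, rate_per_hr: float) -> float:
    """Cost in dollars of running for `minutes` at an hourly rate."""
    return minutes / 60 * rate_per_hr


eval_minutes = 30
gpu_busy_minutes = 10 * 30 / 60  # 10 prompts x ~30 s each = 5 min

# Option 1: one GPU container for the whole eval.
single_gpu_container = cost(eval_minutes, GPU_RATE_PER_HR)

# Option 2: CPU eval container for 30 min + GPU only while generating.
split_setup = cost(gpu_busy_minutes, GPU_RATE_PER_HR) + cost(eval_minutes, CPU_RATE_PER_HR)

print(f"single GPU container:          ${single_gpu_container:.2f}")
print(f"CPU eval + deployed inference: ${split_setup:.2f}")
```

Under these assumed rates the two options come to roughly $1.00 and $0.19; the exact ratio depends on real pricing and on how much idle GPU time the deployed app accrues.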

Caveats

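The inference app must be deployed first. If it isn't, the lookup fails. In recent Modal releases modal.Cls.from_name() is lazy, so the error may only surface on the first .remote() call rather than at lookup time; the health check catches this within 60s.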
You're testing what's deployed, not a specific checkpoint. This is a feature for eval (test the production config) but a limitation for A/B testing different adapters. For adapter comparison, use a GPU container that loads each adapter.
Cold start: If no inference container is warm, the first .remote() call triggers a cold start (~30-60s for model loading). Subsequent calls hit warm containers.
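Volume commit: Always call data_vol.commit() after writing results. Modal also commits Volume changes in the background and at container shutdown, but an explicit commit ensures the files are persisted before the function returns; without it, modal volume get run right after the job may miss the just-written files.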
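
The first and third caveats (missing deployment, cold start) can both be handled by making one cheap call before the main loop and timing it. The following is a minimal sketch of such a preflight, not the health check referred to above; preflight, flaky, and the retry parameters are hypothetical names and values, and the broad except is deliberate because no specific Modal exception type is assumed here.

```python
import time
from typing import Callable, Optional, Tuple, TypeVar

T = TypeVar("T")


def preflight(call: Callable[[], T], attempts: int = 2, wait_s: float = 5.0) -> Tuple[T, float]:
    """Run one cheap call first; return (result, seconds taken).

    Raises RuntimeError if every attempt fails, for example because the
    inference app was never deployed.
    """
    last_error: Optional[Exception] = None
    for attempt in range(1, attempts + 1):
        start = time.monotonic()
        try:
            result = call()
            return result, time.monotonic() - start
        except Exception as exc:  # broad on purpose; exact type not assumed
            last_error = exc
            if attempt < attempts:
                time.sleep(wait_s)
    raise RuntimeError(f"inference preflight failed after {attempts} attempts") from last_error


# Stand-in callable that fails once and then succeeds. In the eval job this
# would be something like: lambda: inference.generate.remote("ping")
calls = {"n": 0}


def flaky() -> str:
    calls["n"] += 1
    if calls["n"] == 1:
        raise ConnectionError("simulated first-call failure")
    return "pong"


reply, seconds = preflight(flaky, attempts=2, wait_s=0.0)
```

A slow but successful preflight suggests a cold start rather than a fault, so the measured duration is worth logging separately from the per-prompt generation times.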