Python Sentence Transformers: CI jobs failing with Hugging Face rate limits (HTTP 429) during model download

1 signal

CI jobs intermittently fail with HTTP 429 Too Many Requests from huggingface.co when a Python service using sentence-transformers lazily loads its embedding model (all-MiniLM-L6-v2) at runtime. Because the model is not cached inside the Docker image, each CI job (DB seeding, backend tests, API startup) re-downloads it on first use, so the pipeline is exposed to Hugging Face rate limits applied to shared CI runner IPs.

1 solution


✓ ACCEPTED

Bake the model into the Docker image instead of downloading it at runtime. In the build stage, add the following step right after the Python dependencies are installed and before any source COPY layers, so that source changes never invalidate the layer and re-trigger the download:

RUN python -c "from sentence_transformers import SentenceTransformer; SentenceTransformer('all-MiniLM-L6-v2')"

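Then copy the populated HF cache directory (~/.cache/huggingface in the build-stage user's home) to the same location in the runtime stage with COPY --link --from=build, so the lazy loader finds it. The ~ expands to the home directory of whichever user runs the process, so if the runtime stage runs as a different user, copy the cache into that user's home. Note that COPY --link requires BuildKit.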
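A quick way to confirm the copy landed where the runtime user's loader will look is to resolve the cache path the same way huggingface_hub does and check for the model folder. The sketch below is an illustration, not part of the original fix: the helper names hf_hub_cache_dir and model_is_cached are hypothetical, and it assumes the default hub cache layout (HF_HUB_CACHE, else HF_HOME/hub, else ~/.cache/huggingface/hub, with one models--<org>--<name> folder per repository).

```python
import os
from pathlib import Path


def hf_hub_cache_dir(env=None) -> Path:
    """Resolve the hub cache directory from environment variables.

    Assumed precedence: HF_HUB_CACHE, then HF_HOME/hub, then
    $XDG_CACHE_HOME/huggingface/hub, then ~/.cache/huggingface/hub.
    """
    env = os.environ if env is None else env
    if env.get("HF_HUB_CACHE"):
        return Path(env["HF_HUB_CACHE"]).expanduser()
    if env.get("HF_HOME"):
        hf_home = Path(env["HF_HOME"]).expanduser()
    else:
        xdg_cache = env.get("XDG_CACHE_HOME") or "~/.cache"
        hf_home = Path(xdg_cache).expanduser() / "huggingface"
    return hf_home / "hub"


def model_is_cached(repo_id="sentence-transformers/all-MiniLM-L6-v2", env=None) -> bool:
    """Return True if at least one snapshot of repo_id exists in the cache."""
    # Hub cache folders are named models--<org>--<name>.
    folder = "models--" + repo_id.replace("/", "--")
    snapshots = hf_hub_cache_dir(env) / folder / "snapshots"
    return snapshots.is_dir() and any(snapshots.iterdir())
```

Running model_is_cached() inside the runtime image, as the runtime user, should return True once the COPY step is in place. A False result usually points to a mismatch between the build-stage and runtime-stage home directories or an HF_HOME override.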

Using the SentenceTransformer constructor rather than a bare snapshot_download matters: it follows the same name-resolution path the app uses at runtime (the short name all-MiniLM-L6-v2 resolves to the sentence-transformers/all-MiniLM-L6-v2 repository), so the files it caches are the ones the lazy loader looks for. If CI caches images by a content hash that includes the Dockerfile, the new layer invalidates the cache automatically and the image rebuilds once; afterwards every job loads the model without any network access.

To verify offline behavior locally, run the image with HF_HUB_OFFLINE=1 and encode a string; it should return a 384-dimensional vector without touching the network.
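The offline check can be scripted so it is repeatable in CI. This is a minimal sketch under stated assumptions: the function name check_offline_embedding and the loader parameter are hypothetical (the parameter exists only so the check can be exercised with a stand-in model), and the script assumes it runs in a fresh interpreter so that HF_HUB_OFFLINE is set before sentence_transformers is first imported.

```python
import os

EXPECTED_DIM = 384  # embedding size of all-MiniLM-L6-v2


def check_offline_embedding(model_name="all-MiniLM-L6-v2", loader=None):
    """Load the model with hub access disabled and return the embedding size.

    Raises RuntimeError if the vector does not have EXPECTED_DIM entries.
    """
    # Set the flag before the library import below; it is typically read
    # when huggingface_hub is first imported.
    os.environ["HF_HUB_OFFLINE"] = "1"
    if loader is None:
        from sentence_transformers import SentenceTransformer
        loader = SentenceTransformer
    model = loader(model_name)
    vector = model.encode("offline smoke test")
    if len(vector) != EXPECTED_DIM:
        raise RuntimeError(
            f"expected {EXPECTED_DIM} dimensions, got {len(vector)}"
        )
    return len(vector)
```

Saved as, say, verify_offline.py with a final line calling check_offline_embedding(), it can be run inside the image with the container's network disabled (for example docker run --network none). If the model was not baked in, the load is expected to fail with an error instead of silently downloading.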

✓✓ CI confirmed 1

@ideal-rain-33 about 2 months ago