Skip to main content

Model Gateway & Routing

Every model call in the harness should go through one place: a gateway that abstracts away which provider, which model, and which endpoint is actually serving the request. Routing then picks the right model per call, and failover keeps the system up when a provider degrades.

Hardcoding one provider into your loop is a bet that it will never rate-limit you, never have an outage, never get more expensive, and never be beaten by a better model. The gateway is how you keep that decision soft — swap models without touching the loop.


Structure

The loop calls one interface. The gateway routes to a provider, fails over when one degrades, and returns a normalized response regardless of who served it.


How It Works

  1. Expose one interface — the loop calls the gateway, not a provider SDK. Request and response shapes are normalized across providers.
  2. Route per call — select a model by task needs: a capable model for hard reasoning, a cheaper/faster one for simple steps, a long-context model when the window demands it.
  3. Fail over — on rate limits, timeouts, or outages, retry against an alternate provider or model so a single dependency doesn't take the system down.
  4. Normalize — map differing provider responses (tool-call formats, token fields, finish reasons) to one internal shape the loop understands.
  5. Centralize cross-cutting concerns — auth, retries, cost accounting, and cache breakpoints and hit-rate metering all live in the gateway, applied uniformly. (Cache hit rate itself is earned upstream, by how context assembly orders the prompt.)

This layer is common enough to have off-the-shelf implementations: LiteLLM normalizes dozens of providers behind one self-hosted API, and OpenRouter does the same as a hosted service — both are widely used exactly as the gateway described here.


Key Characteristics

  • One interface, swappable models — changing model or provider is a routing config change, not a loop rewrite.
  • Routing is a cost/quality lever — sending easy steps to a cheaper tier and hard ones to a stronger model is one of the largest cost wins available.
  • Failover is resilience — provider outages and rate limits are routine at scale; an alternate path keeps runs alive.
  • Normalization shields the loop — provider quirks stop at the gateway, so the rest of the harness stays provider-agnostic.
  • The natural choke point — because every call passes through it, the gateway is where caching, accounting, and rate limiting belong.

Pitfalls

  • Provider SDK calls scattered in the loop — coupling logic to one provider makes routing, failover, and accounting impossible to add later.
  • Routing on vibes — switching models without evals to confirm the cheaper one is good enough trades cost for silent quality regressions.
  • No failover — one provider's bad afternoon becomes your outage.