The Inference Gateway: the security boundary

The component that holds the keys and guards the egress: its pipeline, the threat model around it, what anonymization does to outbound prompts, how model routing stays inspectable, and where the inference-tier boundary sits.

Inference Gateway pipeline

The gateway is shown as a single service in the main diagram; its internal request flow matters enough to call out separately. Every inbound request follows this pipeline:

            ┌──────────────┐
   Inbound  │ Auth         │  (API key resolution; rejects unauthenticated)
   request─▶│              │
            └──────┬───────┘
                   ▼
            ┌──────────────┐
            │ Router       │  (provider/model selection; fallback chains)
            └──────┬───────┘
                   ▼
            ┌──────────────┐
            │ Rate Limit   │  (Redis token bucket; per-key, per-model)
            └──────┬───────┘
                   ▼
            ┌──────────────┐
            │ Tier         │  (annotates request with routed_inference_tier 1–5;
            │ Derivation   │   refuses if below skill or Project minimum)
            └──────┬───────┘
                   ▼
            ┌──────────────┐
            │ Anonymization│  (M2; pseudonymizes sensitive entities;
            │ — pre        │   stable mapping for the request lifetime)
            └──────┬───────┘
                   ▼
            ┌──────────────┐
            │ Provider     │  (HTTP/gRPC to Anthropic / OpenAI / Azure /
            │ Adapter      │   Ollama / vLLM; Vertex + Bedrock deferred — DE-034/035)
            └──────┬───────┘
                   │
                 (response)
                   │
                   ▼
            ┌──────────────┐
            │ Anonymization│  (M2; rehydrates pseudonyms in response and
            │ — post       │   inside cited chunks; mapping discarded)
            └──────┬───────┘
                   ▼
            ┌──────────────┐
            │ Cost Tracker │  (tokens × per-model rates; tagged for analytics)
            └──────┬───────┘
                   ▼
            ┌──────────────┐
   Outbound │ Telemetry    │  (OTel traces; Langfuse if configured)
   response │              │
   ◀────────│              │
            └──────────────┘
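
The stage ordering in the diagram can be sketched as two ordered lists around the provider call. This is a minimal illustration, not the gateway's implementation: the `Context`, `Rejected`, `stage`, and `handle` names are hypothetical, and each stage is a placeholder that only records that it ran. Auth is the one stage given a real check here, to show how a rejection stops the request before it reaches the provider.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


class Rejected(Exception):
    """Raised by a stage to stop the pipeline (hypothetical name)."""


@dataclass
class Context:
    api_key: Optional[str]
    prompt: str
    trace: List[str] = field(default_factory=list)  # stages run, in order
    response: Optional[str] = None


Stage = Callable[[Context], None]


def stage(name: str, check: Callable[[Context], bool] = lambda ctx: True) -> Stage:
    """Build a placeholder stage that records its name or rejects."""
    def run(ctx: Context) -> None:
        if not check(ctx):
            raise Rejected(name)
        ctx.trace.append(name)
    return run


# Order mirrors the diagram: everything above the Provider Adapter box.
REQUEST_STAGES = [
    stage("auth", lambda ctx: ctx.api_key is not None),
    stage("router"),
    stage("rate_limit"),
    stage("tier_derivation"),
    stage("anonymize_pre"),
]

# Everything below the Provider Adapter box, applied to the response.
RESPONSE_STAGES = [
    stage("anonymize_post"),
    stage("cost_tracker"),
    stage("telemetry"),
]


def handle(ctx: Context, provider: Callable[[str], str]) -> Context:
    for run in REQUEST_STAGES:
        run(ctx)  # any stage may raise Rejected before egress
    ctx.response = provider(ctx.prompt)
    ctx.trace.append("provider_adapter")
    for run in RESPONSE_STAGES:
        run(ctx)
    return ctx
```

In this sketch an unauthenticated request raises `Rejected("auth")` and the provider callable is never invoked, which is the property the ordering is meant to guarantee.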

The pipeline is what makes the Inference Tier model operationally real. The Tier Derivation stage is the choke point: every request gets classified, every classification is logged in the audit trail, and every UI surface reflects the actual routed tier in real time. The user does not have to take the application's word for it — they can verify the tier badge against the operator's gateway configuration and against the audit log entries.
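
A rough sketch of the Tier Derivation check follows. It rests on an assumption the text does not spell out: that a lower tier number means stronger protection (Tier 1 being local inference), so a request is "below" the minimum when its routed tier number is higher than the strictest required tier. The function name, the audit-entry fields, and the `TierRefused` exception are hypothetical; only `routed_inference_tier` comes from the pipeline description.

```python
from typing import Dict, List


class TierRefused(Exception):
    """Raised when the routed tier does not meet the required minimum."""


def derive_tier(
    routed_tier: int,
    skill_minimum: int,
    project_minimum: int,
    audit_log: List[Dict[str, object]],
) -> Dict[str, object]:
    for value in (routed_tier, skill_minimum, project_minimum):
        if not 1 <= value <= 5:
            raise ValueError(f"tier out of range 1-5: {value}")

    # ASSUMPTION: lower number = stricter tier, so the binding
    # requirement is the numerically smallest of the two minimums.
    required = min(skill_minimum, project_minimum)
    allowed = routed_tier <= required

    entry = {
        "routed_inference_tier": routed_tier,
        "required_tier": required,
        "allowed": allowed,
    }
    # Every classification is logged, including refusals.
    audit_log.append(entry)

    if not allowed:
        raise TierRefused(
            f"routed tier {routed_tier} does not meet required tier {required}"
        )
    return entry
```

Logging before raising keeps refusals visible in the audit trail, which is what lets a user compare the tier badge against the recorded decisions.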

The Anonymization stages bracket the provider adapter. Pseudonyms live in a mapping table for the duration of the request (in process memory only, never persisted) and are rehydrated on the way back so that the Citation Engine sees the original text for verification. Privilege-flagged Projects disable anonymization by default: for privileged content, the operator is better served by Tier 1 (local inference, no third-party touch) than by an anonymization layer whose extra processing steps complicate a privilege analysis.
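
The pre/post bracket can be illustrated with a small request-scoped mapper. This is a sketch under stated assumptions: entity detection is out of scope, so the sensitive entities are passed in as a plain list, and the `[ENTITY_n]` placeholder format and the `RequestPseudonymizer` class are invented for illustration rather than taken from the gateway.

```python
import re
from typing import Dict, Iterable


class RequestPseudonymizer:
    """Mapping held in process memory for one request only (hypothetical)."""

    def __init__(self, entities: Iterable[str]) -> None:
        # Longest first, so overlapping entities match their longest form.
        self._entities = sorted(set(entities), key=len, reverse=True)
        self._forward: Dict[str, str] = {}  # entity -> placeholder
        self._reverse: Dict[str, str] = {}  # placeholder -> entity

    def _placeholder(self, entity: str) -> str:
        # Stable mapping: the same entity always gets the same placeholder.
        if entity not in self._forward:
            token = f"[ENTITY_{len(self._forward) + 1}]"
            self._forward[entity] = token
            self._reverse[token] = entity
        return self._forward[entity]

    def pseudonymize(self, text: str) -> str:
        """Pre stage: replace known entities before the prompt leaves."""
        if not self._entities:
            return text
        pattern = re.compile("|".join(re.escape(e) for e in self._entities))
        return pattern.sub(lambda m: self._placeholder(m.group(0)), text)

    def rehydrate(self, text: str) -> str:
        """Post stage: restore originals in the response and cited chunks."""
        for token, entity in self._reverse.items():
            text = text.replace(token, entity)
        return text

    def discard(self) -> None:
        """Drop the mapping once the response has been rehydrated."""
        self._forward.clear()
        self._reverse.clear()
```

A request would call `pseudonymize` on the outbound prompt, `rehydrate` on the provider response, and then `discard`, so nothing about the mapping outlives the request.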

The provider adapters resolve their keys from the gateway's secret store, and that store can be administered at runtime. An admin-only /admin/v1/provider-keys surface (proxied by the backend at /api/v1/admin/provider-keys) writes a key, Fernet-encrypted, into gateway.yaml and hot-applies it to the live adapter without a restart, so keys can be added, rotated, and revoked without a redeploy (post-v0.4.0, #128; requires LQ_AI_GATEWAY_MASTER_KEY). Environment-supplied keys continue to work; the listing masks every key and reports its source (env vs runtime).
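
The masked listing could look something like the sketch below. The response shape, the function names, and the choice to reveal only the last four characters are assumptions made for illustration; the text specifies only that every key is masked and that its source is reported as env or runtime. Encryption and hot-apply are deliberately left out.

```python
from typing import Dict, List


def mask_key(key: str, visible: int = 4) -> str:
    """Hide a key, keeping only a short suffix (assumed masking policy)."""
    if len(key) <= visible:
        return "****"  # too short to reveal any part safely
    return "****" + key[-visible:]


def list_provider_keys(
    env_keys: Dict[str, str],
    runtime_keys: Dict[str, str],
) -> List[Dict[str, str]]:
    """Return one masked row per (provider, source) pair, sorted by provider."""
    rows = []
    for source, keys in (("env", env_keys), ("runtime", runtime_keys)):
        for provider, key in keys.items():
            rows.append(
                {"provider": provider, "source": source, "key": mask_key(key)}
            )
    return sorted(rows, key=lambda r: (r["provider"], r["source"]))
```

Listing both sources separately sidesteps the question of which one takes precedence when a provider has a key in each, which the text does not state.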

For the full gateway specification, including the configuration YAML and the OpenAPI surface, see PRD §4.