LLMs in Production: 5 Architecture Patterns
Five patterns observed in production-grade LLM systems for managing cost, latency, and reliability.
Putting an LLM in production is not simply a matter of calling an OpenAI API from your backend. Once load increases, costs escalate, latency becomes critical, or reliability is contractually required, naive architectures collapse. Here are five patterns observed in production-grade LLM systems.
Pattern 1 — Simple Proxy
Context: a team getting started, low volume, need to centralize calls.
The simple proxy sits between your applications and the LLM APIs. It centralizes key management, rate limiting, logging, and observability. This is the entry-level pattern — almost always the right starting point.
Application → LLM Proxy → OpenAI / Anthropic / Mistral
                  ↓
    Logs, metrics, budget alerts
Tools: LiteLLM, Portkey, or a custom proxy in around a hundred lines. The proxy exposes an OpenAI-compatible interface, making migration to other providers transparent.
Tradeoffs: adds a network hop (10–30 ms). The proxy must be highly available, since a poorly designed one becomes a single point of failure for every LLM call.
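To make the "custom proxy" idea concrete, here is a minimal in-process sketch of the two responsibilities that matter most early on: per-application rate limiting and call logging. It is an illustration under assumptions, not how LiteLLM or Portkey work internally. The names `LLMProxy` and `TokenBucket` are hypothetical, and the provider backends are plain callables standing in for real API clients.

```python
import time
from typing import Callable, Dict, List


class TokenBucket:
    """Token-bucket rate limiter: refills at rate_per_sec, holds at most capacity tokens."""

    def __init__(self, rate_per_sec: float, capacity: float) -> None:
        self.rate_per_sec = rate_per_sec
        self.capacity = capacity
        self.tokens = capacity
        self.updated = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate_per_sec)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class LLMProxy:
    """Hypothetical proxy: one entry point, one rate-limit bucket per application, one log."""

    def __init__(self, backends: Dict[str, Callable[[str], str]],
                 rate_per_sec: float = 5.0, burst: float = 10.0) -> None:
        self.backends = backends          # provider name -> callable (stand-in for an API client)
        self.rate_per_sec = rate_per_sec
        self.burst = burst
        self.buckets: Dict[str, TokenBucket] = {}
        self.log: List[dict] = []         # in a real system this would feed metrics and alerts

    def complete(self, app: str, provider: str, prompt: str) -> str:
        if app not in self.buckets:
            self.buckets[app] = TokenBucket(self.rate_per_sec, self.burst)
        if not self.buckets[app].allow():
            self.log.append({"app": app, "provider": provider, "status": "rate_limited"})
            raise RuntimeError(f"rate limit exceeded for {app}")
        start = time.monotonic()
        response = self.backends[provider](prompt)
        self.log.append({"app": app, "provider": provider, "status": "ok",
                         "latency_s": time.monotonic() - start})
        return response
```

Because applications only see `complete(app, provider, prompt)`, swapping or adding a provider is a change to the `backends` mapping rather than to application code.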
Pattern 2 — Semantic Cache
Context: workloads with redundant queries (FAQ, customer support, document search).
The semantic cache stores each response alongside an embedding of the query that produced it, and matches incoming queries by vector similarity rather than exact equality. A query sufficiently close to a previously processed one returns the cached response without an LLM call.
    # Assumed to exist elsewhere: embed(), vector_store, and llm (async clients).
    async def query_with_cache(prompt: str, threshold: float = 0.92):
        embedding = await embed(prompt)
        # Look up the single nearest previously answered query.
        cached = await vector_store.similarity_search(embedding, top_k=1)
        if cached and cached[0].score >= threshold:
            return cached[0].response  # cache hit: ~5ms instead of ~800ms
        # Cache miss: call the LLM and store the result for future queries.
        response = await llm.complete(prompt)
        await vector_store.upsert(embedding, prompt, response)
        return response

Observed gains: 30 to 60% cost reduction on FAQ workloads in production. Perceived latency drops dramatically.
Tradeoffs: the similarity threshold is critical. Too low: incorrect responses returned. Too high: cache rarely effective. Requires an invalidation strategy for domains with changing data.
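The threshold and invalidation tradeoffs can be illustrated with a small in-memory cache. This is a sketch under assumptions: it uses cosine similarity over plain Python lists and a time-to-live (TTL) as the invalidation strategy, which is only one of several possible choices. The class name `InMemorySemanticCache` and its parameters are hypothetical and do not correspond to any particular vector store.

```python
import math
import time
from typing import List, Optional, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class InMemorySemanticCache:
    """Hypothetical cache: similarity threshold for hits, TTL for invalidation."""

    def __init__(self, threshold: float = 0.92, ttl_seconds: float = 3600.0) -> None:
        self.threshold = threshold
        self.ttl_seconds = ttl_seconds
        # Each entry: (embedding, response, timestamp of insertion)
        self._entries: List[Tuple[List[float], str, float]] = []

    def put(self, embedding: List[float], response: str, now: Optional[float] = None) -> None:
        now = time.time() if now is None else now
        self._entries.append((embedding, response, now))

    def get(self, embedding: List[float], now: Optional[float] = None) -> Optional[str]:
        now = time.time() if now is None else now
        # Invalidation: drop entries older than the TTL before searching.
        self._entries = [e for e in self._entries if now - e[2] <= self.ttl_seconds]
        best = max(self._entries, key=lambda e: cosine(embedding, e[0]), default=None)
        # Only the nearest entry is considered, and only if it clears the threshold.
        if best is not None and cosine(embedding, best[0]) >= self.threshold:
            return best[1]
        return None
```

Lowering `threshold` makes more queries hit the cache at the risk of returning an answer to a different question. A short `ttl_seconds` limits stale answers at the cost of more LLM calls.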
Pattern 3 — Multi-Model Router with Fallback
Context: cost/performance optimization, resilience against provider outages.
Not all queries require GPT-4. A router classifies request complexity and dispatches to the appropriate model: local or small models for simple tasks, large models for complex reasoning.
Request → Complexity Classifier
    ├── Simple (~80%)  → small local model
    │                    (no marginal API cost, but non-zero infra cost)
    ├── Medium (~15%)  → intermediate general-purpose model
    └── Complex (~5%)  → large frontier model
                             ↓ (fallback if timeout)
                         alternative frontier model
Example stack (January 2025 snapshot): Mistral 7B / Llama 3.1 8B locally, GPT-4o-mini or Haiku for intermediate, GPT-4o or Claude Opus for frontier. Prices and models evolve quickly — the tiers remain stable, the names do not.
Tradeoffs: the classifier itself consumes resources. The definition of "complexity" is often empirical and requires adjustment. Fallback must be tested regularly.
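The routing and fallback logic can be sketched with `asyncio`. Everything here is illustrative: the word-count classifier is a deliberately naive placeholder for whatever empirical classifier a team ends up with, the thresholds are arbitrary, and `route` and `classify` are hypothetical names. Models are represented as async callables rather than real clients.

```python
import asyncio
from typing import Awaitable, Callable, Dict

ModelFn = Callable[[str], Awaitable[str]]


def classify(prompt: str) -> str:
    """Toy complexity heuristic based on prompt length (an assumption, not a recommendation)."""
    words = len(prompt.split())
    if words < 20:
        return "simple"
    if words < 200:
        return "medium"
    return "complex"


async def route(prompt: str, tiers: Dict[str, ModelFn], fallback: ModelFn,
                timeout_s: float = 10.0) -> str:
    """Dispatch to the tier chosen by classify(); on timeout of the complex tier, use fallback."""
    tier = classify(prompt)
    try:
        return await asyncio.wait_for(tiers[tier](prompt), timeout=timeout_s)
    except asyncio.TimeoutError:
        if tier != "complex":
            raise  # in this sketch only the frontier tier has a fallback, as in the diagram
        return await fallback(prompt)
```

Exercising the timeout path in automated tests, for example with a deliberately slow stub model, is one way to keep the fallback from silently rotting.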
Pattern 4 — RAG Augmentation Layer
Context: domains where the LLM lacks necessary knowledge (internal documentation, recent data, proprietary corpora).
This pattern isolates the retrieval layer as an independent service. The LLM becomes a reasoning engine over dynamically provided facts, not an autonomous source of truth.
The layered architecture allows updating the knowledge corpus without redeploying the LLM, precisely tracing the sources of each response, and auditing hallucinations.
Tradeoffs: adds 200–400ms of latency for retrieval. Response quality is strongly correlated with index quality.
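A rough sketch of the separation described above: the retriever sits behind a narrow interface, and every answer carries the identifiers of the passages it was given, which is what makes source tracing possible. All names (`Passage`, `Retriever`, `build_prompt`, `answer`) are hypothetical, the prompt wording is only an example, and the LLM is a plain callable standing in for a real client.

```python
from dataclasses import dataclass
from typing import Callable, List, Protocol


@dataclass(frozen=True)
class Passage:
    source_id: str   # stable identifier of the document or chunk
    text: str


class Retriever(Protocol):
    """Interface of the retrieval service; the index behind it can change independently."""
    def search(self, query: str, top_k: int) -> List[Passage]: ...


@dataclass
class Answer:
    text: str
    sources: List[str]   # ids of the passages supplied to the model


def build_prompt(question: str, passages: List[Passage]) -> str:
    """Put retrieved passages in the prompt, each tagged with its source id."""
    context = "\n".join(f"[{p.source_id}] {p.text}" for p in passages)
    return (
        "Answer using only the context below and cite source ids in brackets.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


def answer(question: str, retriever: Retriever, llm: Callable[[str], str],
           top_k: int = 3) -> Answer:
    passages = retriever.search(question, top_k)
    text = llm(build_prompt(question, passages))
    # Record which passages were provided so the response can be audited later.
    return Answer(text=text, sources=[p.source_id for p in passages])
```

Note that `sources` lists what the model was shown, not what it actually relied on. Checking the cited ids in the generated text against this list is one possible audit step.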
Pattern 5 — Agent with Tools
Context: tasks requiring real-world actions (web search, code execution, API calls, file manipulation).
The LLM agent decides which tools to call, in what order, iterating until the objective is satisfied. This is the most powerful and most risky pattern.
Non-negotiable guardrails:
- Human confirmation for any irreversible action
- Execution sandbox for generated code
- Timeout and token/call budget per session
- Audit log of each tool called with its parameters
Tradeoffs: unpredictable latency (1 to N LLM calls), costs hard to bound, emergent behaviors difficult to test exhaustively.
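Three of the guardrails listed above (human confirmation, a call budget, and an audit log) can be sketched in a small agent loop. This is an illustration under assumptions: `decide` is a stand-in for the LLM's tool-selection step, `confirm` is a stand-in for a human approval prompt, and all names are hypothetical. Sandboxing, wall-clock timeouts, and token budgets are not covered here.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional, Tuple

ToolCall = Tuple[str, Dict[str, Any]]   # (tool name, keyword arguments)


class BudgetExceeded(RuntimeError):
    """Raised when the session requests more tool calls than its budget allows."""


@dataclass
class Tool:
    name: str
    run: Callable[..., Any]
    irreversible: bool = False   # irreversible tools require confirmation


def run_agent(decide: Callable[[List[Dict[str, Any]]], Optional[ToolCall]],
              tools: Dict[str, Tool],
              confirm: Callable[[str, Dict[str, Any]], bool],
              max_calls: int = 5) -> List[Dict[str, Any]]:
    """Run the loop until decide() returns None; return the audit log."""
    audit: List[Dict[str, Any]] = []
    while True:
        call = decide(audit)            # the model sees the history and picks the next step
        if call is None:
            return audit
        if len(audit) >= max_calls:     # denied calls also count toward the budget
            raise BudgetExceeded(f"more than {max_calls} tool calls requested")
        name, args = call
        tool = tools[name]
        entry: Dict[str, Any] = {"tool": name, "args": args}
        if tool.irreversible and not confirm(name, args):
            entry["status"] = "denied"
        else:
            entry["result"] = tool.run(**args)
            entry["status"] = "ok"
        audit.append(entry)             # every requested call is logged with its parameters
```

Counting denied calls against the budget keeps a model that repeatedly requests the same refused action from looping indefinitely.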
These five patterns are not mutually exclusive. A mature system often combines the proxy (observability), semantic cache (cost), router (optimization), and RAG (knowledge). Agents are reserved for flows where autonomy delivers irreplaceable business value.