LLM Platformization and Gateways
LLM platformization builds a gateway so every model call goes through one place with consistent auth, routing, quotas, observability, and safety.
1) Goals
- One entrypoint for all LLM calls (OpenAI / Claude / Gemini / self-hosted).
- Uniform auth, rate limits, logging, schema checks, and safe rollout levers.
- Swappable providers with canary, A/B, and kill switch baked in.
2) Core capabilities
- Auth & secrets: per-tenant/service keys; never expose provider keys to clients.
- Rate limiting & quotas: global + per-tenant; burst control; model allowlists.
- Request shaping: default headers/timeouts; caps on max tokens/temperature/input size.
- Validation: required fields, JSON schema for tool/function calls.
- Safety: content filters, PII scrubbing, prompt-injection filters.
- Routing: provider selection by policy (cost/region/latency); model aliases.
- Retries & backoff: idempotent POST with request IDs; 429/5xx backoff.
- Observability: structured logs (traceId, model, tokens, latency); metrics to dashboards.
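As a sketch of the rate-limiting capability above, a per-tenant token bucket can enforce both a sustained rate (rpm) and a burst cap; class and parameter names here are illustrative, not from a specific library:

```typescript
// Per-tenant token bucket: `capacity` caps bursts, `refillPerSec` sets the
// sustained rate. A full bucket is granted at start; each admitted request
// consumes one token, and tokens refill continuously over time.
class TokenBucket {
  private tokens: number;
  private last: number;

  constructor(private capacity: number, private refillPerSec: number, now = Date.now()) {
    this.tokens = capacity;
    this.last = now;
  }

  // Returns true if the call is admitted, false if it should get a 429.
  take(now = Date.now()): boolean {
    const elapsedSec = (now - this.last) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.refillPerSec);
    this.last = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}

// One bucket per tenant; rpm 120 means a sustained 2 requests/second.
const buckets = new Map<string, TokenBucket>();
function allow(tenantId: string, rpm = 120, burst = 20): boolean {
  let bucket = buckets.get(tenantId);
  if (!bucket) {
    bucket = new TokenBucket(burst, rpm / 60);
    buckets.set(tenantId, bucket);
  }
  return bucket.take();
}
```

A global bucket shared by all tenants can be layered on top of the per-tenant ones for the "global + per-tenant" limits mentioned above.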
3) Config shape (example)
models:
  gpt-5:
    provider: openai
    model: gpt-5
    timeout_ms: 60000
    limits:
      max_tokens: 2000
      rpm: 120
    safety:
      pii_scrub: true
      blocklist_categories: [self_harm, violence]
    rollout:
      traffic_percent: 5   # canary
      allow_tenants: [beta-team]
  claude-3-5-sonnet:
    provider: anthropic
    model: claude-3-5-sonnet-latest
    fallback: gpt-4o
routes:
  chat:
    model_alias: gpt-5
    primary: gpt-5
    fallback: claude-3-5-sonnet
    limits:
      max_prompt_chars: 32000
    tools_schema: schemas/tool-calls.json
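One way the rollout fields above (traffic_percent, allow_tenants) could drive routing is a stable hash split, so a given tenant always lands on the same side of the canary. The types and hash function below are assumptions for illustration:

```typescript
// Sketch of a canary routing policy matching the config shape above.
interface RouteConfig {
  primary: string;
  fallback: string;
  trafficPercent: number;  // canary share for `primary`, 0-100
  allowTenants: string[];  // tenants always eligible for the canary
}

// Stable bucket in [0, 100): the same tenant id always hashes the same way,
// so canary membership does not flap between requests.
function bucketOf(tenantId: string): number {
  let h = 0;
  for (const ch of tenantId) h = (h * 31 + ch.charCodeAt(0)) >>> 0;
  return h % 100;
}

function pickModel(route: RouteConfig, tenantId: string): string {
  if (route.allowTenants.includes(tenantId)) return route.primary;
  return bucketOf(tenantId) < route.trafficPercent ? route.primary : route.fallback;
}
```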
4) Minimal gateway handler (Node/TypeScript example)
import crypto from 'crypto';
import express from 'express';
import axios from 'axios';
import { validate } from './schema';
import { pickModelByPolicy } from './routing';
import { rateLimit } from './ratelimit';
import { scrubPII } from './safety';
import { config } from './config';

const app = express();
app.use(express.json({ limit: '2mb' }));

app.post('/v1/chat', rateLimit(), async (req, res) => {
  const traceId = req.header('x-trace-id') || crypto.randomUUID();
  const { messages, tools, model_alias } = req.body;
  try {
    validate(messages, 'schemas/messages.json');
    if (tools) validate(tools, 'schemas/tools.json');

    const route = config.routes.chat;
    const model = model_alias || route.model_alias;
    const target = pickModelByPolicy(model, route, req);

    // Clamp caller-supplied knobs to gateway-enforced limits.
    const payload = {
      model: target.model,
      messages: scrubPII(messages),
      tools,
      temperature: Math.min(req.body.temperature ?? 0.7, 1),
      max_tokens: Math.min(req.body.max_tokens ?? 512, target.limits?.max_tokens ?? 2048)
    };

    const resp = await axios.post(target.url, payload, {
      headers: { Authorization: `Bearer ${target.key}`, 'x-trace-id': traceId },
      timeout: target.timeout_ms || 60000
    });
    res.json({ traceId, provider: target.provider, data: resp.data });
  } catch (err: any) {
    const status = err.response?.status;
    const fallback = config.routes.chat.fallback;
    // On throttling or provider failure, retry once through the fallback model.
    // Guard against looping: skip if we are already on the fallback alias.
    if ((status === 429 || status >= 500) && fallback && model_alias !== fallback && config.models[fallback]) {
      req.body.model_alias = fallback;
      return (app as any).handle(req, res); // re-dispatch with the fallback alias
    }
    res.status(status || 500).json({ traceId, error: err.message });
  }
});

app.listen(8080, () => console.log('LLM gateway running on :8080'));
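The handler above only falls back after a failed call; the "retries & backoff" capability from section 2 can be factored into a generic helper with exponential backoff and full jitter. This is a sketch, not a provider SDK API; `isRetryable` and the delay schedule are illustrative choices:

```typescript
// Full-jitter backoff: delay is uniform in [0, base * 2^attempt), capped.
function backoffDelay(attempt: number, baseMs = 200, capMs = 5000): number {
  const exp = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(Math.random() * exp);
}

// Retry an async call while the error is retryable (e.g. 429/5xx).
// `sleep` is injectable so tests can run without real delays.
async function withRetries<T>(
  fn: () => Promise<T>,
  isRetryable: (err: any) => boolean,
  maxAttempts = 3,
  sleep: (ms: number) => Promise<void> = (ms) => new Promise((r) => setTimeout(r, ms))
): Promise<T> {
  let lastErr: any;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastErr = err;
      if (!isRetryable(err) || attempt === maxAttempts - 1) throw err;
      await sleep(backoffDelay(attempt));
    }
  }
  throw lastErr;
}
```

Only retry POSTs that carry an idempotency key or request ID, so a duplicate attempt cannot double-charge or double-execute downstream.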
5) Canary / A-B / Kill switch
- Canary: route 1-5% traffic to the new model/prompt; compare success/cost/latency before ramp.
- A/B: bucket by user/tenant; log bucket in traces for dashboards.
- Kill switch: mark a model or prompt version as disabled; route instantly to the last stable alias.
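The kill switch can be as small as a mutable disabled set consulted on every alias resolution; in practice it would live in the config store with hot reload (an assumption here, names are illustrative):

```typescript
// Aliases flipped off by an operator; checked before every provider call.
const disabled = new Set<string>();

function killSwitch(alias: string): void {
  disabled.add(alias);
}

// Resolve the requested alias, falling back to the last stable one instantly.
function resolveAlias(requested: string, lastStable: string): string {
  return disabled.has(requested) ? lastStable : requested;
}
```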
6) Caching & idempotency
- Cache deterministic calls (FAQ, structured extraction) with hashed input + model alias.
- Enforce idempotency keys on long/expensive calls; dedupe repeats.
- Protect downstream tools with circuit breakers and timeouts.
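The caching bullet above can be sketched as a hash of model alias plus normalized input; the normalization choices (sorting into tuples, trimming content) are assumptions:

```typescript
import { createHash } from 'crypto';

// Cache key = sha256(model alias + canonicalized messages), so identical
// deterministic requests map to the same entry regardless of whitespace.
function cacheKey(modelAlias: string, messages: { role: string; content: string }[]): string {
  const canonical = JSON.stringify(messages.map((m) => [m.role, m.content.trim()]));
  return createHash('sha256').update(modelAlias).update('\0').update(canonical).digest('hex');
}

// Dedupe: only invoke the underlying call on a cache miss.
const cache = new Map<string, unknown>();
function getOrCall<T>(key: string, call: () => T): T {
  if (cache.has(key)) return cache.get(key) as T;
  const result = call();
  cache.set(key, result);
  return result;
}
```

The same key doubles as an idempotency key for long/expensive calls; a production version would add TTLs and bound the cache size.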
7) Multi-tenant controls
- Tenant-scoped limits and model allowlists.
- Billing hooks: log tokens per tenant for showback/chargeback.
- Data isolation: avoid cross-tenant context mixing; filter by tenant_id.
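The allowlist and billing hooks above reduce to two small pieces of per-tenant state; the data shapes below are assumptions for illustration:

```typescript
// Tenant-scoped model allowlists, with a restrictive default.
const allowlists: Record<string, string[]> = {
  'beta-team': ['gpt-5', 'claude-3-5-sonnet'],
  'default': ['claude-3-5-sonnet']
};

function modelAllowed(tenantId: string, model: string): boolean {
  return (allowlists[tenantId] ?? allowlists['default']).includes(model);
}

// Accumulate tokens per tenant+model for showback/chargeback dashboards.
interface Usage { tenantId: string; model: string; promptTokens: number; completionTokens: number; }

const usage = new Map<string, number>();
function recordUsage(u: Usage): void {
  const key = `${u.tenantId}:${u.model}`;
  usage.set(key, (usage.get(key) ?? 0) + u.promptTokens + u.completionTokens);
}
```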
8) Minimal checklist
- Config store (Git/DB) + hot reload.
- Health checks per provider; auto-disable unhealthy routes.
- Structured logs + metrics (success rate, P95 latency, cost, 429/5xx).
- Audit: who changed routing/prompt/config and when.