Track LLM spend
Inbound middleware protects your API. But real LLM spend happens on outbound calls — and RateGuard rides the HTTP client your LLM SDK already uses. Not a proxy. Not a new service. No YAML, no Redis, no new attack surface.
```go
rg := rateguard.New(rateguard.Config{
	Preset:             "llm-heavy",
	TokenBudgetPerHour: 1_000_000,
})

// Wrap the HTTP client once, then hand the same client to every SDK.
client := rg.WrapClient(&http.Client{}) // or rg.Transport(next, opts)

// Each SDK ships its own `option` package; alias the imports in a real file.
// The variable is named oai so it does not shadow the openai package.
oai := openai.NewClient(option.WithHTTPClient(client))
claude := anthropic.NewClient(option.WithHTTPClient(client))
```

Every call through the wrapped client is budgeted, breaker-protected per provider, and metered with real token usage from the provider's own response, including streaming. 16 OpenAI-compatible hosts are detected out of the box, plus Anthropic, Gemini, Vertex, Azure OpenAI, AWS Bedrock, and self-hosted vLLM / llama.cpp.
Why count at the wire?
Framework-level token counting is unreliable today: LangChain reports incorrect counts in streaming mode (langchain#30429), CrewAI's token_usage disagrees with the provider's own numbers, and every aggregation layer re-implements usage parsing per provider. RateGuard counts below the framework, at the transport layer — the numbers are whatever the provider actually put in the response.
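To make "counting at the transport layer" concrete, here is a minimal sketch of an `http.RoundTripper` that reads the `usage` object from a non-streaming, OpenAI-style JSON response. The `meteringTransport` type and `meterOnce` helper are hypothetical names invented for illustration; this is not RateGuard's implementation, and it buffers the whole body, so it would not suit streaming responses.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// meteringTransport is a hypothetical illustration, not RateGuard's code.
// It wraps another RoundTripper and sums usage.total_tokens from responses.
type meteringTransport struct {
	next   http.RoundTripper
	tokens atomic.Int64
}

func (m *meteringTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	resp, err := m.next.RoundTrip(req)
	if err != nil {
		return nil, err
	}
	// Buffer the body so it can be inspected (non-streaming responses only).
	body, err := io.ReadAll(resp.Body)
	resp.Body.Close()
	if err != nil {
		return nil, err
	}
	// Hand the caller an identical, unread body.
	resp.Body = io.NopCloser(bytes.NewReader(body))

	var payload struct {
		Usage *struct {
			TotalTokens int64 `json:"total_tokens"`
		} `json:"usage"`
	}
	if json.Unmarshal(body, &payload) == nil && payload.Usage != nil {
		m.tokens.Add(payload.Usage.TotalTokens)
	}
	return resp, nil
}

// meterOnce sends one request to a fake provider and returns the metered
// token count plus the body as seen by the caller.
func meterOnce() (int64, string) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		io.WriteString(w, `{"usage":{"prompt_tokens":12,"completion_tokens":30,"total_tokens":42}}`)
	}))
	defer srv.Close()

	mt := &meteringTransport{next: http.DefaultTransport}
	client := &http.Client{Transport: mt}
	resp, err := client.Get(srv.URL)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	return mt.tokens.Load(), string(body)
}

func main() {
	tokens, body := meterOnce()
	fmt.Println(tokens, len(body) > 0) // prints: 42 true
}
```

Because the count is taken from the response bytes themselves, any framework sitting above the client sees the same body and cannot change the number that was recorded.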
Streaming handled correctly
usage: null intermediate events and Anthropic's split message_start/message_delta shapes are both handled; these are the two places naive implementations break.
enforce vs observe
| Mode | Behavior |
|---|---|
| enforce | Default. Exhausted budgets / open breakers synthesize provider-native 429/503 responses with Retry-After and X-RateGuard-Synthesized: true; your SDK's retry logic handles them natively. |
| observe | Never blocks. Only meters; ideal for a first rollout week. |
Budget scope is {tenant}:{provider}:{model}:outbound with reserve-then-commit accounting. Calls pass while any budget remains; the final call may overshoot (actual usage is only known post-response), then everything blocks until the window rolls. See Token budgets.
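The accounting described above can be sketched as a small in-memory ledger. This is one possible reading of "reserve-then-commit", not RateGuard's implementation: the `budget` type, its methods, and the idea of reserving a caller-supplied estimate are all assumptions made for illustration. A call is admitted while used plus reserved tokens are below the limit, and the real usage replaces the estimate once the response is in.

```go
package main

import (
	"fmt"
	"sync"
)

// budget is a hypothetical per-scope token ledger for a single window.
type budget struct {
	mu       sync.Mutex
	limit    int64
	used     map[string]int64
	reserved map[string]int64
}

func newBudget(limit int64) *budget {
	return &budget{limit: limit, used: map[string]int64{}, reserved: map[string]int64{}}
}

// scopeKey builds the {tenant}:{provider}:{model}:outbound scope.
func scopeKey(tenant, provider, model string) string {
	return tenant + ":" + provider + ":" + model + ":outbound"
}

// Reserve admits the call while any budget remains and holds the estimate.
func (b *budget) Reserve(key string, estimate int64) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.used[key]+b.reserved[key] >= b.limit {
		return false
	}
	b.reserved[key] += estimate
	return true
}

// Commit releases the estimate and records the provider-reported usage,
// which may push the scope past its limit.
func (b *budget) Commit(key string, estimate, actual int64) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.reserved[key] -= estimate
	b.used[key] += actual
}

// Roll starts a fresh window.
func (b *budget) Roll() {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.used = map[string]int64{}
	b.reserved = map[string]int64{}
}

func main() {
	b := newBudget(1000)
	k := scopeKey("acme", "openai", "gpt-4o") // example values only

	fmt.Println(b.Reserve(k, 100)) // true: nothing used yet
	b.Commit(k, 100, 900)
	fmt.Println(b.Reserve(k, 100)) // true: 900 < 1000
	b.Commit(k, 100, 400)          // overshoot: 1300 used
	fmt.Println(b.Reserve(k, 100)) // false until the window rolls
	b.Roll()
	fmt.Println(b.Reserve(k, 100)) // true again
}
```

Because the key includes provider and model, exhausting one scope in this sketch leaves other models and providers for the same tenant unaffected.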
What you get for free
Per-provider circuit breakers (an OpenAI outage doesn't trip DeepSeek), fallback chains across OpenAI-compatible providers, Prometheus counters for calls / fallbacks / tokens, and OTel gen_ai.* spans with automatic cost estimation across 14 priced models — see Observability.
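To show why keying breakers by provider isolates outages, here is a deliberately simplified sketch. The `breakers` type, the consecutive-failure threshold, and the host names are assumptions for illustration; a real breaker would also need a cooldown and a half-open state, which are omitted here, and this is not RateGuard's implementation.

```go
package main

import (
	"fmt"
	"sync"
)

// breakers is a hypothetical registry with one failure counter per provider.
type breakers struct {
	mu        sync.Mutex
	threshold int
	failures  map[string]int
}

func newBreakers(threshold int) *breakers {
	return &breakers{threshold: threshold, failures: map[string]int{}}
}

// Record notes the outcome of a call; a success resets that provider's count.
func (b *breakers) Record(provider string, ok bool) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if ok {
		b.failures[provider] = 0
		return
	}
	b.failures[provider]++
}

// Open reports whether calls to the provider should currently be rejected.
func (b *breakers) Open(provider string) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.failures[provider] >= b.threshold
}

func main() {
	b := newBreakers(3)
	for i := 0; i < 3; i++ {
		b.Record("api.openai.com", false) // simulated outage
	}
	b.Record("api.deepseek.com", true)

	fmt.Println(b.Open("api.openai.com"))   // true
	fmt.Println(b.Open("api.deepseek.com")) // false
}
```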