Track LLM spend

Inbound middleware protects your API. But real LLM spend happens on outbound calls — and RateGuard rides the HTTP client your LLM SDK already uses. Not a proxy. Not a new service. No YAML, no Redis, no new attack surface.

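rg := rateguard.New(rateguard.Config{
    Preset:             "llm-heavy",
    TokenBudgetPerHour: 1_000_000,
})

// Wrap one HTTP client and hand it to every SDK.
client := rg.WrapClient(&http.Client{})   // or rg.Transport(next, opts)
// Each SDK ships its own option package; alias one import if both are used in the same file.
oai := openai.NewClient(option.WithHTTPClient(client)) // named oai so the openai package is not shadowed
claude := anthropic.NewClient(option.WithHTTPClient(client))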

Every call through the wrapped client is budgeted, breaker-protected per provider, and metered with real token usage from the provider's own response — including streaming. 16 OpenAI-compatible hosts are detected out of the box, plus Anthropic, Gemini, Vertex, Azure OpenAI, AWS Bedrock, and self-hosted vLLM / llama.cpp.

Why count at the wire?

Framework-level token counting is unreliable today: LangChain reports incorrect counts in streaming mode (langchain#30429), CrewAI's token_usage disagrees with the provider's own numbers, and every aggregation layer re-implements usage parsing per provider. RateGuard counts below the framework, at the transport layer — the numbers are whatever the provider actually put in the response.
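
To make "counting at the transport layer" concrete, here is a minimal sketch of an `http.RoundTripper` that reads the provider's own `usage` object from a non-streaming response. It is not RateGuard's implementation: `meteringTransport`, `parseUsage`, and the `record` callback are hypothetical names, and the sketch assumes an OpenAI-style body with `prompt_tokens`, `completion_tokens`, and `total_tokens`.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// usage mirrors the OpenAI-style "usage" object of a non-streaming response.
type usage struct {
	PromptTokens     int `json:"prompt_tokens"`
	CompletionTokens int `json:"completion_tokens"`
	TotalTokens      int `json:"total_tokens"`
}

// parseUsage pulls the usage object out of a response body, if one is present.
func parseUsage(body []byte) (usage, bool) {
	var envelope struct {
		Usage *usage `json:"usage"`
	}
	if err := json.Unmarshal(body, &envelope); err != nil || envelope.Usage == nil {
		return usage{}, false
	}
	return *envelope.Usage, true
}

// meteringTransport is a hypothetical RoundTripper, not RateGuard's own type.
type meteringTransport struct {
	next   http.RoundTripper
	record func(host string, u usage)
}

func (t *meteringTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	resp, err := t.next.RoundTrip(req)
	if err != nil {
		return nil, err
	}
	// Buffer the body, then hand the caller an identical copy.
	// A real implementation would bound this read.
	body, readErr := io.ReadAll(resp.Body)
	resp.Body.Close()
	if readErr != nil {
		return nil, readErr
	}
	resp.Body = io.NopCloser(bytes.NewReader(body))
	if u, ok := parseUsage(body); ok {
		t.record(req.URL.Host, u)
	}
	return resp, nil
}

func main() {
	// Stand-in for a provider endpoint.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, `{"usage":{"prompt_tokens":12,"completion_tokens":30,"total_tokens":42}}`)
	}))
	defer srv.Close()

	client := &http.Client{Transport: &meteringTransport{
		next: http.DefaultTransport,
		record: func(host string, u usage) {
			fmt.Printf("%s used %d tokens\n", host, u.TotalTokens)
		},
	}}
	resp, err := client.Get(srv.URL)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	io.Copy(io.Discard, resp.Body)
}
```

Because the SDK only ever sees the `http.Client`, nothing above the transport needs to know that metering is happening.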

Streaming handled correctly

SSE bytes pass through untouched while usage is extracted from a bounded side-scan. OpenAI's usage: null intermediate events and Anthropic's split message_start/message_delta shapes are both handled — the two places naive implementations break.
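
The side-scan idea can be sketched with `io.TeeReader`: the caller reads the original bytes while a copy is fed to a scanner that looks only at `data:` lines. This is an illustration under stated assumptions, not RateGuard's code. `usageScanner`, `rawUsage`, and `maxLineBytes` are hypothetical names, and the sketch assumes that later events carry cumulative counts, so the latest value seen wins.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// rawUsage covers both naming schemes: OpenAI (prompt/completion) and
// Anthropic (input/output). Pointers distinguish "absent" from zero.
type rawUsage struct {
	PromptTokens     *int `json:"prompt_tokens"`
	CompletionTokens *int `json:"completion_tokens"`
	InputTokens      *int `json:"input_tokens"`
	OutputTokens     *int `json:"output_tokens"`
}

// usageScanner is a hypothetical io.Writer that watches a copy of the stream.
type usageScanner struct {
	buf           []byte // partial line carried between writes
	Input, Output int
}

const maxLineBytes = 64 << 10 // illustrative bound on buffered data

func (s *usageScanner) Write(p []byte) (int, error) {
	s.buf = append(s.buf, p...)
	for {
		i := bytes.IndexByte(s.buf, '\n')
		if i < 0 {
			break
		}
		s.scanLine(s.buf[:i])
		s.buf = s.buf[i+1:]
	}
	if len(s.buf) > maxLineBytes {
		s.buf = nil // give up on an oversized line instead of growing forever
	}
	return len(p), nil
}

func (s *usageScanner) scanLine(line []byte) {
	line = bytes.TrimSpace(line)
	if !bytes.HasPrefix(line, []byte("data:")) {
		return
	}
	var ev struct {
		Usage   *rawUsage `json:"usage"` // null on OpenAI's intermediate chunks
		Message *struct {
			Usage *rawUsage `json:"usage"` // Anthropic message_start nests it here
		} `json:"message"`
	}
	if json.Unmarshal(line[len("data:"):], &ev) != nil {
		return // not JSON, e.g. "[DONE]"
	}
	if ev.Message != nil {
		s.apply(ev.Message.Usage)
	}
	s.apply(ev.Usage)
}

// apply overwrites rather than adds, on the assumption that counts are cumulative.
func (s *usageScanner) apply(u *rawUsage) {
	if u == nil {
		return
	}
	for _, v := range []*int{u.PromptTokens, u.InputTokens} {
		if v != nil {
			s.Input = *v
		}
	}
	for _, v := range []*int{u.CompletionTokens, u.OutputTokens} {
		if v != nil {
			s.Output = *v
		}
	}
}

func main() {
	stream := "data: {\"choices\":[{\"delta\":{\"content\":\"hi\"}}],\"usage\":null}\n\n" +
		"data: {\"choices\":[],\"usage\":{\"prompt_tokens\":9,\"completion_tokens\":12}}\n\n" +
		"data: [DONE]\n\n"
	var scan usageScanner
	// TeeReader copies every byte to the scanner while the caller reads the original.
	out, _ := io.ReadAll(io.TeeReader(strings.NewReader(stream), &scan))
	fmt.Println(string(out) == stream, scan.Input, scan.Output)
}
```

The scanner never writes to the stream the caller consumes, which is what keeps the SSE bytes untouched.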

enforce vs observe

| Mode | Behavior |
| --- | --- |
| `enforce` | Default. Exhausted budgets / open breakers synthesize provider-native 429/503 responses with `Retry-After` and `X-RateGuard-Synthesized: true`, so your SDK's retry logic handles them natively. |
| `observe` | Never blocks. Only meters, which makes it ideal for a first rollout week. |
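
As a rough picture of what a synthesized rejection looks like from the SDK's side, the sketch below builds a 429 `*http.Response` by hand and checks the marker header. Only the `Retry-After` and `X-RateGuard-Synthesized` headers come from this guide; `synthesize429`, `isSynthesized`, and the JSON error body are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"strconv"
	"strings"
)

// synthesize429 is a hypothetical helper: it builds a response locally,
// without any request reaching the provider.
func synthesize429(req *http.Request, retryAfterSec int) *http.Response {
	// Illustrative body loosely shaped like an OpenAI-style error envelope.
	body := `{"error":{"message":"token budget exhausted","type":"rate_limit_error"}}`
	h := http.Header{}
	h.Set("Content-Type", "application/json")
	h.Set("Retry-After", strconv.Itoa(retryAfterSec))
	h.Set("X-RateGuard-Synthesized", "true")
	return &http.Response{
		Status:        "429 Too Many Requests",
		StatusCode:    http.StatusTooManyRequests,
		Proto:         "HTTP/1.1",
		ProtoMajor:    1,
		ProtoMinor:    1,
		Header:        h,
		Body:          io.NopCloser(strings.NewReader(body)),
		ContentLength: int64(len(body)),
		Request:       req,
	}
}

// isSynthesized lets logging or alerting tell local rejections from real provider errors.
func isSynthesized(resp *http.Response) bool {
	return resp.Header.Get("X-RateGuard-Synthesized") == "true"
}

func main() {
	req, _ := http.NewRequest(http.MethodPost, "https://api.openai.com/v1/chat/completions", nil)
	resp := synthesize429(req, 90)
	defer resp.Body.Close()
	fmt.Println(resp.Status, resp.Header.Get("Retry-After"), isSynthesized(resp))
}
```

Checking the marker header is mainly useful in dashboards and logs, where a locally synthesized 429 should not be counted as a provider-side rate limit.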

Budgets are scoped by the key {tenant}:{provider}:{model}:outbound and use reserve-then-commit accounting. A call is admitted while any budget remains. Because actual usage is known only after the response arrives, the last admitted call may overshoot the limit; after that, further calls against that budget are blocked until the window rolls over. See Token budgets.
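
The accounting described above can be sketched in a few lines. This is one plausible reading of reserve-then-commit, not RateGuard's actual implementation: `budget`, `reserve`, `commit`, and `rollWindow` are hypothetical names, and reserving an up-front estimate that is later corrected to the real usage is an assumption.

```go
package main

import (
	"fmt"
	"sync"
)

// scopeKey builds the {tenant}:{provider}:{model}:outbound key from the guide.
func scopeKey(tenant, provider, model string) string {
	return tenant + ":" + provider + ":" + model + ":outbound"
}

// budget is a hypothetical in-memory token budget for one window.
type budget struct {
	mu    sync.Mutex
	limit int
	used  map[string]int // reserved plus committed tokens per scope
}

func newBudget(limit int) *budget {
	return &budget{limit: limit, used: map[string]int{}}
}

// reserve admits the call while any budget remains and holds an estimate.
func (b *budget) reserve(key string, estimate int) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.used[key] >= b.limit {
		return false
	}
	b.used[key] += estimate
	return true
}

// commit swaps the estimate for the usage the provider actually reported.
// The result may exceed the limit: that is the overshoot on the final call.
func (b *budget) commit(key string, estimate, actual int) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.used[key] += actual - estimate
}

// rollWindow clears all counters when the window rolls over.
func (b *budget) rollWindow() {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.used = map[string]int{}
}

func main() {
	b := newBudget(1000)
	key := scopeKey("acme", "openai", "gpt-4o")

	fmt.Println(b.reserve(key, 100)) // admitted: budget remains
	b.commit(key, 100, 1200)         // real usage overshoots the limit
	fmt.Println(b.reserve(key, 100)) // blocked until the window rolls
	b.rollWindow()
	fmt.Println(b.reserve(key, 100)) // admitted again
}
```

Keeping a separate counter per scope key is what lets one tenant or model exhaust its budget without affecting the others.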

What you get for free

Per-provider circuit breakers (an OpenAI outage doesn't trip DeepSeek), fallback chains across OpenAI-compatible providers, Prometheus counters for calls / fallbacks / tokens, and OTel gen_ai.* spans with automatic cost estimation across 14 priced models — see Observability.
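
Per-provider isolation comes down to keeping breaker state keyed by provider. The sketch below keys it by host and opens after a run of consecutive failures. It is a simplified illustration: `breakerSet`, the failure threshold, and `defaultCooldown` are assumptions, and RateGuard's real breaker policy may differ.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

const defaultCooldown = 30 * time.Second // illustrative value

type hostState struct {
	failures  int       // consecutive failures seen so far
	openUntil time.Time // zero value means the breaker is closed
}

// breakerSet is a hypothetical collection of breakers, one per provider host.
type breakerSet struct {
	mu        sync.Mutex
	threshold int
	cooldown  time.Duration
	state     map[string]*hostState
}

func newBreakerSet(threshold int, cooldown time.Duration) *breakerSet {
	return &breakerSet{threshold: threshold, cooldown: cooldown, state: map[string]*hostState{}}
}

// allow reports whether calls to host may proceed right now.
func (b *breakerSet) allow(host string) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	s, ok := b.state[host]
	return !ok || !time.Now().Before(s.openUntil)
}

// record updates only the named host; other providers are untouched.
func (b *breakerSet) record(host string, success bool) {
	b.mu.Lock()
	defer b.mu.Unlock()
	s := b.state[host]
	if s == nil {
		s = &hostState{}
		b.state[host] = s
	}
	if success {
		s.failures = 0
		return
	}
	s.failures++
	if s.failures >= b.threshold {
		s.openUntil = time.Now().Add(b.cooldown)
		s.failures = 0
	}
}

func main() {
	b := newBreakerSet(3, defaultCooldown)
	for i := 0; i < 3; i++ {
		b.record("api.openai.com", false)
	}
	fmt.Println("openai allowed:", b.allow("api.openai.com"))
	fmt.Println("deepseek allowed:", b.allow("api.deepseek.com"))
}
```

With state held per host, a fallback chain can simply skip any provider whose breaker is open and try the next one.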