Token budgets
Rate limits count requests; budgets count tokens — the unit your provider bill is written in. RateGuard tracks hourly, daily, and monthly windows simultaneously, on both inbound requests and outbound LLM calls.
```go
rg := rateguard.New(rateguard.Config{
	Preset:              "llm-heavy",
	TokenBudgetPerHour:  250_000,
	TokenBudgetPerDay:   2_500_000,
	TokenBudgetPerMonth: 250_000_000,
	TokenBudgetMode:     rateguard.SoftStop, // or HardStop
})
```

Hard-stop vs soft-stop
| Mode | When exhausted | Best for |
|---|---|---|
| hard-stop | Reject immediately (429) | Production APIs, fragile upstreams, cost ceilings that must hold |
| soft-stop | Queue instead of rejecting | Streaming and agent workloads where a mid-conversation 429 is worse than a short wait |
Reserve → commit accounting
A call's true cost is only known after the response arrives. RateGuard reserves an estimate up front and commits actual usage after — so parallel calls can't collectively blow through a nearly-empty budget.
Keep concurrency high under hard-stop
Set EstimatedTokensPerRequest (Go) to bound each reservation to a realistic estimate, so that many calls can be in flight at once.

Outbound budget scope is {tenant}:{provider}:{model}:outbound. Calls pass while any budget remains. Because actual usage only arrives after the response, the final call may overshoot the budget; after that, every call blocks until the window rolls.
Let agents check first
The get_token_budget MCP tool answers "how much is left — and would estimated_tokens fit?" without consuming anything. See Agents & MCP.