Token budgets

Rate limits count requests; budgets count tokens — the unit your provider bill is written in. RateGuard tracks hourly, daily, and monthly windows simultaneously, on both inbound requests and outbound LLM calls.

rg := rateguard.New(rateguard.Config{
    Preset:              "llm-heavy",
    TokenBudgetPerHour:  250_000,
    TokenBudgetPerDay:   2_500_000,
    TokenBudgetPerMonth: 250_000_000,
    TokenBudgetMode:     rateguard.SoftStop, // or HardStop
})

Hard-stop vs soft-stop

Mode	When exhausted	Best for
hard-stop	Reject immediately (429)	Production APIs, fragile upstreams, cost ceilings that must hold
soft-stop	Queue instead of rejecting	Streaming and agent workloads where a mid-conversation 429 is worse than a short wait

Reserve → commit accounting

A call's true cost is only known after the response arrives. RateGuard reserves an estimate up front and commits actual usage after — so parallel calls can't collectively blow through a nearly-empty budget.
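A minimal sketch of the reserve-then-commit idea, assuming a single in-memory budget. The `ledger` type and its methods are hypothetical and are not RateGuard's API; the point is only that held estimates count against the budget while calls are in flight.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"sync/atomic"
)

var errInsufficient = errors.New("insufficient budget")

// ledger is a hypothetical in-memory budget with reserve/commit accounting.
type ledger struct {
	mu       sync.Mutex
	limit    int64
	used     int64 // committed actual usage
	reserved int64 // estimates held by in-flight calls
}

// reserve holds an estimate. It fails if committed usage plus all held
// estimates plus this one would exceed the limit.
func (l *ledger) reserve(estimate int64) error {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.used+l.reserved+estimate > l.limit {
		return errInsufficient
	}
	l.reserved += estimate
	return nil
}

// commit releases the held estimate and records what the call actually used.
func (l *ledger) commit(estimate, actual int64) {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.reserved -= estimate
	l.used += actual
}

// remaining reports what is neither committed nor held.
func (l *ledger) remaining() int64 {
	l.mu.Lock()
	defer l.mu.Unlock()
	return l.limit - l.used - l.reserved
}

func main() {
	l := &ledger{limit: 1000}

	// Three parallel calls each estimate 400 tokens; only two can hold
	// a reservation at once, whichever order the goroutines run in.
	var admitted int64
	var wg sync.WaitGroup
	for i := 0; i < 3; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if l.reserve(400) == nil {
				atomic.AddInt64(&admitted, 1)
			}
		}()
	}
	wg.Wait()
	fmt.Println("admitted:", atomic.LoadInt64(&admitted)) // admitted: 2

	// Responses arrive: one call used less than estimated, one used more.
	l.commit(400, 350)
	l.commit(400, 420)
	fmt.Println("remaining:", l.remaining()) // remaining: 230
}
```

Without the reservation step, all three calls would see 1000 tokens available and proceed, and the overrun would only show up after the responses arrived.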

Keep concurrency high under hard-stop

By default, a hard-stop reservation holds all remaining budget until the response lands, which effectively serializes calls. Set EstimatedTokensPerRequest (Go) to bound each reservation to a realistic estimate so that many calls can be in flight at once.
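The effect on concurrency is simple arithmetic, sketched below. The helper names are hypothetical, and treating a zero estimate as "hold everything that remains" is an assumption about the default described above, not RateGuard's code.

```go
package main

import "fmt"

// reservationSize returns how much budget a single call holds.
// estimate <= 0 models the default: hold everything that remains.
// An estimate larger than what remains holds only what is left.
func reservationSize(remaining, estimate int64) int64 {
	if estimate <= 0 || estimate > remaining {
		return remaining
	}
	return estimate
}

// admitted counts how many of the simultaneous callers can take a
// reservation before the budget is fully held.
func admitted(remaining, estimate int64, callers int) int {
	n := 0
	for i := 0; i < callers && remaining > 0; i++ {
		remaining -= reservationSize(remaining, estimate)
		n++
	}
	return n
}

func main() {
	fmt.Println(admitted(100_000, 0, 10))     // 1: the first call holds the whole budget
	fmt.Println(admitted(100_000, 4_000, 10)) // 10: each call holds only 4,000 tokens
	fmt.Println(admitted(10_000, 4_000, 10))  // 3: 4,000 + 4,000 + the last 2,000
}
```

A good estimate is close to typical prompt plus completion size for the workload; too low and calls are admitted that the budget cannot actually cover, too high and concurrency drops again.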

Outbound budget scope is {tenant}:{provider}:{model}:outbound. Calls pass while any budget remains; the final call may overshoot (actual usage arrives post-response). After that, further calls are stopped (rejected under hard-stop, queued under soft-stop) until the window rolls over.
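As a rough illustration of the scope key and the overshoot behavior, assuming a plain in-memory map: the `outbound` type, its methods, and the tenant, provider, and model values are made up for the example; only the key layout comes from this guide.

```go
package main

import "fmt"

// scopeKey builds the {tenant}:{provider}:{model}:outbound key.
func scopeKey(tenant, provider, model string) string {
	return fmt.Sprintf("%s:%s:%s:outbound", tenant, provider, model)
}

// outbound is a hypothetical per-scope budget store.
type outbound struct {
	remaining map[string]int64
}

// admit lets a call through while any budget remains for its scope.
func (o *outbound) admit(key string) bool {
	return o.remaining[key] > 0
}

// record subtracts actual usage once the response has arrived.
// The balance may go negative: that is the overshoot.
func (o *outbound) record(key string, actual int64) {
	o.remaining[key] -= actual
}

func main() {
	key := scopeKey("acme", "openai", "gpt-4o") // example values only
	fmt.Println(key)                            // acme:openai:gpt-4o:outbound

	o := &outbound{remaining: map[string]int64{key: 500}}
	fmt.Println(o.admit(key)) // true: 500 tokens remain
	o.record(key, 800)        // the call actually used 800
	fmt.Println(o.remaining[key]) // -300: overshoot by the final call
	fmt.Println(o.admit(key))     // false until the window rolls
}
```

Because the key includes provider and model, exhausting one model's budget leaves a tenant's other models unaffected.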

Let agents check first

The get_token_budget MCP tool answers "how much is left — and would estimated_tokens fit?" without consuming anything. See Agents & MCP.
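The semantics of such a check can be modeled locally as a read-only query. This sketch does not call the MCP tool and does not reflect its actual request or response schema; `budgetStatus` and `check` are hypothetical names.

```go
package main

import "fmt"

// budgetStatus is a hypothetical result shape, not the tool's real schema.
type budgetStatus struct {
	Remaining int64
	Fits      bool
}

// budget is a hypothetical snapshot of one window.
type budget struct {
	limit, used int64
}

// check answers "how much is left, and would estimatedTokens fit?".
// It has a value receiver, so it cannot modify the budget it inspects.
func (b budget) check(estimatedTokens int64) budgetStatus {
	remaining := b.limit - b.used
	return budgetStatus{Remaining: remaining, Fits: estimatedTokens <= remaining}
}

func main() {
	b := budget{limit: 250_000, used: 240_000}
	fmt.Printf("%+v\n", b.check(8_000))  // {Remaining:10000 Fits:true}
	fmt.Printf("%+v\n", b.check(12_000)) // {Remaining:10000 Fits:false}
	fmt.Println(b.used)                  // 240000: nothing was consumed
}
```

A check like this is advisory: another caller may consume the budget between the check and the real call, so the call itself can still be rejected or queued.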