# RateGuard — AI-Native Rate Limiting SDK
> Go, Node.js, Python middleware. Outbound LLM spend tracking (wrap your SDK's HTTP client), pre-flight rate limit queries for AI agents (MCP), token budgets, GenAI OTel, circuit breakers, provider fallback, loop detection, guardrails. No proxy needed.

## Outbound spend tracking (the headline)
Wrap the HTTP client your LLM SDK already uses. Real token usage (read from both JSON and SSE streaming responses) is metered into budgets, with per-provider circuit breakers and OpenAI-compatible fallback:
- Go: `client := rg.WrapClient(&http.Client{})` → `openai.NewClient(option.WithHTTPClient(client))`
- Node: `new OpenAI({ fetch: rg.wrapFetch() })`
- Python: `OpenAI(http_client=rg.wrap_httpx_client())` (httpx is imported lazily, so the SDK itself keeps zero required runtime dependencies)

Detects 16 OpenAI-compatible hosts plus Anthropic, Gemini, Vertex, Azure OpenAI, AWS Bedrock, and self-hosted vLLM/llama.cpp. Enforce mode synthesizes provider-native 429/503 responses, so the SDK's existing retry logic handles them unchanged; observe mode only meters.
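
To make "real token usage (JSON + SSE streaming)" concrete, here is a minimal sketch of how usage could be read from an OpenAI-style response. It is not RateGuard's implementation: the helper names `usage_from_json` and `usage_from_sse` are hypothetical, and it assumes the provider returns a `usage` object in the JSON body and, for streams, in a final chunk before `[DONE]` (which OpenAI sends when `stream_options.include_usage` is set).

```python
import json
from typing import Iterable, Optional


def usage_from_json(body: bytes) -> Optional[dict]:
    """Return the `usage` object of a non-streaming response, or None."""
    try:
        payload = json.loads(body)
    except ValueError:  # covers invalid JSON and undecodable bytes
        return None
    usage = payload.get("usage") if isinstance(payload, dict) else None
    return usage if isinstance(usage, dict) else None


def usage_from_sse(lines: Iterable[str]) -> Optional[dict]:
    """Scan SSE lines and return the last non-null `usage` object seen."""
    found = None
    for line in lines:
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        try:
            chunk = json.loads(data)
        except ValueError:
            continue  # ignore malformed chunks instead of failing the stream
        if isinstance(chunk, dict) and isinstance(chunk.get("usage"), dict):
            found = chunk["usage"]
    return found
```

A wrapped client would pass the response through to the caller untouched and feed the extracted counts into whichever budget applies to the request.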

## Quick Start

### Go
```go
import (
	"context"
	"net/http"
	"os"
	rateguard "github.com/varbees/rateguard/packages/sdk-go"
)
rg := rateguard.New(rateguard.Config{Preset: "streaming-llm"})
http.Handle("/metrics", rg.Metrics())
// Expose limits to MCP clients (Claude Code, Cursor, custom agents):
_ = rg.ServeMCP(context.Background(), os.Stdin, os.Stdout)
```

### Node.js
```ts
import { RateGuard } from '@varbees/rateguard-node';
const rg = new RateGuard({ preset: 'streaming-llm' });
const tools = rg.mcpTools(); // 5 MCP tools, peek semantics
```

### Python
```python
from rateguard import RateGuard
rg = RateGuard(preset="streaming-llm")
tools = rg.mcp_tools()  # 5 MCP tools, peek semantics
```

## Architecture
RateGuard is MIDDLEWARE (runs inside your app), not a proxy or gateway. Three SDKs, identical behavior across Go/Node/Python.

## Core Algorithm (all 3 SDKs)
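Token Bucket (RFC standard, same as Kong/Envoy/AWS): `tokens = min(burst, tokens + elapsed × rps)`, where `elapsed` is the time in seconds since the last refill, `rps` is the refill rate, and `burst` is the bucket capacity.
Every limiter also implements Peek, a pre-flight query that reports whether a call would be admitted without consuming any tokens.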
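
The refill formula and the peek/consume split can be sketched as follows. This is an illustrative, single-threaded sketch rather than the SDK's code: the class name `TokenBucket`, its method names, and the injectable `clock` parameter are assumptions made for the example.

```python
import time


class TokenBucket:
    """Token bucket: tokens = min(burst, tokens + elapsed * rps)."""

    def __init__(self, rps, burst, clock=time.monotonic):
        self.rps = rps            # refill rate, tokens per second
        self.burst = burst        # bucket capacity
        self.clock = clock        # injectable so tests can control time
        self.tokens = float(burst)
        self.last = clock()

    def _level(self, now):
        # Apply the refill formula without touching stored state.
        elapsed = max(0.0, now - self.last)
        return min(self.burst, self.tokens + elapsed * self.rps)

    def peek(self, cost=1.0):
        """Would a call of this cost be admitted now? Consumes nothing."""
        return self._level(self.clock()) >= cost

    def allow(self, cost=1.0):
        """Refill, then consume `cost` tokens if enough are available."""
        now = self.clock()
        level = self._level(now)
        self.last = now
        if level >= cost:
            self.tokens = level - cost
            return True
        self.tokens = level
        return False
```

Because `peek` only computes the current level, an agent can call it as often as it likes before deciding to spend a token with `allow`. A production limiter would also need a lock around `allow` for concurrent callers.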

## 8 Presets
| Preset | RPS | Tokens/hr | Mode |
|---|---|---|---|
| dev | 10 | 1K | hard-stop |
| standard | 100 | 10K | hard-stop |
| high-throughput | 1000 | 100K | hard-stop |
| streaming-llm | 200 | 500K | soft-stop |
| agent-orchestrator | 500 | 1M | soft-stop |
| llm-heavy | 500 | 250K | soft-stop |
| mcp-server | 30 | 50K | hard-stop |
| strict-upstream-protection | 50 | 5K | hard-stop |
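
The Mode column can be illustrated with a small budget sketch. The semantics here are an assumption, since this summary does not define them: hard-stop is taken to reject a spend that would exceed the window's budget, and soft-stop to admit it while flagging the overage. The class `TokenBudget`, its fixed one-hour window, and the `(allowed, over_budget)` return shape are all hypothetical.

```python
import time


class TokenBudget:
    """Fixed-window token budget with hard-stop or soft-stop behavior."""

    def __init__(self, limit, mode="hard-stop", window_s=3600, clock=time.time):
        if mode not in ("hard-stop", "soft-stop"):
            raise ValueError(f"unknown mode: {mode}")
        self.limit = limit
        self.mode = mode
        self.window_s = window_s
        self.clock = clock
        self.window_start = clock()
        self.used = 0

    def _roll(self):
        # Start a fresh window once the current one has elapsed.
        now = self.clock()
        if now - self.window_start >= self.window_s:
            self.window_start = now
            self.used = 0

    def remaining(self):
        """Tokens left in the current window; never consumes budget."""
        self._roll()
        return max(0, self.limit - self.used)

    def spend(self, tokens):
        """Return (allowed, over_budget) for a spend of `tokens`."""
        self._roll()
        over = self.used + tokens > self.limit
        if over and self.mode == "hard-stop":
            return False, True   # assumed hard-stop: reject, record nothing
        self.used += tokens
        return True, over        # assumed soft-stop: admit, flag the overage


# Limits taken from the table: dev is 1K/hr hard-stop,
# streaming-llm is 500K/hr soft-stop.
dev = TokenBudget(1_000, "hard-stop")
streaming = TokenBudget(500_000, "soft-stop")
```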

## Features
- MCP tools (5, all SDKs): get_rate_limit_state, get_token_budget, get_circuit_breaker_state, check_loop, list_limits — peek semantics, querying never consumes budget
- MCP stdio server (Go): zero-dependency JSON-RPC 2.0, plugs into any MCP client config
- Loop detection: SHA-256 payload fingerprinting + max sequence depth, LRU-bounded; wired into Go middleware via X-Sequence-Depth header (429 loop_detected)
- Token budgets (hour/day/month, hard-stop or soft-stop); estimate-based reservations (Go: EstimatedTokensPerRequest) keep concurrency high under hard-stop
- Circuit breakers (closed → open → half-open)
- GenAI OTel per semconv: span name "{operation} {model}", gen_ai.usage.input_tokens/output_tokens, low-cardinality error.type; public API rg.StartGenAICall → span.End with auto cost estimation, TTFT/TPOT
- 14-model pricing table, verified against provider pricing pages
- Content guardrails: PII, prompt injection (5 vectors), token/length limits; wired into Go middleware (422)
- Prometheus /metrics endpoint (stdlib only): live counters — requests, rate limit hits, budget exhaustion, breaker trips, tokens consumed, loop stats
- IETF RateLimit-* response headers (draft-ietf-httpapi-ratelimit-headers)
- Events/webhooks for every request (Go)
- Redis distributed limiter: atomic Lua GCRA, read-only peek script (Go)
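
As an illustration of the loop-detection bullet above, the sketch below combines SHA-256 payload fingerprinting, a maximum sequence depth, and an LRU-bounded table. It is not the SDK's implementation: the class name `LoopDetector`, the thresholds, and the rule "flag once the same fingerprint has been seen more than `max_repeats` times" are assumptions made for the example.

```python
import hashlib
from collections import OrderedDict


class LoopDetector:
    """Flags repeated payloads or overly deep call sequences."""

    def __init__(self, max_repeats=3, max_depth=10, capacity=1024):
        self.max_repeats = max_repeats   # hypothetical threshold
        self.max_depth = max_depth       # hypothetical threshold
        self.capacity = capacity         # LRU bound on tracked fingerprints
        self.counts = OrderedDict()      # fingerprint -> times seen

    def check(self, payload: bytes, depth: int = 0) -> bool:
        """Return True if this call looks like part of a loop."""
        if depth > self.max_depth:
            return True
        key = hashlib.sha256(payload).hexdigest()
        # Pop and re-insert so the key becomes the most recently used.
        count = self.counts.pop(key, 0) + 1
        self.counts[key] = count
        # Evict least recently used fingerprints beyond capacity.
        while len(self.counts) > self.capacity:
            self.counts.popitem(last=False)
        return count > self.max_repeats
```

In a middleware setting, `depth` would come from a request header such as the `X-Sequence-Depth` header mentioned above, and a `True` result would map to the 429 `loop_detected` response.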

## Tests
All 3 SDKs green — 155 total: `go test ./...` (61 passing), `bun run test` (46 passing), `pytest -q` (48 passing)

## Docs
- AGENTS.md: full agent contract with architecture, domain types, rules, commands
- ARCHITECTURE.md: positioning vs Datadog/Kong/Cloudflare
- docs/API_REFERENCE.md: all config, adapters, provider chain, guardrails
- docs/GENAI_OBSERVABILITY.md: OTel integration, model pricing, streaming
- docs/RELEASE_NOTES.md: changelog

## vs Competition
The only in-process SDK where agents can ask "can I make this call?" before spending tokens — across Go, Node, and Python. Nobody else combines multi-language SDKs + MCP pre-flight tools + LLM token budgets + GenAI OTel + circuit breakers + loop detection + guardrails in one package.