Guides
Semantic caching
Exact-match caching misses the common case: two prompts that mean the same thing but differ in wording never hit. Semantic caching embeds the prompt and serves a prior response when a sufficiently similar prompt was already answered — real cost and latency savings on workloads with duplicate intent (support bots, agent retries, templated prompts with small variations).
No bundled embedding model — on purpose
Embedder is a one-method interface — bring the OpenAI/Cohere/Voyage embeddings API, a local sentence-transformer binding, or anything else that turns text into a vector.Set it up
type Embedder interface {
Embed(ctx context.Context, text string) ([]float32, error)
}
client := rg.WrapClient(&http.Client{}, rateguard.OutboundOptions{
SemanticCache: &rateguard.SemanticCacheOptions{
Embedder: myEmbedder, // required — no default
SimilarityThreshold: 0.92, // default 0.92
TTL: time.Hour, // default 1h
MaxEntriesPerScope: 500, // default 500, oldest-first eviction
},
})A cache hit skips the network call, the per-provider circuit breaker, and the token budget reservation entirely — this is a real dollar saved, not just a faster response. The response carries X-RateGuard-Cache: hit so callers and observability can tell it apart from a live call.
What gets cached, and what never does
| Rule | Why |
|---|---|
| Scoped per provider:model | An entry for openai:gpt-4o never serves an anthropic:claude-opus-4-5 request, even with an identical prompt. |
| Streaming requests always bypass the cache | Replaying a cached body as a fabricated SSE stream would misrepresent TTFT/TPOT to the caller. |
| Only HTTP 200, non-synthesized responses are stored | A provider error or a RateGuard-synthesized 429/503 rejection is never cached. |
| An Embedder error degrades to a real call | Caching is a cost optimization, never a reason to fail a request. |
Prompt extraction
RateGuard understands OpenAI- and Anthropic-shaped chat request bodies — messages[].content as a plain string or as typed parts, plus Anthropic's top-level system field. Non-text parts (images, audio) are ignored for embedding purposes; only the text content contributes to the similarity comparison.
Tuning the threshold
SimilarityThreshold (default 0.92) is the one knob that matters. Higher is safer — fewer false-positive hits where a semantically different prompt gets served a stale answer — but lower catches more paraphrases. Start at the default and measure your own hit rate before loosening it; the right value depends entirely on your embedding model and your workload's prompt diversity.