RAF is a drop-in gateway in front of your model calls — spend visibility, budget guardrails, caching, retries, PII controls — that emails you a weekly report proving exactly what it saved.
RAF turns each one from a quarterly fire-drill into a line in a report.
Where is our LLM spend actually going?
What spend can we safely eliminate?
How do we stop surprise bills?
Are we leaking sensitive data into calls?
Is answer quality holding as we cut cost?
Teams start with direct SDK calls. It works — until usage grows. Then the same failures show up in every company at once.
No attribution by product, tenant, feature, model, provider, or prompt. The invoice is one number.
unattributedBudgets are tracked after the fact, not enforced before the call ever fires.
no capIdentical prompts get re-sent thousands of times with no shared cache between teams.
duplicate spendOne bugged agent or retry loop can burn $20K over a weekend. Daily caps don't catch it.
$20K weekendPrompts carry PII, secrets, and customer data out to providers with no consistent control.
unaudited egressRetries, rate limits, and failover are hand-rolled differently by every team.
inconsistentA prompt change triples cost-per-call. Nobody notices until the invoice arrives.
3× silentSpend grows, but there's no continuous proof of savings, avoided risk, or quality held.
no proofRAF speaks the OpenAI-compatible API. Point your client at it and every call inherits budgets, cache, retries, redaction, and a durable cost ledger — no app rewrite.
Swap OPENAI_BASE_URL to your RAF endpoint. Streaming and non-streaming both pass through untouched.
Every call lands in a durable ledger: tokens, cost, latency, cache state, provider, retry — by tenant and feature.
Turn on budgets, exact cache, retry/admission, PII redaction and egress allow-lists. Policies apply at the gateway.
RAF renders a weekly report — saved, prevented, protected, what to optimize next — to inbox, Slack, and a static URL.
Every proxied call passes through the same policy layer — so control is centralized instead of re-implemented in every service.
Caps by tenant, route, feature, model, and environment — enforced with atomic reserve-and-settle before the provider call.
dailymonthly50/80/90/100%Identical prompts return instantly, with avoided provider cost attributed to the savings report. Tenant-scoped by default.
redisshared-replicaJittered backoff on retryable errors and per-provider token-bucket admission — no thundering herds, no surprise 429s.
jittertoken-bucketDefault pack catches email, SSN, cards, IPs, AWS keys, JWTs, tokens, PEM material — redact or block at egress.
9 categoriesstrict-blockPrompts can only reach approved destinations. Every block is recorded with reason for security review.
allow-listpolicy-logEvery call — even blocked ones — recorded with tokens, cost, latency, cache, retry, and error. Export CSV, JSONL, OTLP.
per-tenantOpenTelemetrySession-velocity limits stop retry storms and near-duplicate loops before they become a $20K weekend invoice.
session-caploop-guardRoll calls up into what one completed task, tenant, or outcome actually costs — not just what you spent.
per-taskper-tenantFlags prompt bloat, duplicate calls, cheaper-model candidates, and provider reliability — ranked by est. savings.
cache-opsdowngradeA dashboard waits for someone to look. RAF's report shows up — Monday 9am, in inbox and Slack, CFO-readable, four pages.
One normalized ledger, two views. The report pushes; the dashboard pulls. Filter by feature, route, tenant, model, provider, or environment and every panel updates.
Nothing about your stack changes except what now flows through the middle — and what you finally get to see.
budgetcacheretryredactegressledgerRAF is built for one outcome: a buyer learns where their LLM money goes — and what they can safely save — within a day, not a quarter.
Get the sample value report in your inbox, or book a 30-minute design-partner session and we'll scope your first workflow together.