RAF · self-hosted LLM FinOps control plane

Control your LLM spend before it controls you.

RAF is a drop-in gateway in front of your model calls — spend visibility, budget guardrails, caching, retries, PII controls — that emails you a weekly report proving exactly what it saved.

Route one workflow in < 1 hour First report in 24h No app rewrite
raf · gateway/v1/chat/completionslive
Spent today$2,418
Saved today$1,063
Cache hit38%
14:32:07support-summarize · gpt-4oCACHE HIT$0.000
14:32:07doc-classify · haikuMISS$0.004
14:32:06enrich-webhook · sonnetPII REDACT$0.011
14:32:06copy-gen · gpt-4oBUDGET BLOCK$0.38
14:32:05support-summarize · gpt-4oCACHE HIT$0.000
14:32:05agent-tool-call · sonnetRETRY ✓$0.009
14:32:04doc-classify · haikuCACHE HIT$0.000
14:32:04research-eval · opusMISS$0.042
9 routes · 4 providers · 0 uncappedledger → postgres
Q1 — where is our LLM spend going? answered Q2 — what can we safely eliminate? answered Q3 — how do we stop surprise bills? answered Q4 — are we leaking sensitive data? answered Q5 — is quality holding as we cut cost? answered cache savings 96.3% budgets enforced pre-call Q1 — where is our LLM spend going? answered Q2 — what can we safely eliminate? answered Q3 — how do we stop surprise bills? answered Q4 — are we leaking sensitive data? answered Q5 — is quality holding as we cut cost? answered cache savings 96.3% budgets enforced pre-call
Five questions, answered the same day

Every company buys tokens. Almost no one can answer these questions.

RAF turns each one from a quarterly fire-drill into a line in a report.

Q1

Where is our LLM spend actually going?

ledger by team, feature, tenant
Q2

What spend can we safely eliminate?

cache + downgrade candidates
Q3

How do we stop surprise bills?

budgets enforced pre-call
Q4

Are we leaking sensitive data into calls?

PII redaction + egress log
Q5

Is answer quality holding as we cut cost?

graders + replay eval
The problem

Buying tokens is one line of code. Governing them is nobody's job.

Teams start with direct SDK calls. It works — until usage grows. Then the same failures show up in every company at once.

01

Spend opacity

No attribution by product, tenant, feature, model, provider, or prompt. The invoice is one number.

unattributed
02

No hard guardrails

Budgets are tracked after the fact, not enforced before the call ever fires.

no cap
03

Repeated calls waste money

Identical prompts get re-sent thousands of times with no shared cache between teams.

duplicate spend
04

Runaway sessions

One bugged agent or retry loop can burn $20K over a weekend. Daily caps don't catch it.

$20K weekend
05

Silent data exposure

Prompts carry PII, secrets, and customer data out to providers with no consistent control.

unaudited egress
06

Provider incidents leak

Retries, rate limits, and failover are hand-rolled differently by every team.

inconsistent
07

Silent cost regressions

A prompt change triples cost-per-call. Nobody notices until the invoice arrives.

3× silent
08

Leadership can't see ROI

Spend grows, but there's no continuous proof of savings, avoided risk, or quality held.

no proof
How it works

Change one base URL. Keep your code. Keep your providers.

RAF speaks the OpenAI-compatible API. Point your client at it and every call inherits budgets, cache, retries, redaction, and a durable cost ledger — no app rewrite.

1

Connect traffic

Swap OPENAI_BASE_URL to your RAF endpoint. Streaming and non-streaming both pass through untouched.

2

See spend

Every call lands in a durable ledger: tokens, cost, latency, cache state, provider, retry — by tenant and feature.

3

Control spend

Turn on budgets, exact cache, retry/admission, PII redaction and egress allow-lists. Policies apply at the gateway.

4

Prove value

RAF renders a weekly report — saved, prevented, protected, what to optimize next — to inbox, Slack, and a static URL.

bash — your-app/.env
# Before — calling the provider directly
OPENAI_BASE_URL=https://api.openai.com/v1
 
# After — route through RAF. That's the change.
OPENAI_BASE_URL=https://raf.yourco.internal/v1
OPENAI_API_KEY=raf_sk_live_••••••••
 
$ raf doctor
provider keys openai, anthropic, bedrock
cache backend redis · shared
ledger postgres · durable
proxy compatible /v1/chat/completions
ready in 00:11:42 — send a test request
The control plane

Nine controls. One gateway. On by default.

Every proxied call passes through the same policy layer — so control is centralized instead of re-implemented in every service.

pre-call

Budget guardrails

Caps by tenant, route, feature, model, and environment — enforced with atomic reserve-and-settle before the provider call.

dailymonthly50/80/90/100%
free

Exact cache

Identical prompts return instantly, with avoided provider cost attributed to the savings report. Tenant-scoped by default.

redisshared-replica
resilient

Retry & admission

Jittered backoff on retryable errors and per-provider token-bucket admission — no thundering herds, no surprise 429s.

jittertoken-bucket
fail-closed

PII & secret redaction

Default pack catches email, SSN, cards, IPs, AWS keys, JWTs, tokens, PEM material — redact or block at egress.

9 categoriesstrict-block
audited

Egress allow-list

Prompts can only reach approved destinations. Every block is recorded with reason for security review.

allow-listpolicy-log
queryable

Durable cost ledger

Every call — even blocked ones — recorded with tokens, cost, latency, cache, retry, and error. Export CSV, JSONL, OTLP.

per-tenantOpenTelemetry
breaker

Runaway circuit breaker

Session-velocity limits stop retry storms and near-duplicate loops before they become a $20K weekend invoice.

session-caploop-guard
unit-economics

Cost-per-task attribution

Roll calls up into what one completed task, tenant, or outcome actually costs — not just what you spent.

per-taskper-tenant
recommends

Optimization advisor

Flags prompt bloat, duplicate calls, cheaper-model candidates, and provider reliability — ranked by est. savings.

cache-opsdowngrade
The weekly value report

The four pages your CFO actually forwards.

A dashboard waits for someone to look. RAF's report shows up — Monday 9am, in inbox and Slack, CFO-readable, four pages.

PAGE 1
The Number"RAF saved your team $X" — gross vs. net, with the explicit calculation.
PAGE 2
What RAF savedCache savings, prevented budget spend, retry recoveries, downgrade wins.
PAGE 3
What RAF protectedPII redactions, secrets blocked, egress denials, policy outcomes.
PAGE 4
What to do nextRecommendations ranked by estimated savings, risk, and effort.
Get a sample report →
lineumPAGE 1 / 4
RAF saved your team
$12,480
gross $41,900 cache $8,640 prevented $2,910 recovered $930
= net $29,420  ·  29.8% cut
Realized savings
$9,570
cache + downgrades
Prevented spend
$2,910
budget blocks, labeled apart
Protected
1,204
PII redactions this week
Next-week upside
+$5,200
3 recommendations
The dashboard

And when they do go look — the whole picture is one screen.

One normalized ledger, two views. The report pushes; the dashboard pulls. Filter by feature, route, tenant, model, provider, or environment and every panel updates.

RAF
ExecutiveEngineeringCostSafety
live · 9 routes
Spend this week
$41.9K
▲ 6.1% vol
Net after RAF
$29.4K
▼ 29.8%
Cache hit rate
38.4%
▲ 4.2pp
p99 latency
412ms
▼ 18%
Gross vs. net spend · 8 weeksgrossnetsaved
W31W32W33W34W35W36W37W38
Spend by model
gpt-4o · 46% sonnet · 28% haiku · 26%
Top features by spend
support-summarize$14.2K
doc-classify$9.8K
copy-gen$7.1K
enrich-webhook$4.5K
Before & after

Same apps. Same providers. One layer of control in between.

Nothing about your stack changes except what now flows through the middle — and what you finally get to see.

Before — direct callsblind
Your apps & agents
OpenAI ?
Anthropic ?
Bedrock ?
Azure ?
No caps · no cache · no attribution · PII unaudited · invoice arrives a month later.
route
through
After — through RAF−29.8%
Your apps & agents unchanged
RAF gateway
budgetcacheretryredactegressledger
OpenAI
Anthropic
Bedrock
Azure
Every call capped, cached, attributed, redacted — and in the weekly report.
Time to value

Installed before lunch. Proving its keep by the next morning.

RAF is built for one outcome: a buyer learns where their LLM money goes — and what they can safely save — within a day, not a quarter.

<1hr
to route your first production workflow through RAF — base URL only.
24h
to your first credible value report: saved, prevented, protected, next.
100%
of calls carry cost, token, latency, cache, retry & status metadata.
Start the pilot

Route one workflow. Tomorrow, see exactly where your money went.

Get the sample value report in your inbox, or book a 30-minute design-partner session and we'll scope your first workflow together.

Self-hosted · runs in your VPC · no raw prompts stored.