RAF · self-hosted LLM FinOps control plane

Control your LLM spend before it controls you.

RAF is a drop-in gateway in front of your model calls — spend visibility, budget guardrails, caching, retries, PII controls — that emails you a weekly report proving exactly what it saved.

Get the sample report → Book a pilot

Route one workflow in < 1 hour First report in 24h No app rewrite

raf · gateway/v1/chat/completionslive

Spent today$2,418

Saved today$1,063

Cache hit38%

14:32:07support-summarize · gpt-4oCACHE HIT$0.000

14:32:07doc-classify · haikuMISS$0.004

14:32:06enrich-webhook · sonnetPII REDACT$0.011

14:32:06copy-gen · gpt-4oBUDGET BLOCK$0.38

14:32:05support-summarize · gpt-4oCACHE HIT$0.000

14:32:05agent-tool-call · sonnetRETRY ✓$0.009

14:32:04doc-classify · haikuCACHE HIT$0.000

14:32:04research-eval · opusMISS$0.042

9 routes · 4 providers · 0 uncappedledger → postgres

Q1 — where is our LLM spend going? answered Q2 — what can we safely eliminate? answered Q3 — how do we stop surprise bills? answered Q4 — are we leaking sensitive data? answered Q5 — is quality holding as we cut cost? answered cache savings 96.3% budgets enforced pre-call Q1 — where is our LLM spend going? answered Q2 — what can we safely eliminate? answered Q3 — how do we stop surprise bills? answered Q4 — are we leaking sensitive data? answered Q5 — is quality holding as we cut cost? answered cache savings 96.3% budgets enforced pre-call

Five questions, answered the same day

Every company buys tokens. Almost no one can answer these questions.

RAF turns each one from a quarterly fire-drill into a line in a report.

Where is our LLM spend actually going?

ledger by team, feature, tenant

What spend can we safely eliminate?

cache + downgrade candidates

How do we stop surprise bills?

budgets enforced pre-call

Are we leaking sensitive data into calls?

PII redaction + egress log

Is answer quality holding as we cut cost?

graders + replay eval

The problem

Buying tokens is one line of code. Governing them is nobody's job.

Teams start with direct SDK calls. It works — until usage grows. Then the same failures show up in every company at once.

Spend opacity

No attribution by product, tenant, feature, model, provider, or prompt. The invoice is one number.

unattributed

No hard guardrails

Budgets are tracked after the fact, not enforced before the call ever fires.

no cap

Repeated calls waste money

Identical prompts get re-sent thousands of times with no shared cache between teams.

duplicate spend

Runaway sessions

One bugged agent or retry loop can burn $20K over a weekend. Daily caps don't catch it.

$20K weekend

Silent data exposure

Prompts carry PII, secrets, and customer data out to providers with no consistent control.

unaudited egress

Provider incidents leak

Retries, rate limits, and failover are hand-rolled differently by every team.

inconsistent

Silent cost regressions

A prompt change triples cost-per-call. Nobody notices until the invoice arrives.

3× silent

Leadership can't see ROI

Spend grows, but there's no continuous proof of savings, avoided risk, or quality held.

no proof

How it works

Change one base URL. Keep your code. Keep your providers.

RAF speaks the OpenAI-compatible API. Point your client at it and every call inherits budgets, cache, retries, redaction, and a durable cost ledger — no app rewrite.

Connect traffic

Swap OPENAI_BASE_URL to your RAF endpoint. Streaming and non-streaming both pass through untouched.

See spend

Every call lands in a durable ledger: tokens, cost, latency, cache state, provider, retry — by tenant and feature.

Control spend

Turn on budgets, exact cache, retry/admission, PII redaction and egress allow-lists. Policies apply at the gateway.

Prove value

RAF renders a weekly report — saved, prevented, protected, what to optimize next — to inbox, Slack, and a static URL.

bash — your-app/.env

# Before — calling the provider directly
OPENAI_BASE_URL=https://api.openai.com/v1
 
# After — route through RAF. That's the change.
OPENAI_BASE_URL=https://raf.yourco.internal/v1
OPENAI_API_KEY=raf_sk_live_••••••••
 
$ raf doctor
✓ provider keys      openai, anthropic, bedrock
✓ cache backend      redis · shared
✓ ledger             postgres · durable
✓ proxy compatible   /v1/chat/completions
✓ ready in 00:11:42 — send a test request

The control plane

Nine controls. One gateway. On by default.

Every proxied call passes through the same policy layer — so control is centralized instead of re-implemented in every service.

pre-call

Budget guardrails

Caps by tenant, route, feature, model, and environment — enforced with atomic reserve-and-settle before the provider call.

dailymonthly50/80/90/100%

free

Exact cache

Identical prompts return instantly, with avoided provider cost attributed to the savings report. Tenant-scoped by default.

redisshared-replica

resilient

Retry & admission

Jittered backoff on retryable errors and per-provider token-bucket admission — no thundering herds, no surprise 429s.

jittertoken-bucket

fail-closed

PII & secret redaction

Default pack catches email, SSN, cards, IPs, AWS keys, JWTs, tokens, PEM material — redact or block at egress.

9 categoriesstrict-block

audited

Egress allow-list

Prompts can only reach approved destinations. Every block is recorded with reason for security review.

allow-listpolicy-log

queryable

Durable cost ledger

Every call — even blocked ones — recorded with tokens, cost, latency, cache, retry, and error. Export CSV, JSONL, OTLP.

per-tenantOpenTelemetry

breaker

Runaway circuit breaker

Session-velocity limits stop retry storms and near-duplicate loops before they become a $20K weekend invoice.

session-caploop-guard

unit-economics

Cost-per-task attribution

Roll calls up into what one completed task, tenant, or outcome actually costs — not just what you spent.

per-taskper-tenant

recommends

Optimization advisor

Flags prompt bloat, duplicate calls, cheaper-model candidates, and provider reliability — ranked by est. savings.

cache-opsdowngrade

The weekly value report

The four pages your CFO actually forwards.

A dashboard waits for someone to look. RAF's report shows up — Monday 9am, in inbox and Slack, CFO-readable, four pages.

PAGE 1

The Number"RAF saved your team $X" — gross vs. net, with the explicit calculation.

PAGE 2

What RAF savedCache savings, prevented budget spend, retry recoveries, downgrade wins.

PAGE 3

What RAF protectedPII redactions, secrets blocked, egress denials, policy outcomes.

PAGE 4

What to do nextRecommendations ranked by estimated savings, risk, and effort.

Get a sample report →

lineumPAGE 1 / 4

RAF saved your team

$12,480

gross $41,900 − cache $8,640 − prevented $2,910 − recovered $930
= net $29,420 · 29.8% cut

Realized savings

$9,570

cache + downgrades

Prevented spend

$2,910

budget blocks, labeled apart

Protected

1,204

PII redactions this week

Next-week upside

+$5,200

3 recommendations

The dashboard

And when they do go look — the whole picture is one screen.

One normalized ledger, two views. The report pushes; the dashboard pulls. Filter by feature, route, tenant, model, provider, or environment and every panel updates.

RAF

ExecutiveEngineeringCostSafety

live · 9 routes

Spend this week

$41.9K

▲ 6.1% vol

Net after RAF

$29.4K

▼ 29.8%

Cache hit rate

38.4%

▲ 4.2pp

p99 latency

412ms

▼ 18%

Gross vs. net spend · 8 weeksgrossnetsaved

W31W32W33W34W35W36W37W38

Spend by model

gpt-4o · 46% sonnet · 28% haiku · 26%

Top features by spend

support-summarize$14.2K

doc-classify$9.8K

copy-gen$7.1K

enrich-webhook$4.5K

Before & after

Same apps. Same providers. One layer of control in between.

Nothing about your stack changes except what now flows through the middle — and what you finally get to see.

Before — direct callsblind

Your apps & agents

↓

OpenAI ?

Anthropic ?

Bedrock ?

Azure ?

No caps · no cache · no attribution · PII unaudited · invoice arrives a month later.

route
through→

After — through RAF−29.8%

Your apps & agents unchanged

↓

RAF gateway

budgetcacheretryredactegressledger

↓

OpenAI ✓

Anthropic ✓

Bedrock ✓

Azure ✓

Every call capped, cached, attributed, redacted — and in the weekly report.

Time to value

Installed before lunch. Proving its keep by the next morning.

RAF is built for one outcome: a buyer learns where their LLM money goes — and what they can safely save — within a day, not a quarter.

<1hr

to route your first production workflow through RAF — base URL only.

24h

to your first credible value report: saved, prevented, protected, next.

100%

of calls carry cost, token, latency, cache, retry & status metadata.

Start the pilot

Route one workflow. Tomorrow, see exactly where your money went.

Get the sample value report in your inbox, or book a 30-minute design-partner session and we'll scope your first workflow together.

Book a pilot Self-host quickstart

Self-hosted · runs in your VPC · no raw prompts stored.