What is CacheSafety Bench?

CacheSafety Bench is a benchmark for measuring safe LLM response reuse. It compares Safe Hit Rate, Bad Hit Rate, Semantic Trap Failure Rate, and cost saved before teams enable production caching.

Is CacheSafety Bench a semantic cache?

No. CacheSafety Bench is a measurement workflow, not a claim that semantic caching should be enabled by default.

/ benchmarks / cache-safety

CacheSafety Bench

Benchmark safe LLM response reuse before you put caching into production.

Run Hosted Benchmark View GitHub Estimate Savings

Most cache benchmarks optimize hit rate. CacheSafety Bench measures Safe Hit Rate, Bad Hit Rate, and API Cost Saved.

Read Docs

Problem

Hit rate is not enough.

LLM semantic caching can save money, but a bad hit makes your model look wrong. CacheSafety Bench measures whether reuse is safe, not just whether two prompts look similar.

Core metrics

Measure safety before you measure scale.

SafetySafe Hit Rate

Reuse the user would not notice.

GuardrailBad Hit Rate

The hard safety line for production caching.

$/K

EconomicsCost Saved / 1K Requests

Savings only after safe reuse is counted.

Trap testSemantic Trap Failure Rate

Whether similar-looking prompts still break reuse.

How it works

Three steps before you trust caching.

ReplayReplay request pairs

Run old_request, old_answer, and new_request through a conservative benchmark runner.

JudgeJudge safe reuse

Check whether the old answer really satisfies the new request without hidden violations.

PolicyEstimate safe savings

Export a report and a cautious policy recommendation before production rollout.

Report preview

Static example report

A useful cache policy is one that saves money without making users notice reused answers.

Total pairs2,000

Safe Hit Rate18.4%

Bad Hit Rate0.0%

Cost Saved / 1K Requests$0.42

Recommended policyExact + Canonical

Semantic cacheNot recommended yet

Estimate Savings Read Docs

Hosted run

Local benchmark is free and open source. Hosted runs are optional.

NextModel hosted benchmark uses credits to run larger replay jobs, judge models, and generate shareable reports. Local benchmark runs remain open source and endpoint-neutral.

Safe savings should be measured before production caching. Hosted runs are for larger evaluations, not a requirement to use the benchmark.

Start with free credits

Developer integration

Works with OpenAI-compatible clients.

CacheSafety Bench remains open source and endpoint-neutral. NextModel is an optional hosted endpoint and production gateway.

OpenAI-compatible example

export OPENAI_API_KEY=...
export OPENAI_BASE_URL=https://api.nextmodel.app/v1

FAQ

Common questions

Is this a semantic cache?

No. CacheSafety Bench is a benchmark for safe LLM response reuse, not a promise that semantic cache should be enabled.

Do I need to use NextModel?

No. Local benchmark runs are open source and endpoint-neutral. NextModel hosted runs are optional.

What is a bad hit?

A bad hit is a reused answer that should not have been returned for the new request because it violates facts, constraints, timing, format, or user expectations.

Can I run it locally?

Yes. The benchmark is designed to run locally first with toy, synthetic, or private datasets you control.

What data do I need?

You need request pairs or replay pairs that include old_request, old_answer, new_request, and ideally a fresh reference answer.

How does this help reduce API cost?

It measures whether reuse is safe before production caching, then estimates savings only from safe hits.

Is this safe for medical/legal/financial use?

No default claim here should treat those domains as safe reuse targets. High-risk reuse should stay conservative.

Start now

Measure safe LLM response reuse before production.

Run the open benchmark locally, then use an optional hosted workflow only when you want larger replay jobs and shareable reports.

Run Hosted Benchmark Read Docs