Reuse the user would not notice.
CacheSafety Bench
Benchmark safe LLM response reuse before you put caching into production.
Most cache benchmarks optimize hit rate. CacheSafety Bench measures Safe Hit Rate, Bad Hit Rate, and API Cost Saved.
Problem
Hit rate is not enough.
LLM semantic caching can save money, but a bad hit makes your model look wrong. CacheSafety Bench measures whether reuse is safe, not just whether two prompts look similar.
Core metrics
Measure safety before you measure scale.
The hard safety line for production caching.
Savings only after safe reuse is counted.
Whether similar-looking prompts still break reuse.
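As a rough illustration of how the headline metrics could be computed from judged results, here is a minimal Python sketch. The `PairResult` type, its field names, and the choice to divide by all evaluated pairs are assumptions made for this example, not the benchmark's actual schema or definitions. The one rule taken from this page is that savings count only when the reuse was safe.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PairResult:
    """One judged replay pair (hypothetical shape, for illustration only)."""
    reused: bool           # the cache policy returned the old answer
    safe: bool             # the old answer was judged acceptable for the new request
    fresh_cost_usd: float  # assumed cost of a fresh API call for this request


def summarize(results: List[PairResult]) -> Dict[str, float]:
    """Compute Safe Hit Rate, Bad Hit Rate, and API Cost Saved.

    Assumption: both rates are fractions of all evaluated pairs.
    """
    total = len(results)
    if total == 0:
        return {"safe_hit_rate": 0.0, "bad_hit_rate": 0.0, "api_cost_saved_usd": 0.0}

    safe_hits = [r for r in results if r.reused and r.safe]
    bad_hits = [r for r in results if r.reused and not r.safe]

    return {
        "safe_hit_rate": len(safe_hits) / total,
        "bad_hit_rate": len(bad_hits) / total,
        # Bad hits contribute nothing: savings are counted only for safe reuse.
        "api_cost_saved_usd": sum(r.fresh_cost_usd for r in safe_hits),
    }


results = [
    PairResult(reused=True, safe=True, fresh_cost_usd=0.01),
    PairResult(reused=True, safe=True, fresh_cost_usd=0.01),
    PairResult(reused=True, safe=False, fresh_cost_usd=0.01),
    PairResult(reused=False, safe=False, fresh_cost_usd=0.01),
]
summary = summarize(results)
print(summary)
```

In this toy run, a raw hit rate would report three hits out of four, while the summary separates the two safe hits from the one bad hit and credits savings only to the safe ones.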
How it works
Three steps before you trust caching.
Run old_request, old_answer, and new_request through a conservative benchmark runner.
Check whether the old answer really satisfies the new request without hidden violations.
Export a report and a cautious policy recommendation before production rollout.
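The three steps above can be sketched as a small loop. This is a toy outline under stated assumptions, not the benchmark's real runner. The function names, the report keys, the `max_unsafe_rate` threshold, and the recommendation labels are all hypothetical. The `exact_match_judge` is a deliberately strict stand-in for whatever judge you actually use.

```python
import json
from typing import Callable, Dict, List

# A judge takes (old_request, old_answer, new_request) and returns True
# when the old answer may be reused. Hypothetical interface.
Judge = Callable[[str, str, str], bool]


def exact_match_judge(old_request: str, old_answer: str, new_request: str) -> bool:
    """Toy judge: allow reuse only if the requests match after trimming and lowercasing."""
    return old_request.strip().lower() == new_request.strip().lower()


def run_benchmark(pairs: List[Dict[str, str]], judge: Judge,
                  max_unsafe_rate: float = 0.0) -> dict:
    # Steps 1 and 2: run each triple through the judge and record the verdict.
    verdicts = [
        {
            "new_request": p["new_request"],
            "safe_to_reuse": judge(p["old_request"], p["old_answer"], p["new_request"]),
        }
        for p in pairs
    ]
    total = len(verdicts)
    unsafe = sum(1 for v in verdicts if not v["safe_to_reuse"])
    unsafe_rate = unsafe / total if total else 0.0

    # Step 3: build a report with a cautious recommendation.
    # With no data, or any unsafe reuse above the threshold, do not cache.
    ok = total > 0 and unsafe_rate <= max_unsafe_rate
    return {
        "pairs_evaluated": total,
        "unsafe_reuse_rate": unsafe_rate,
        "recommendation": "reuse_candidate" if ok else "do_not_cache",
        "verdicts": verdicts,
    }


pairs = [
    {"old_request": "What is the capital of France?", "old_answer": "Paris.",
     "new_request": "what is the capital of france?"},
    {"old_request": "Summarize this in 3 bullets.", "old_answer": "- a\n- b\n- c",
     "new_request": "Summarize this in 5 bullets."},
]
report = run_benchmark(pairs, exact_match_judge)
print(json.dumps(report, indent=2))
```

The second pair shows why similarity alone is risky: the requests differ by one number, yet the old answer violates the new format constraint, so the sketch recommends against caching.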
Report preview
Static example report
A useful cache policy is one that saves money without making users notice reused answers.
Hosted run
Local benchmark is free and open source. Hosted runs are optional.
The NextModel hosted benchmark uses credits to run larger replay jobs and judge models and to generate shareable reports. Local benchmark runs remain open source and endpoint-neutral.
Safe savings should be measured before production caching. Hosted runs are for larger evaluations, not a requirement to use the benchmark.
Developer integration
Works with OpenAI-compatible clients.
CacheSafety Bench remains open source and endpoint-neutral. NextModel is an optional hosted endpoint and production gateway.
export OPENAI_API_KEY=...
export OPENAI_BASE_URL=https://api.nextmodel.app/v1
FAQ
Common questions
Is this a semantic cache?
No. CacheSafety Bench is a benchmark for safe LLM response reuse, not a promise that semantic caching should be enabled.
Do I need to use NextModel?
No. Local benchmark runs are open source and endpoint-neutral. NextModel hosted runs are optional.
What is a bad hit?
A bad hit is a reused answer that should not have been returned for the new request because it violates facts, constraints, timing, format, or user expectations.
Can I run it locally?
Yes. The benchmark is designed to run locally first with toy, synthetic, or private datasets you control.
What data do I need?
You need request pairs or replay pairs that include old_request, old_answer, new_request, and ideally a fresh reference answer.
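To make the data requirement concrete, here is a minimal sketch that loads replay pairs from JSON Lines text and checks for the three required fields. The JSONL layout and the `reference_answer` key name are assumptions for this example. The page names the required fields but does not specify a file format or a name for the fresh reference answer.

```python
import json
from typing import Dict, List, Optional

REQUIRED_FIELDS = ("old_request", "old_answer", "new_request")


def load_pairs(jsonl_text: str) -> List[Dict[str, Optional[str]]]:
    """Parse one JSON object per line and validate the required fields."""
    pairs = []
    for line_number, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        missing = [name for name in REQUIRED_FIELDS if name not in record]
        if missing:
            raise ValueError(f"line {line_number}: missing fields {missing}")
        # The fresh reference answer is optional; default to None when absent.
        record.setdefault("reference_answer", None)
        pairs.append(record)
    return pairs


sample = "\n".join([
    json.dumps({
        "old_request": "Convert 10 km to miles.",
        "old_answer": "About 6.21 miles.",
        "new_request": "Convert 10 miles to km.",
        "reference_answer": "About 16.09 km.",
    }),
    json.dumps({
        "old_request": "Define idempotent.",
        "old_answer": "An operation that gives the same result when repeated.",
        "new_request": "What does idempotent mean?",
    }),
])
loaded = load_pairs(sample)
print(len(loaded), "pairs loaded")
```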
How does this help reduce API cost?
It measures whether reuse is safe before production caching, then estimates savings only from safe hits.
Is this safe for medical/legal/financial use?
No. Nothing here claims that those domains are safe targets for reuse by default. High-risk reuse should stay conservative.
Start now
Measure safe LLM response reuse before production.
Run the open benchmark locally, then use an optional hosted workflow only when you want larger replay jobs and shareable reports.