Reuse the user would not notice.
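The problem
Hit rate alone is not enough.
LLM semantic caching can save money, but a single bad hit can make the model look untrustworthy to users. CacheSafety Bench measures whether reuse is safe, not just whether two prompts look similar.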
Core metrics
Measure safety before you measure scale.
The hard safety line for production caching.
Savings only after safe reuse is counted.
Whether similar-looking prompts still break reuse.
How it works
Three steps before you trust caching.
Run old_request, old_answer, and new_request through a conservative benchmark runner.
Check whether the old answer really satisfies the new request without hidden violations.
Export a report and a cautious policy recommendation before production rollout.
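The three steps above can be sketched roughly as follows. This is a minimal illustration, not the CacheSafety Bench implementation: `ReplayPair`, `conservative_judge`, `run_benchmark`, and the recommendation strings are hypothetical names, and the toy judge accepts reuse only when the two requests match after normalizing case and whitespace.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ReplayPair:
    # Field names follow the page; the class itself is hypothetical.
    old_request: str
    old_answer: str
    new_request: str


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace."""
    return " ".join(text.lower().split())


def conservative_judge(pair: ReplayPair) -> bool:
    """Toy stand-in for a real check: reuse is accepted only when the
    requests are identical after normalization. Anything else is unsafe."""
    return _normalize(pair.old_request) == _normalize(pair.new_request)


def run_benchmark(pairs: List[ReplayPair],
                  judge: Callable[[ReplayPair], bool]) -> dict:
    # Steps 1 and 2: run every pair through the judge.
    verdicts = [judge(pair) for pair in pairs]
    safe = sum(verdicts)
    unsafe = len(pairs) - safe
    # Step 3: build a report with a cautious recommendation (assumed wording).
    if pairs and unsafe == 0:
        recommendation = "candidate for a limited caching trial"
    else:
        recommendation = "keep caching disabled"
    return {
        "pairs": len(pairs),
        "safe_reuse": safe,
        "unsafe_reuse": unsafe,
        "recommendation": recommendation,
    }


if __name__ == "__main__":
    demo = [
        ReplayPair("What is 2+2?", "4", "what is  2+2?"),
        ReplayPair("Weather in Paris today?", "Sunny",
                   "Weather in Paris tomorrow?"),
    ]
    print(run_benchmark(demo, conservative_judge))
```

A real judge would also need to catch the hidden violations mentioned above, such as a changed date or constraint, which string matching alone cannot detect.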
Hosted run
Local benchmark is free and open source. Hosted runs are optional.
The NextModel hosted benchmark uses credits to run larger replay jobs and judge models and to generate shareable reports. Local benchmark runs remain open source and endpoint-neutral.
Safe savings should be measured before production caching. Hosted runs are for larger evaluations, not a requirement to use the benchmark.
Developer integration
Works with OpenAI-compatible clients.
CacheSafety Bench remains open source and endpoint-neutral. NextModel is an optional hosted endpoint and production gateway.
export OPENAI_API_KEY=...
export OPENAI_BASE_URL=https://api.nextmodel.app/v1
FAQ
Common questions
Is this a semantic cache?
No. CacheSafety Bench is a benchmark for safe LLM response reuse, not a promise that semantic caching should be enabled.
Do I need to use NextModel?
No. Local benchmark runs are open source and endpoint-neutral. NextModel hosted runs are optional.
What is a bad hit?
A bad hit is a reused answer that should not have been returned for the new request because it violates facts, constraints, timing, format, or user expectations.
Can I run it locally?
Yes. The benchmark is designed to run locally first with toy, synthetic, or private datasets you control.
What data do I need?
You need request pairs or replay pairs that include old_request, old_answer, new_request, and ideally a fresh reference answer.
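One way to hold such pairs is one JSON object per line. The sketch below assumes that layout. The field names `old_request`, `old_answer`, and `new_request` come from the answer above, while `reference_answer` and the `load_pairs` helper are hypothetical names chosen for illustration, not a documented input format.

```python
import json
from typing import Dict, Iterable, List, Optional

REQUIRED_FIELDS = ("old_request", "old_answer", "new_request")
OPTIONAL_FIELD = "reference_answer"  # assumed name for the fresh reference


def load_pairs(lines: Iterable[str]) -> List[Dict[str, Optional[str]]]:
    """Parse JSON Lines text into replay pairs, rejecting incomplete records."""
    pairs = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        missing = [name for name in REQUIRED_FIELDS if name not in record]
        if missing:
            raise ValueError(f"line {number}: missing fields {missing}")
        # The reference answer is optional; store None when it is absent.
        record.setdefault(OPTIONAL_FIELD, None)
        pairs.append(record)
    return pairs
```

Failing on incomplete records, rather than skipping them silently, keeps the pair count in any later report honest.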
How does this help reduce API cost?
It measures whether reuse is safe before production caching, then estimates savings only from safe hits.
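As a rough illustration of counting savings only from safe hits, the sketch below uses assumed definitions rather than the benchmark's official ones: bad hit rate is unsafe hits divided by all hits, and safe savings is the cost of the fresh calls avoided by safe hits divided by the cost of answering everything fresh. `Outcome` and `summarize` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Outcome:
    hit: bool          # the cache would have returned the old answer
    safe: bool         # the old answer was judged safe for the new request
    fresh_cost: float  # cost of answering the new request without the cache


def summarize(outcomes: List[Outcome]) -> Dict[str, float]:
    hits = [o for o in outcomes if o.hit]
    bad_hits = [o for o in hits if not o.safe]
    total_cost = sum(o.fresh_cost for o in outcomes)
    # Only safe hits count as savings; a bad hit saves nothing useful.
    saved_cost = sum(o.fresh_cost for o in hits if o.safe)
    return {
        "hit_rate": len(hits) / len(outcomes) if outcomes else 0.0,
        "bad_hit_rate": len(bad_hits) / len(hits) if hits else 0.0,
        "safe_savings": saved_cost / total_cost if total_cost else 0.0,
    }
```

Under these definitions a cache can show a high hit rate and still deliver low safe savings, which is why the two numbers are reported separately.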
Is this safe for medical/legal/financial use?
No. Nothing here should be read as a claim that those domains are safe targets for reuse. High-risk reuse should stay conservative.