/ benchmarks / cache-safety

CacheSafety Bench

在生产启用缓存之前,先评估 LLM 响应复用是否安全。

大多数缓存基准只优化命中率,而 CacheSafety Bench 会同时衡量 Safe Hit Rate、Bad Hit Rate 和 API 节省。

阅读文档

问题

只看命中率还不够。

LLM 语义缓存可以省钱,但一次 bad hit 就会让模型在用户眼里变得不可信。CacheSafety Bench 衡量的是复用是否安全,而不只是两个提示词看起来是否相似。

Core metrics

Measure safety before you measure scale.

SH
SafetySafe Hit Rate

Reuse the user would not notice.

BH
GuardrailBad Hit Rate

The hard safety line for production caching.

$/K
EconomicsCost Saved / 1K Requests

Savings only after safe reuse is counted.

TR
Trap testSemantic Trap Failure Rate

Whether similar-looking prompts still break reuse.

How it works

Three steps before you trust caching.

P1
ReplayReplay request pairs

Run old_request, old_answer, and new_request through a conservative benchmark runner.

P2
JudgeJudge safe reuse

Check whether the old answer really satisfies the new request without hidden violations.

P3
PolicyEstimate safe savings

Export a report and a cautious policy recommendation before production rollout.

Report preview

Static example report

A useful cache policy is one that saves money without making users notice reused answers.

Total pairs2,000
Safe Hit Rate18.4%
Bad Hit Rate0.0%
Cost Saved / 1K Requests$0.42
Recommended policyExact + Canonical
Semantic cacheNot recommended yet

Hosted run

Local benchmark is free and open source. Hosted runs are optional.

NextModel hosted benchmark uses credits to run larger replay jobs, judge models, and generate shareable reports. Local benchmark runs remain open source and endpoint-neutral.

Safe savings should be measured before production caching. Hosted runs are for larger evaluations, not a requirement to use the benchmark.

Developer integration

Works with OpenAI-compatible clients.

CacheSafety Bench remains open source and endpoint-neutral. NextModel is an optional hosted endpoint and production gateway.

OpenAI-compatible example
export OPENAI_API_KEY=...
export OPENAI_BASE_URL=https://api.nextmodel.app/v1

FAQ

Common questions

Is this a semantic cache?

No. CacheSafety Bench is a benchmark for safe LLM response reuse, not a promise that semantic cache should be enabled.

Do I need to use NextModel?

No. Local benchmark runs are open source and endpoint-neutral. NextModel hosted runs are optional.

What is a bad hit?

A bad hit is a reused answer that should not have been returned for the new request because it violates facts, constraints, timing, format, or user expectations.

Can I run it locally?

Yes. The benchmark is designed to run locally first with toy, synthetic, or private datasets you control.

What data do I need?

You need request pairs or replay pairs that include old_request, old_answer, new_request, and ideally a fresh reference answer.

How does this help reduce API cost?

It measures whether reuse is safe before production caching, then estimates savings only from safe hits.

Is this safe for medical/legal/financial use?

No default claim here should treat those domains as safe reuse targets. High-risk reuse should stay conservative.

现在开始

在生产前先衡量 LLM 响应复用是否安全。

先在本地运行开放基准;只有在需要更大规模回放任务和可分享报告时,再选择托管工作流。