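Why this benchmark exists
Most caching benchmarks optimize for hit rate. CacheSafety Bench asks a stricter question: can an old answer safely serve a new request, without the user noticing an incorrect reuse?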
| Metric | What it measures |
|---|---|
| Safe Hit Rate | Reusable answers the user would not notice were cached |
| Bad Hit Rate | Unsafe reused answers |
| Cost Saved / 1K Requests | Estimated savings under a safety constraint |
| Semantic Trap Failure Rate | How often similar-looking prompts still fail reuse |
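
As a rough illustration of how the four metrics in the table could be computed from labeled results, here is a minimal Python sketch. Everything in it is an assumption made for illustration and not the benchmark's actual implementation: the `Outcome` record, its field names, and the `summarize` function are hypothetical. It also assumes that rates are taken over all requests, that savings count only safe hits, and that a semantic trap "fails" when the cache serves an unsafe answer for it.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    """One evaluated request (hypothetical schema, not the benchmark's own)."""
    cache_hit: bool    # the cache returned a stored answer
    reuse_safe: bool   # a judge accepted the reused answer (only meaningful on hits)
    is_trap: bool      # prompt looks like a cached one but needs a different answer
    cost_fresh: float  # assumed cost of generating a fresh answer instead


def summarize(outcomes):
    """Compute the four table metrics under the assumptions stated above."""
    n = len(outcomes)
    if n == 0:
        raise ValueError("need at least one outcome")

    safe_hits = [o for o in outcomes if o.cache_hit and o.reuse_safe]
    bad_hits = [o for o in outcomes if o.cache_hit and not o.reuse_safe]
    traps = [o for o in outcomes if o.is_trap]
    trap_failures = [o for o in traps if o.cache_hit and not o.reuse_safe]

    # Assumption: only safe hits count as savings (the "safety constraint").
    saved = sum(o.cost_fresh for o in safe_hits)

    return {
        "safe_hit_rate": len(safe_hits) / n,
        "bad_hit_rate": len(bad_hits) / n,
        "cost_saved_per_1k": saved / n * 1000,
        "semantic_trap_failure_rate": (
            len(trap_failures) / len(traps) if traps else 0.0
        ),
    }
```

Separating safe hits from bad hits is the point of the sketch: a plain hit rate would add the two together, which hides exactly the failures this benchmark is meant to expose.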