DeepSeek V3 scored 0.91 on the Bloom benchmark!

For delusional sycophancy, though.

Isha Gupta and the Anthropic team released Bloom, an open-source benchmark framework for behavioral evals of large models. Also, Justin Wetch put a nice UI on it.

Bloom auto-generates eval sets for the major behavioral taxonomies, including delusional sycophancy, instructed long-horizon sabotage, self-preservation, and self-preferential bias.

Then, it simulates user and tool responses and scores the model’s responses on how often/severely the model exhibits the concerning behaviors.

Evals and benchmarking are incredibly hard, and I appreciate that Bloom is open-sourced.

Just make sure when you evaluate your AI system, you evaluate risks and utility, because "undesired behaviors" are often negatively correlated with the "good" ones. For example, training a model for a mental health app to be warm and empathetic can push the model toward agreeing with vulnerable users over grounding them.

Justin’s UI for Bloom

The Bloom technical report