Multimodal large language models (MLLMs) are now embedded in everyday life from enterprises scaling workflows to individuals reading labels at the grocery store. The rapid rate of innovation comes with high risks, and where models succeed and fail in the real world is something we are watching play out in real time. Recent safety research suggests that Claude is outperforming the competition when it comes to multimodal safety. The biggest difference? Saying “no.”
Our recent study exposed four leading models to 726 adversarial prompts targeting illegal activity, disinformation, and unethical behaviour. Human annotators rated nearly 3,000 model outputs for harmfulness across both text-only and text–image inputs. The results revealed persistent vulnerabilities across even the most state-of-the-art models: Pixtral 12B produced harmful content about 62 percent of the time, Qwen about 39 percent, GPT-4o about 19 percent, and Claude about 10 to 11 percent (Van Doren & Ford, 2025).
These results translate to operational risk. The attack playbook looked familiar: role play, refusal suppression, strategic reframing, and distraction noise. None of that is news, which is the point. Social prompts still pull systems toward unsafe helpfulness, even as models improve and new ones launch.
Modern multimodal stacks add encoders, connectors, and training regimes across inputs and tasks. That expansion increases the space where errors and unsafe behaviour can appear, which complicates evaluation and governance (Yin et al., 2024). External work has also shown that robustness can shift under realistic distribution changes across image and text, which is a reminder to test the specific pathways you plan to ship, not just a blended score (Qiu et al., 2024). Precision-sensitive visual tasks remain brittle in places, another signal to route high-risk asks to safer modes or to human review when needed (Cho et al., 2024).
Claude’s lower harmfulness coincided with more frequent refusals. In high-risk contexts, a plausible but unsafe answer is worse than a refusal. If benchmarks penalize abstention, they nudge models to bluff (OpenAI, 2025). That is the opposite of what you want under adversarial pressure.
Traditional scoring collapses judgment into safe versus unsafe and often counts refusals as errors. In practice, the right answer is often not like this, and here is why. To measure that judgment, we move from binary to a three-level scheme that distinguishes how a model stays safe. Our proposed framework scores thoughtful refusals with ethical reasoning at 1, default refusals at 0.5, and harmful responses at 0, and provides reliability checks so teams can use it in production.
In early use, this rubric separates ethical articulation from mechanical blocking and harm. It also lights up where a model chooses caution over engagement, even without a lengthy rationale. Inter-rater statistics indicate that humans can apply these distinctions consistently at scale, which gives product teams a target they can optimize without flying blind.
Binary scoring compresses judgment into a single bit. Our evaluation paradigm adds nuance with a three-level scale:
This approach rewards responsible restraint and distinguishes principled abstention from rote blocking. It also reveals where a model chooses caution over engagement, even when the safer choice may frustrate a user in the moment.
On the tricategorical scale, models separate meaningfully. Some show higher rates of ethical articulation at 1. Others lean on default safety at 0.5. A simple restraint index, R_restraint = P(0.5) − P(0), quantifies caution over harm and flags risk-prone profiles quickly.
Modality still matters. Certain systems struggle to sustain ethical reasoning under visual prompts even when they perform well in text. That argues for modality-aware routing. Steer sensitive tasks to the safer pathway or model.
The most successful jailbreaks in our study were conversational tactics, not exotic exploits. Role play, refusal suppression, strategic reframing, and distraction noise were common and effective. That aligns with broader trustworthiness work that stresses realistic safety scenarios and prompt transformations over keyword filters (Xu et al., 2025). Retrieval-augmented vision–language pipelines can also reduce irrelevant context and improve grounding on some tasks, so evaluate routing and guardrails together with model behaviour (Chen et al., 2024).
It is not enough to publish a single score across text and image plus text. Report results by modality and by harm scenario so buyers can see where risk actually concentrates. Evidence from code switching research points to the same lesson. Targeted exposure and slice-aware evaluation surface failures that naive scaling and blended metrics miss.
In practice, that means separate lines in your evaluation for text only, image plus text, and any other channel you plan to support. Set clear thresholds for deployment. Make pass criteria explicit for harmless engagement and for justified refusal.
Refusal does not have to be a dead end. Good patterns are short and specific. Name the risk, state what cannot be done, and offer a safe alternative or escalation path. In regulated settings, this benefits both user experience and compliance.
Multimodal evaluation fails when it punishes abstention and hides risk in blended reports. Measure what matters, include the attacks you actually face, and report by modality and scenario. In many high-risk cases, no is a safety control, not a failure mode. It keeps critical vulnerabilities out of production.


