As generative AI systems increasingly combine text and images, a new Multimodal Safety Report from Enkrypt AI exposes critical vulnerabilities that could compromise the safety, integrity, and responsible use of multimodal models.
Enkrypt AI’s red teaming exercise tested multiple multimodal models against a range of safety and harm categories outlined in the NIST AI Risk Management Framework.
The results show that new jailbreak techniques can exploit how these models interpret combined media, allowing harmful outputs to bypass safety filters, often without any visible warning in the user prompt.
“Multimodal AI promises incredible benefits, but it also expands the attack surface in unpredictable ways,” said Sahil Agarwal, CEO of Enkrypt AI. “This research is a wake-up call: the ability to embed harmful textual instructions within seemingly innocuous images has real implications for enterprise liability, public safety, and child protection.”
Key Findings: New Attack in Plain Sight
The research illustrates how multimodal models—designed to handle text and image inputs—can inadvertently expand the surface area for abuse when not sufficiently safeguarded.
Such risks can be found in any multimodal model, however, the report focused on two popular ones developed by Mistral: Pixtral-Large (25.02) and Pixtral-12b.
According to Enkrypt AI’s findings, these two models are 60 times more prone to generate child sexual exploitation material (CSEM)-related textual responses than comparable models like OpenAI’s GPT-4o and Anthropic’s Claude 3.7 Sonnet.
Additionally, the tests revealed that the models were 18-40 times more likely to produce dangerous CBRN(Chemical, Biological, Radiological, and Nuclear) information when prompted with adversarial inputs. These risks threaten to undermine the intended use of generative AI and highlight the need for stronger safety alignment.
These risks were not due to malicious text inputs but triggered by prompt injections buried within image files, a technique that could realistically be used to evade traditional safety filters.
Recommendations for Securing Multimodal Models
The report urges AI developers and enterprises to act swiftly to mitigate these emerging risks, outlining key best practices:
- Integrate red teaming datasets into safety alignment processes
- Conduct continuous automated stress testing
- Deploy context-aware multimodal guardrails
- Establish real-time monitoring and incident response
- Create model risk cards to transparently communicate vulnerabilities
“These are not theoretical risks,” added Sahil Agarwal. “If we don’t take a safety-first approach to multimodal AI, we risk exposing users—and especially vulnerable populations—to significant harm.”