Breaking AI Defences: Best-of-N Jailbreaking and the Future of AI Security

Breaking AI Defences is one of those subjects that keeps surfacing in every security conversation. The concept of “jailbreaking” in AI refers to techniques that bypass safeguards in advanced AI models, enabling harmful outputs. A recent study introduced the Best-of-N (BoN) Jailbreaking method, an innovative approach to exploit these vulnerabilities across various input types, including text, vision, and audio.

BoN Jailbreaking works by repeatedly altering prompts with small changes, such as randomising text capitalisation or adjusting audio pitch, until a harmful response is elicited. This method has proven highly effective, achieving an attack success rate of up to 89% on state-of-the-art AI models, such as GPT-4 and Claude, when testing with 10,000 variations of a prompt.

Key findings:

  • Multi-modality effectiveness: BoN extends beyond text to bypass defences in vision and audio models, with similar success rates.
  • Scalability: The algorithm’s success improves predictably with more computational resources, following a power-law trend.
  • Adaptability: Combining BoN with other techniques like prefix optimisation significantly boosts its efficiency, reducing the number of attempts needed to bypass safeguards.

While this research highlights the ingenuity of attackers, it also underscores the pressing need for stronger, multi-modal defences in AI systems. Organisations relying on AI technologies must proactively evaluate and reinforce their security measures to stay ahead of these threats.

 

AI Tools
AI Tools

Read more from the original PDF here.

By understanding and addressing vulnerabilities like those exposed by BoN Jailbreaking, we can enhance the resilience and safety of AI technologies in an increasingly interconnected world.

If you have any related ideas, comments or views, feel free to share in the comments section below.

Related Reading

Subscribe

Related articles

AI Agents, Copilot and the New Security Risk: When Helpful Becomes Dangerous

AI agents are moving from passive assistants to active participants in the workplace. When connected to email, files, terminals and cloud services, they introduce a new class of security risk that requires governance, not just policies.

North Korean Hackers Poisoned 144 AI npm Packages: Check Your Dependencies Now

A North Korean state-sponsored group backdoored 144 Mastra AI npm packages with a malicious dayjs typosquat. The postinstall hook ran automatically on npm install, exposing developer machines and CI/CD pipelines to credential theft and full system compromise.

Your AI Agents Are Now a Security Risk: What the Last 48 Hours Proved

AutoJack, FortiBleed, and evolved LLMjacking show AI agents and self-hosted inference are now live attack surfaces. Here's what enterprises need to patch this week.

Your WordPress Site Just Leaked Its Keys: AI Makes That Exploit Even Worse

A major WordPress plugin vulnerability is leaking API keys and OAuth tokens right now. With AI-enabled phishing on the rise, that stolen data is more dangerous than ever.

The Rise of Autonomous AI Voice Agents: What It Means When the Machine Calls for You

AI voice agents have evolved into autonomous systems that negotiate bills, cancel subscriptions, and appeal insurance denials on your behalf. Here is how they work and what it means for consumers.