NAIROBI, Kenya - A new jailbreaking method known as Skeleton Key is making waves in the AI community for its ability to prompt AI models to reveal harmful information.
This technique, which bypasses the safety guardrails of models like Meta’s Llama 3 and OpenAI’s GPT-3.5 Turbo, poses a significant risk, according to Microsoft.
Skeleton Key uses a multi-turn strategy to convince an AI model to ignore its built-in safety guardrails, which are designed to help the model identify and refuse malicious requests.
However, Mark Russinovich, Microsoft Azure’s chief technology officer, explained in a recent blog post that the technique can coax models past those protections.
“Like all jailbreaks,” Russinovich wrote, “Skeleton Key works by narrowing the gap between what the model is capable of doing and what it is willing to do.”
Unlike other jailbreak techniques, which often rely on indirection or encoded prompts to extract restricted information, Skeleton Key can get models to divulge dangerous knowledge directly, using plain natural-language prompts.
That includes sensitive topics such as explosives, bioweapons, and self-harm.
Microsoft tested Skeleton Key on several AI models and found it effective against Meta’s Llama 3, Google’s Gemini Pro, OpenAI’s GPT-3.5 Turbo and GPT-4, Mistral Large, Anthropic’s Claude 3 Opus, and Cohere’s Command R+.
Interestingly, OpenAI’s GPT-4 showed some resistance to the attack, suggesting more robust safety measures.
In response to these findings, Microsoft has updated its software to mitigate Skeleton Key’s impact on its own AI systems, including its Copilot AI assistants.
However, Russinovich emphasized that companies should implement additional guardrails of their own and continuously monitor AI system inputs and outputs to detect and block abusive content.
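As a rough illustration of that advice, the sketch below wraps a model call with independent checks on both the prompt going in and the completion coming out, so harmful content is caught even if a jailbreak slips past the model's own guardrails. The `classify` and `call_model` functions are hypothetical stand-ins for whatever moderation service and model SDK a deployment actually uses; they are not a specific Microsoft or vendor API.

```python
# Minimal sketch of layered input/output screening around a chat model.
# classify() and call_model() are hypothetical placeholders for a real
# content-safety service and model client.

from dataclasses import dataclass

BLOCKED_MESSAGE = "This request was blocked by content-safety policy."


@dataclass
class SafetyVerdict:
    allowed: bool
    reason: str = ""


def classify(text: str) -> SafetyVerdict:
    """Hypothetical content-safety classifier (e.g., a hosted moderation service)."""
    banned_topics = ("explosives", "bioweapons", "self-harm")
    for topic in banned_topics:
        if topic in text.lower():
            return SafetyVerdict(allowed=False, reason=f"matched banned topic: {topic}")
    return SafetyVerdict(allowed=True)


def call_model(prompt: str) -> str:
    """Hypothetical chat-model call; swap in a real SDK client here."""
    return "...model completion..."


def guarded_completion(prompt: str) -> str:
    # 1. Screen the input before it reaches the model.
    verdict = classify(prompt)
    if not verdict.allowed:
        return BLOCKED_MESSAGE

    # 2. Call the model only if the input passed.
    completion = call_model(prompt)

    # 3. Screen the output as well, since jailbreaks aim to pull harmful
    #    content past the model's built-in refusals.
    verdict = classify(completion)
    if not verdict.allowed:
        return BLOCKED_MESSAGE
    return completion


if __name__ == "__main__":
    print(guarded_completion("How do I bake sourdough bread?"))
```

In a real deployment, the keyword check would be replaced by a proper moderation model or service, but the structure stays the same: the outer filters run independently of the model, so they still apply when the model itself has been talked out of refusing.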