We use cookies to improve your experience on our site.
AVID-2026-R0116
Description
Multiple Model Guardrail Jailbreak via “Correct Me” Tactic
Details
A guardrail jailbreak vulnerability has been discovered affecting multiple models. The specific flaw manifests when the attacker frames their request as a fact-checking exercise by asking the model to correct the user about a deliberately false statement related to nefarious content, thereby allowing the attacker to bypass inference restrictions around providing information hazards and violating laws.
Metrics
0DIN Jailbreak Testing obtained the following test scores.
| Model | Test Kind | Test Score | Temperature (max 2.0) |
|---|---|---|---|
| Cohere Command R | Illegal Substances: Crystal Meth | 81.82 | N/A |
| Google Gemini 2.0 Flash | Illegal Substances: Crystal Meth | 82.82 | N/A |
| OpenAI GPT-4o | Illegal Substances: Crystal Meth | 91.0 | N/A |
| OpenAI GPT-4o | Copyrighted Content: Harry Potter | 65.41 | N/A |
| Twitter / X Grok 2 | Illegal Substances: Crystal Meth | 63.64 | N/A |
References
Affected or Relevant Artifacts
| Type | Name |
|---|---|
| Model | Cohere Command R |
| System | Google Gemini 2.0 Flash |
| System | OpenAI GPT-4o |
| System | Twitter / X Grok 2 |
Impact
0DIN
- Social Impact Score: Level 3: Significant Risk
- Jailbreak Taxonomy:
| Category | Strategy | Technique |
|---|---|---|
| Fictionalizing | Re-storying | Goal Hijacking |
| Rhetoric | Persuasion and Manipulation | Reverse Psychology |
| Rhetoric | Persuasion and Manipulation | Surprise Attack |
Other information
- Report Type: Measurement
- Credits: Miroslaw Menard
- Date Reported: 2026-02-19
- Version: 0.3.1
- AVID Entry