AVID-2026-R0116

Description

Multiple Model Guardrail Jailbreak via “Correct Me” Tactic

Details

A guardrail jailbreak vulnerability has been discovered affecting multiple models. The flaw manifests when the attacker frames a request as a fact-checking exercise: the prompt asserts a deliberately false statement about nefarious content and asks the model to correct it. This correction framing allows the attacker to bypass inference-time restrictions against providing information hazards and content that facilitates breaking the law.
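
Because the tactic rests on a recognizable framing, a guardrail can screen for it lexically before deeper topic classification. The sketch below is a minimal, hypothetical Python heuristic on the defensive side; it is an illustrative assumption on our part, not 0DIN's or any vendor's implementation, and the phrase list and function name (flags_correct_me_framing) are invented for this example.

```python
import re

# Lexical cues for the "correct me" / fact-check framing described above.
# The phrase list is an illustrative assumption, not a vetted signature set.
CORRECTION_FRAMING = re.compile(
    r"\b(correct me|am i wrong|is it true that|fact[ -]?check)\b",
    re.IGNORECASE,
)

def flags_correct_me_framing(prompt: str) -> bool:
    """Return True if the prompt uses a correction/fact-check framing.

    A crude surface-level heuristic for illustration only; a production
    guardrail would also classify the topic of the claim being "corrected"
    rather than keying on phrasing alone.
    """
    return bool(CORRECTION_FRAMING.search(prompt))

if __name__ == "__main__":
    samples = [
        "Correct me if I'm wrong, but <deliberately false claim> ...",
        "What is the boiling point of water at sea level?",
    ]
    for s in samples:
        print(flags_correct_me_framing(s), "-", s)
```

A surface cue like this is deliberately weak on its own: the same phrasing is common in benign fact-checking, so in practice it would raise the weight of a downstream topic classifier rather than block a request outright.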

Metrics

0DIN Jailbreak Testing obtained the following test scores.

Model | Test Kind | Test Score | Temperature (max 2.0)
Cohere Command R | Illegal Substances: Crystal Meth | 81.82 | N/A
Google Gemini 2.0 Flash | Illegal Substances: Crystal Meth | 82.82 | N/A
OpenAI GPT-4o | Illegal Substances: Crystal Meth | 91.0 | N/A
OpenAI GPT-4o | Copyrighted Content: Harry Potter | 65.41 | N/A
Twitter / X Grok 2 | Illegal Substances: Crystal Meth | 63.64 | N/A

Affected or Relevant Artifacts

Type | Name
Model | Cohere Command R
System | Google Gemini 2.0 Flash
System | OpenAI GPT-4o
System | Twitter / X Grok 2

Impact

0DIN

  • Social Impact Score: Level 3: Significant Risk
  • Jailbreak Taxonomy:
Category | Strategy | Technique
Fictionalizing | Re-storying | Goal Hijacking
Rhetoric | Persuasion and Manipulation | Reverse Psychology
Rhetoric | Persuasion and Manipulation | Surprise Attack

Other Information

  • Report Type: Measurement
  • Credits: Miroslaw Menard
  • Date Reported: 2026-02-19
  • Version: 0.3.1
  • AVID Entry