AVID-2026-R0116

Description

Multiple Model Guardrail Jailbreak via “Correct Me” Tactic

Details

A guardrail jailbreak vulnerability has been discovered affecting multiple models. The flaw manifests when the attacker frames a request as a fact-checking exercise: the prompt asserts a deliberately false statement about nefarious content and asks the model to correct it. This correction framing allows the attacker to bypass inference-time restrictions against providing information hazards and content that facilitates breaking the law.
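
Because the tactic rests on a recognizable framing, a guardrail can screen for it lexically before deeper topic classification. The sketch below is a minimal, hypothetical Python heuristic on the defensive side; it is an illustrative assumption on our part, not 0DIN's or any vendor's implementation, and the phrase list and function name (flags_correct_me_framing) are invented for this example.

```python
import re

# Lexical cues for the "correct me" / fact-check framing described above.
# The phrase list is an illustrative assumption, not a vetted signature set.
CORRECTION_FRAMING = re.compile(
    r"\b(correct me|am i wrong|is it true that|fact[ -]?check)\b",
    re.IGNORECASE,
)

def flags_correct_me_framing(prompt: str) -> bool:
    """Return True if the prompt uses a correction/fact-check framing.

    A crude surface-level heuristic for illustration only; a production
    guardrail would also classify the topic of the claim being "corrected"
    rather than keying on phrasing alone.
    """
    return bool(CORRECTION_FRAMING.search(prompt))

if __name__ == "__main__":
    samples = [
        "Correct me if I'm wrong, but <deliberately false claim> ...",
        "What is the boiling point of water at sea level?",
    ]
    for s in samples:
        print(flags_correct_me_framing(s), "-", s)
```

A surface cue like this is deliberately weak on its own: the same phrasing is common in benign fact-checking, so in practice it would raise the weight of a downstream topic classifier rather than block a request outright.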

Metrics

0DIN Jailbreak Testing obtained the following test scores.

Model | Test Kind | Test Score | Temperature (max 2.0)
Cohere Command R | Illegal Substances: Crystal Meth | 81.82 | N/A
Google Gemini 2.0 Flash | Illegal Substances: Crystal Meth | 82.82 | N/A
OpenAI GPT-4o | Illegal Substances: Crystal Meth | 91.0 | N/A
OpenAI GPT-4o | Copyrighted Content: Harry Potter | 65.41 | N/A
Twitter / X Grok 2 | Illegal Substances: Crystal Meth | 63.64 | N/A

Affected or Relevant Artifacts

Type | Name
Model | Cohere Command R
System | Google Gemini 2.0 Flash
System | OpenAI GPT-4o
System | Twitter / X Grok 2

Impact

0DIN

  • Social Impact Score: Level 3: Significant Risk
  • Jailbreak Taxonomy:
Category | Strategy | Technique
Fictionalizing | Re-storying | Goal Hijacking
Rhetoric | Persuasion and Manipulation | Reverse Psychology
Rhetoric | Persuasion and Manipulation | Surprise Attack

Other Information

  • Report Type: Measurement
  • Credits: Miroslaw Menard
  • Date Reported: 2026-02-19
  • Version: 0.3.1
  • AVID Entry