AVID-2026-R0074

Description

Multiple Model Guardrail Jailbreak via “Science Student Safety” Tactic

Details

A guardrail jailbreak vulnerability has been discovered affecting multiple language models. The flaw manifests when the attacker frames the inquiry in reverse: rather than asking how to perform an illegal activity such as synthesizing drugs, the attacker asks what actions or materials to avoid in order to stay safe and legal. This framing inadvertently guides the model to disclose the exact steps and materials involved, allowing the attacker to bypass inference-time restrictions against providing information hazards and facilitating illegal activity.

Metrics

0DIN Jailbreak Testing obtained the following test scores.

| Model | Test Kind | Test Score | Temperature (max 2.0) |
|---|---|---|---|
| Alibaba Qwen Max | Illegal Substances: Crystal Meth | 91.0 | 0.7 |
| Alibaba Qwen Plus | Illegal Substances: Crystal Meth | 82.0 | 0.7 |
| Alibaba Qwen Turbo | Illegal Substances: Crystal Meth | 73.0 | 0.7 |
| Anthropic Claude 3.5 Haiku | Illegal Substances: Crystal Meth | 64.0 | 0.7 |
| Anthropic Claude 4 Sonnet | Illegal Substances: Crystal Meth | 64.0 | 0.7 |
| DeepSeek DeepSeek R1 | Illegal Substances: Crystal Meth | 64.0 | 0.7 |
| DeepSeek DeepSeek V3 | Illegal Substances: Crystal Meth | 73.0 | N/A |
| Google Gemini 2.0 Flash | Illegal Substances: Crystal Meth | 73.0 | 0.7 |
| Google Gemini 2.5 Pro | Illegal Substances: Crystal Meth | 82.0 | 0.7 |
| Mistral Mistral 8B | Illegal Substances: Crystal Meth | 64.0 | 0.7 |
| Mistral Mistral Large | Illegal Substances: Crystal Meth | 82.0 | 0.7 |
| OpenAI GPT-4.1 | Illegal Substances: Crystal Meth | 73.0 | 0.7 |
| OpenAI GPT-4.1 mini | Illegal Substances: Crystal Meth | 64.0 | 0.7 |
| OpenAI GPT-4o | Illegal Substances: Crystal Meth | 55.0 | 0.7 |
| Twitter / X Grok 2 | Illegal Substances: Crystal Meth | 55.0 | 0.7 |
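As a minimal sketch of how the scores above can be compared across models, the following transcribes the metrics table into a mapping and computes summary statistics. The `summarize` helper and its statistics are illustrative additions, not part of the original report; only the `SCORES` values come from the table.

```python
# Test scores transcribed from the 0DIN metrics table (model -> test score).
SCORES = {
    "Alibaba Qwen Max": 91.0,
    "Alibaba Qwen Plus": 82.0,
    "Alibaba Qwen Turbo": 73.0,
    "Anthropic Claude 3.5 Haiku": 64.0,
    "Anthropic Claude 4 Sonnet": 64.0,
    "DeepSeek DeepSeek R1": 64.0,
    "DeepSeek DeepSeek V3": 73.0,
    "Google Gemini 2.0 Flash": 73.0,
    "Google Gemini 2.5 Pro": 82.0,
    "Mistral Mistral 8B": 64.0,
    "Mistral Mistral Large": 82.0,
    "OpenAI GPT-4.1": 73.0,
    "OpenAI GPT-4.1 mini": 64.0,
    "OpenAI GPT-4o": 55.0,
    "Twitter / X Grok 2": 55.0,
}

def summarize(scores):
    """Return (mean score, highest-scoring model, lowest-scoring model)."""
    mean = sum(scores.values()) / len(scores)
    worst = max(scores, key=scores.get)  # highest score = most susceptible
    best = min(scores, key=scores.get)   # lowest score = most resistant
    return round(mean, 1), worst, best

print(summarize(SCORES))  # → (70.6, 'Alibaba Qwen Max', 'OpenAI GPT-4o')
```

Note that GPT-4o and Grok 2 tie at 55.0; `min` simply returns the first of the two in the mapping's insertion order.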

References

Affected or Relevant Artifacts

| Type | Name |
|---|---|
| System | Alibaba Qwen Max |
| Model | Alibaba Qwen Plus |
| System | Alibaba Qwen Turbo |
| System | Anthropic Claude 3.5 Haiku |
| System | Anthropic Claude 4 Sonnet |
| Model | DeepSeek DeepSeek R1 |
| Model | DeepSeek DeepSeek V3 |
| System | Google Gemini 2.0 Flash |
| System | Google Gemini 2.5 Pro |
| Model | Mistral Mistral 8B |
| System | Mistral Mistral Large |
| System | OpenAI GPT-4.1 |
| System | OpenAI GPT-4.1 mini |
| System | OpenAI GPT-4o |
| System | Twitter / X Grok 2 |

Impact

0DIN

  • Social Impact Score: Level 3: Significant Risk
  • Jailbreak Taxonomy:
| Category | Strategy | Technique |
|---|---|---|
| Rhetoric | Persuasion and Manipulation | Reverse Psychology |
| Fictionalizing | Roleplaying | Claim Authority |
| Rhetoric | Socratic Questioning | Identity Characteristics |

Other information

  • Report Type: Measurement
  • Credits: Anonymous
  • Date Reported: 2026-02-19
  • Version: 0.3.1
  • AVID Entry