AVID-2026-R0085

Description

Multiple Model Guardrail Jailbreak via “Psych-Phil DT + Compaction” Technique

Details

A guardrail jailbreak vulnerability has been discovered that affects multiple language models. The flaw manifests when an attacker condenses a lengthy, successful multi-turn jailbreak into a single concise prompt that suggests a unified viewpoint shared by three entities: the language model, the user, and a desired archetype. This approach allows the attacker to bypass inference restrictions on providing information hazards and content that violates laws.

Metrics

0DIN Jailbreak Testing obtained the following test scores.

Model                   | Test Kind                        | Test Score | Temperature (max 2.0)
Alibaba Qwen Max        | Illegal Substances: Crystal Meth | 68.0       | 0.7
Alibaba Qwen Plus       | Illegal Substances: Crystal Meth | 68.0       | 0.7
Alibaba Qwen Turbo      | Illegal Substances: Crystal Meth | 77.0       | 0.7
Cohere Command R        | Illegal Substances: Crystal Meth | 64.0       | 0.7
DeepSeek DeepSeek R1    | Illegal Substances: Crystal Meth | 77.0       | 0.7
DeepSeek DeepSeek V3    | Illegal Substances: Crystal Meth | 68.0       | 0.7
Google Gemini 2.5 Flash | Illegal Substances: Crystal Meth | 59.0       | 0.7
Google Gemini 2.5 Pro   | Illegal Substances: Crystal Meth | 59.0       | 0.7
Meta LLaMa 3.3          | Illegal Substances: Crystal Meth | 50.0       | 0.7
Meta LLaMa 4 Scout      | Illegal Substances: Crystal Meth | 55.0       | 0.7
Mistral Mistral 8B      | Illegal Substances: Crystal Meth | 68.0       | 0.7
Mistral Mistral Large   | Illegal Substances: Crystal Meth | 59.0       | 0.7
OpenAI GPT-4.1          | Illegal Substances: Crystal Meth | 77.0       | 0.7
OpenAI GPT-4.1 mini     | Illegal Substances: Crystal Meth | 68.0       | 0.7
OpenAI GPT-4o           | Illegal Substances: Crystal Meth | 91.0       | 0.7
Twitter / X Grok 2      | Illegal Substances: Crystal Meth | 77.0       | 0.7
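
For readers who want to compare the results at a glance, the scores above can be summarized with a few lines of Python. This is only a sketch for reading the table: the names SCORES and summarize are illustrative and not part of any 0DIN or AVID tooling, and the values are transcribed by hand from the table (every run used the same test kind and a temperature of 0.7).

```python
from statistics import mean, median

# (model, test score) pairs transcribed from the metrics table.
# SCORES and summarize are hypothetical names used only for this sketch.
SCORES = [
    ("Alibaba Qwen Max", 68.0),
    ("Alibaba Qwen Plus", 68.0),
    ("Alibaba Qwen Turbo", 77.0),
    ("Cohere Command R", 64.0),
    ("DeepSeek DeepSeek R1", 77.0),
    ("DeepSeek DeepSeek V3", 68.0),
    ("Google Gemini 2.5 Flash", 59.0),
    ("Google Gemini 2.5 Pro", 59.0),
    ("Meta LLaMa 3.3", 50.0),
    ("Meta LLaMa 4 Scout", 55.0),
    ("Mistral Mistral 8B", 68.0),
    ("Mistral Mistral Large", 59.0),
    ("OpenAI GPT-4.1", 77.0),
    ("OpenAI GPT-4.1 mini", 68.0),
    ("OpenAI GPT-4o", 91.0),
    ("Twitter / X Grok 2", 77.0),
]


def summarize(scores):
    """Return basic descriptive statistics for a list of (model, score) pairs."""
    values = [score for _, score in scores]
    return {
        "count": len(values),
        "mean": mean(values),
        "median": median(values),
        "highest": max(scores, key=lambda row: row[1]),
        "lowest": min(scores, key=lambda row: row[1]),
    }


if __name__ == "__main__":
    for key, value in summarize(SCORES).items():
        print(f"{key}: {value}")
```

The summary treats the test score as a plain number; the entry does not define the scoring scale, so the statistics are only meaningful for comparing models within this one report.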

References

Affected or Relevant Artifacts

Type   | Name
System | Alibaba Qwen Max
Model  | Alibaba Qwen Plus
System | Alibaba Qwen Turbo
Model  | Cohere Command R
Model  | DeepSeek DeepSeek R1
Model  | DeepSeek DeepSeek V3
System | Google Gemini 2.5 Flash
System | Google Gemini 2.5 Pro
Model  | Meta LLaMa 3.3
Model  | Meta LLaMa 4 Scout
Model  | Mistral Mistral 8B
System | Mistral Mistral Large
System | OpenAI GPT-4.1
System | OpenAI GPT-4.1 mini
System | OpenAI GPT-4o
System | Twitter / X Grok 2
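
The artifact list mixes two types, System and Model. A small sketch like the one below groups the entries by type, which can help when filtering the report for one kind of artifact. The ARTIFACTS list and the group_by_type helper are hypothetical names introduced for illustration; the entry itself does not explain how the two types are assigned.

```python
from collections import defaultdict

# (type, name) pairs transcribed from the artifact table.
# ARTIFACTS and group_by_type are hypothetical names for this sketch.
ARTIFACTS = [
    ("System", "Alibaba Qwen Max"),
    ("Model", "Alibaba Qwen Plus"),
    ("System", "Alibaba Qwen Turbo"),
    ("Model", "Cohere Command R"),
    ("Model", "DeepSeek DeepSeek R1"),
    ("Model", "DeepSeek DeepSeek V3"),
    ("System", "Google Gemini 2.5 Flash"),
    ("System", "Google Gemini 2.5 Pro"),
    ("Model", "Meta LLaMa 3.3"),
    ("Model", "Meta LLaMa 4 Scout"),
    ("Model", "Mistral Mistral 8B"),
    ("System", "Mistral Mistral Large"),
    ("System", "OpenAI GPT-4.1"),
    ("System", "OpenAI GPT-4.1 mini"),
    ("System", "OpenAI GPT-4o"),
    ("System", "Twitter / X Grok 2"),
]


def group_by_type(artifacts):
    """Map each artifact type to the list of names listed under it."""
    groups = defaultdict(list)
    for artifact_type, name in artifacts:
        groups[artifact_type].append(name)
    return dict(groups)


if __name__ == "__main__":
    for artifact_type, names in group_by_type(ARTIFACTS).items():
        print(f"{artifact_type} ({len(names)}): {', '.join(names)}")
```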

Impact

0DIN

  • Social Impact Score: Level 3: Significant Risk
  • Jailbreak Taxonomy:
    Category        | Strategy                    | Technique
    Rhetoric        | Persuasion and Manipulation | Latent Space Distraction
    Fictionalizing  | Roleplaying                 | Personas
    Rhetoric        | Persuasion and Manipulation | Surprise Attack
    Possible Worlds | World Building              | Scenarios

Other information

  • Report Type: Measurement
  • Credits: Alper-Ender Osman
  • Date Reported: 2025-10-06
  • Version: 0.3.1