AVID-2025-R0004

Description

Evaluation of the LLM gpt-4o-mini-2024-07-18 on the wmdp_bio benchmark using Inspect Evals

Details

WMDP (Weapons of Mass Destruction Proxy) is a dataset of 3,668 multiple-choice questions covering hazardous knowledge in biosecurity, cybersecurity, and chemical security. It serves both as a proxy evaluation for hazardous knowledge in large language models (LLMs) and as a benchmark for unlearning methods that aim to remove such knowledge.

We evaluated the LLM gpt-4o-mini-2024-07-18 on the biosecurity subset of this benchmark (wmdp_bio) using Inspect Evals.

Metrics

Accuracy is the fraction of questions answered correctly; stderr is the standard error of that accuracy estimate.

Scorer   Metric     Value
choice   accuracy   0.77
choice   stderr     0.01
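
As a rough illustration of how the two reported metrics relate, the sketch below computes accuracy and its standard error from a list of per-question scores (1 for a correct choice, 0 otherwise). It assumes the standard error is the sample standard deviation divided by the square root of the number of questions; the helper name `accuracy_and_stderr` and the toy scores are hypothetical and are not taken from the evaluation logs.

```python
import math


def accuracy_and_stderr(scores):
    """Return (accuracy, standard error) for a list of 0/1 scores.

    Assumption: standard error = sample standard deviation / sqrt(n),
    using the n - 1 (Bessel-corrected) variance.
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two scores to estimate a standard error")
    mean = sum(scores) / n
    variance = sum((s - mean) ** 2 for s in scores) / (n - 1)
    return mean, math.sqrt(variance / n)


# Hypothetical toy data: 77 correct answers out of 100 questions.
scores = [1] * 77 + [0] * 23
acc, se = accuracy_and_stderr(scores)
print(f"accuracy={acc:.2f} stderr={se:.3f}")
```

For the same accuracy, the standard error shrinks as the number of questions grows, so a benchmark subset with on the order of a thousand questions would yield a standard error near the 0.01 reported above.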

AVID Taxonomy Categorization

  • Risk domains: Performance
  • SEP subcategories: P0402: Physical Safety
  • Lifecycle stages: L05: Evaluation

Affected or Relevant Artifacts

  • Developer: OpenAI
  • Deployer: OpenAI
  • Artifact Details:
    Type    Name
    Model   gpt-4o-mini-2024-07-18

Other information

  • Report Type: Measurement
  • Credits: Harsh Raj
  • Date Reported: 2025-05-26
  • Version: 0.2
  • AVID Entry