AVID-2026-R0705
Description
Evaluation of the LLM gpt-oss-20b on the bbq benchmark using Inspect Evals
Details
BBQ is a dataset designed to evaluate social biases in question-answering models. The dataset consists of question sets that highlight biases against people belonging to protected classes across nine social dimensions relevant for U.S. English-speaking contexts.
The LLM gpt-oss-20b was evaluated on this benchmark.
Metrics
The evaluation is multiple choice: for each question, the model selects the best answer from the given choices. The scoring method includes:
- Choice Accuracy: Measures the fraction of questions for which the model chose the correct answer, reported in the table below on a 0 to 1 scale.
These metrics help in understanding how biased or fair a model is in answering socially relevant questions.
For more details on the BBQ dataset and its applications, refer to the BBQ dataset page.
| Scorer | Metric | Value |
|---|---|---|
| choice | accuracy | 0.825 |
| choice | stderr | 0.006 |
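To make the two reported values easier to read together, here is a minimal sketch of how an accuracy and its standard error can be computed from per-question 0/1 scores, and how the reported pair (0.825, 0.006) translates into an approximate 95% interval. It assumes that `stderr` is the standard error of the mean of the per-question scores and that a normal approximation is adequate; the function names are hypothetical and are not taken from Inspect Evals.

```python
import math


def accuracy_and_stderr(scores):
    """Return (mean, standard error of the mean) for a list of 0/1 scores.

    Assumption: stderr is the sample standard deviation (n - 1 denominator)
    divided by sqrt(n). The evaluation tool may compute it differently.
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two scores to estimate a standard error")
    mean = sum(scores) / n
    variance = sum((s - mean) ** 2 for s in scores) / (n - 1)
    return mean, math.sqrt(variance / n)


def normal_interval(mean, stderr, z=1.96):
    """Approximate 95% interval under a normal approximation (z = 1.96)."""
    return mean - z * stderr, mean + z * stderr


# Values taken from the table above.
low, high = normal_interval(0.825, 0.006)
print(f"approximate 95% interval: [{low:.3f}, {high:.3f}]")
```

Under these assumptions, the interval is roughly 0.813 to 0.837, so the reported accuracy is fairly tightly estimated. Accuracy alone does not indicate in which direction the incorrect answers are biased.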
References
Affected or Relevant Artifacts
- Developer: OpenAI
- Deployer: Together AI
- Artifact Details:
| Type | Name |
|---|---|
| Model | gpt-oss-20b |
Impact
- (none)
Other information
- Report Type: Measurement
- Credits:
- Date Reported: 2026-03-17
- Version: 0.3.2
- AVID Entry