Beyond Transcription Challenge
End-to-end audio models hallucinate nearly every clinical claim they generate. Can we fix that?
↗ Listen to a sample conversation from the dataset
About
Generating clinical notes directly from audio — skipping transcription — is faster and cheaper, and avoids cascading ASR errors. But today's end-to-end models hallucinate at alarming rates: on the Synth-DoPaCo dataset, 99–100% of their clinical claims are unsupported by the source audio, compared to just 21–23% for traditional transcribe-then-summarize pipelines. BeTraC is a shared evaluation challenge to close this gap — building end-to-end speech models that are actually faithful enough to trust in healthcare.
Tracks
Both tracks require open-weight models only and share the same constraint: no intermediate transcription.
Participation Rules
The full rules document includes approved model and dataset lists, detailed parameter counting rules (MoE, PLE/MatFormer, omni-model stripping), and submission requirements. Additional models or datasets may be proposed for inclusion by May 11, 2026 (extended by one week from May 4, 2026); contact betrac@googlegroups.com.
For questions or to register your team, contact betrac@googlegroups.com.
Dataset
Fully synthetic doctor-patient conversations generated with open-weight, permissively licensed models. Speaker identities are strictly disjoint across splits. Audio features two speakers, 66 ambient sound classes, room reverberation, and Opus compression artifacts.
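The acoustic conditions above (ambient noise mixing and room reverberation) can be simulated roughly as follows. This is a minimal NumPy-only sketch, not the challenge's actual data pipeline: the `augment` function, its SNR parameter, and the use of a plain convolution with a room impulse response are illustrative assumptions, and the Opus compression step is omitted (it would require an external codec library).

```python
import numpy as np

def augment(speech: np.ndarray, noise: np.ndarray, rir: np.ndarray,
            snr_db: float = 15.0) -> np.ndarray:
    """Hypothetical augmentation sketch: convolve speech with a room
    impulse response, then mix ambient noise at the requested SNR.
    All signals are mono float arrays at the same sample rate."""
    # Apply reverberation, truncated back to the original length.
    reverberant = np.convolve(speech, rir)[: len(speech)]
    # Tile/trim the noise to match the speech length.
    noise = np.resize(noise, len(reverberant))
    # Scale noise so the speech-to-noise power ratio equals snr_db.
    speech_power = np.mean(reverberant ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return reverberant + scale * noise
```

An Opus encode/decode pass (e.g. via an external codec binding) would be applied after this step to reproduce the compression artifacts mentioned above.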
🤗 View on Hugging Face
Evaluation Metrics
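The leaderboard below reports concept-level precision, recall, and F1 (Concept F1, C-Prec, C-Recall). As a minimal sketch of how such set-overlap scores are computed — assuming clinical concepts have already been extracted and normalized into sets, which in the real metric would likely involve a medical concept extractor — the arithmetic is:

```python
def concept_prf(predicted: set[str], reference: set[str]) -> tuple[float, float, float]:
    """Illustrative sketch (not the official scorer): precision, recall,
    and F1 between predicted and reference clinical concept sets."""
    if not predicted and not reference:
        return 1.0, 1.0, 1.0
    tp = len(predicted & reference)  # concepts supported by the reference
    prec = tp / len(predicted) if predicted else 0.0
    rec = tp / len(reference) if reference else 0.0
    f1 = 2 * prec * rec / (prec + rec) if (prec + rec) else 0.0
    return prec, rec, f1
```

For example, a note asserting {"fever", "cough"} against a reference of {"fever", "nausea"} scores 0.5 on all three measures; the unsupported-claim rates quoted above correspond to low concept precision.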
Post-Competition Analysis (Top 5 per Track)
Top 5 systems per track will undergo additional LLM-as-a-judge evaluation (Faithfulness, Coverage, Structure, Conciseness) and out-of-domain evaluation on real recorded OSCE interviews.
Leaderboard
| # | System | Pipeline | Concept F1 ↓ | C-Prec | C-Recall | ROUGE-2 | ROUGE-3 | ROUGE-L | Words |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Qwen2.5-Omni-3B ↗ code | end-to-end | 0.2604 | 0.2891 | 0.2450 | 0.0920 | 0.0344 | 0.1797 | 380 |
| 2 | Qwen2.5-Omni-7B ↗ code | end-to-end | 0.2572 | 0.3070 | 0.2302 | 0.0950 | 0.0343 | 0.1837 | 350 |
| 3 | Qwen3-Omni-30B-A3B-Instruct ↗ code | end-to-end | 0.1879 | 0.1879 | 0.1964 | 0.0558 | 0.0175 | 0.1433 | 351 |
| 4 | Qwen3-Omni-30B-A3B-Thinking ↗ code | end-to-end | 0.1645 | 0.1694 | 0.1675 | 0.0472 | 0.0123 | 0.1314 | 316 |
* Baselines evaluated on the dev set. System description deadline: June 24, 2026.
Cascade reference systems (transcribe-then-summarize; shown for comparison, not eligible for the tracks):
| System | Pipeline | Concept F1 | C-Prec | C-Recall | ROUGE-2 | ROUGE-3 | ROUGE-L | Words |
|---|---|---|---|---|---|---|---|---|
| Whisper-large-v3 ASR + Qwen3-30B-A3B ↗ code | cascade | 0.2860 | 0.2964 | 0.2838 | 0.1174 | 0.0425 | 0.2243 | 261 |
| Qwen3-ASR-1.7B + Qwen3-30B-A3B ↗ code | cascade | 0.2772 | 0.2881 | 0.2741 | 0.1092 | 0.0374 | 0.2169 | 256 |
Schedule
Team