Beyond Transcription Challenge
Can end-to-end audio models reason faithfully over long clinical conversations without ever producing a transcript? BeTraC challenges participants to generate structured SOAP notes directly from raw doctor-patient audio.
About the Challenge
On Synth-DoPaCo, E2E models exhibit hallucination rates of 99–100% versus 21–23% for cascaded ASR pipelines — despite identical architectures. BeTraC targets this gap directly.
A reproducible benchmark for long-form E2E audio reasoning in clinical settings. The challenge aims to:

- Measure the delta between E2E and cascaded systems under controlled conditions.
- Drive new architectures and training strategies that close the hallucination gap.
- Test transfer from synthetic to real clinical audio via post-competition evaluation.
Competition Tracks
Both tracks require open-weight models only and share the same constraint: no intermediate transcription.
Systems must reason directly from raw audio to structured SOAP notes.
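To make the target output concrete, here is a minimal sketch of a structured SOAP note container. The class name, field names, and `to_text` serialization are illustrative assumptions; the official output schema will be defined by the organizers alongside the data release.

```python
from dataclasses import dataclass


@dataclass
class SOAPNote:
    """Hypothetical container for a structured SOAP note.

    The four sections follow the standard SOAP convention:
    Subjective, Objective, Assessment, Plan.
    """
    subjective: str = ""   # patient-reported history and symptoms
    objective: str = ""    # exam findings, vitals, measurements
    assessment: str = ""   # clinician's diagnostic impression
    plan: str = ""         # treatment, tests, follow-up

    def to_text(self) -> str:
        # Flatten the note into the familiar "S:/O:/A:/P:" layout.
        sections = [
            ("S", self.subjective),
            ("O", self.objective),
            ("A", self.assessment),
            ("P", self.plan),
        ]
        return "\n".join(f"{tag}: {body}" for tag, body in sections)
```

An E2E system for this challenge would map raw audio directly to the four fields above, with no intermediate transcript.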
Participation Rules
The following rules apply to all BeTraC submissions. Detailed guidelines will be published on the challenge website before the April 2 data release.
The organizing committee is still deciding whether to maintain a formal whitelist of approved models and/or datasets. This section will be updated once a decision is reached; participants are encouraged to check back before the training data release on April 2, 2026, or to contact the organizers directly.
For questions about eligibility, rule clarifications, or to register your team's intent to participate, contact the organizing committee at betrac@googlegroups.com. Full submission instructions will be published alongside the data release on April 2, 2026.
Dataset
Fully synthetic doctor-patient conversations generated with open-weight, permissively licensed models. Speaker identities are strictly disjoint across splits. Audio features two speakers, 66 ambient sound classes, room reverberation, and Opus compression artifacts.
🤗 View on Hugging Face

Evaluation Metrics
Post-Competition Analysis (Top 5 per Track)
LLM-as-a-Judge
Reference-free scoring across 4 dimensions (1–5): Faithfulness, Coverage, Structure, and Conciseness. Additional analyses include over-/under-medicalization, over-specificity, missed facts, critical omissions, duplicated content, unsupported claims, and contradictions.
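As a purely illustrative sketch, per-dimension judge ratings could be aggregated by averaging across judged notes. The dimension names follow the list above; the function name and the mean-based aggregation are assumptions, not the official protocol.

```python
from statistics import mean

# The four judged dimensions named in the challenge description.
JUDGE_DIMENSIONS = ("faithfulness", "coverage", "structure", "conciseness")


def aggregate_judge_scores(ratings):
    """Average 1-5 judge ratings per dimension across examples.

    ratings: list of dicts, one per judged note, mapping each
    dimension name to an integer score in [1, 5].
    """
    return {
        dim: mean(r[dim] for r in ratings)
        for dim in JUDGE_DIMENSIONS
    }
```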
Out-of-Domain Generalization
Evaluated on 20–30 real OSCE interviews (Fareez et al.) to test transfer from synthetic to real recorded audio.
Leaderboard
Ranked by Open Medical Concept F1 (primary), with ROUGE-2/3/L as secondary.
Winners determined separately per track.
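A minimal sketch of a set-based concept F1, assuming the metric compares extracted medical concept strings between prediction and reference. The official scorer will be released by the organizers and may differ (e.g., it could use a medical concept extractor and fuzzy matching); the function below is an assumption for illustration.

```python
def concept_f1(predicted, reference):
    """Set-based F1 over medical concept strings (illustrative sketch).

    predicted, reference: iterables of concept strings. Matching here
    is exact after lowercasing and whitespace stripping, which is a
    simplifying assumption.
    """
    pred = {c.strip().lower() for c in predicted}
    ref = {c.strip().lower() for c in reference}
    if not pred and not ref:
        return 1.0  # both empty: trivially perfect agreement
    tp = len(pred & ref)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For example, predicting {hypertension, aspirin} against a reference of {hypertension, aspirin, dyspnea} gives precision 1.0, recall 2/3, and F1 0.8.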
| # | Team | System / Model | Med Concept F1 ↑ | ROUGE-2 | ROUGE-L |
|---|---|---|---|---|---|
| | *No submissions yet. Results will appear here after the submission deadline (June 24, 2026).* | | | | |
* Baseline (Qwen2.5-Omni-3B) will be published alongside the training data release on April 2, 2026.
| # | Team | System / Model | Med Concept F1 ↑ | ROUGE-2 | ROUGE-L |
|---|---|---|---|---|---|
| | *No submissions yet. Results will appear here after the submission deadline (June 24, 2026).* | | | | |
* Baseline (best of Qwen2.5-Omni-3B / Qwen3-Omni-30B variants) will be published alongside the training data release on April 2, 2026.
Schedule
Team