Beyond Transcription Challenge
Can end-to-end audio models reason faithfully over long clinical conversations without ever producing a transcript? BeTraC challenges participants to generate structured SOAP notes directly from raw doctor-patient audio.
About the Challenge
On Synth-DoPaCo, E2E models exhibit hallucination rates of 99–100% versus 21–23% for cascaded ASR pipelines — despite identical architectures. BeTraC targets this gap directly.
A reproducible benchmark for long-form E2E audio reasoning in clinical settings. The challenge aims to:

- Measure the delta between E2E and cascaded systems under controlled conditions.
- Drive new architectures and training strategies that close the hallucination gap.
- Test transfer from synthetic to real clinical audio via post-competition evaluation.
Competition Tracks
Both tracks require open-weight models only and share the same constraint: no intermediate transcription.
Systems must reason directly from raw audio to structured SOAP notes.
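To make the target output concrete, here is a minimal sketch of a structured SOAP note container. The class name, field names, and `to_text` serialization are illustrative assumptions; the official output schema will be defined by the organizers alongside the data release.

```python
from dataclasses import dataclass


@dataclass
class SOAPNote:
    """Hypothetical container for a structured SOAP note.

    The four sections follow the standard SOAP convention:
    Subjective, Objective, Assessment, Plan.
    """
    subjective: str = ""   # patient-reported history and symptoms
    objective: str = ""    # exam findings, vitals, measurements
    assessment: str = ""   # clinician's diagnostic impression
    plan: str = ""         # treatment, tests, follow-up

    def to_text(self) -> str:
        # Flatten the note into the familiar "S:/O:/A:/P:" layout.
        sections = [
            ("S", self.subjective),
            ("O", self.objective),
            ("A", self.assessment),
            ("P", self.plan),
        ]
        return "\n".join(f"{tag}: {body}" for tag, body in sections)
```

An E2E system for this challenge would map raw audio directly to the four fields above, with no intermediate transcript.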
Participation Rules
The following rules apply to all BeTraC submissions. Detailed guidelines will be published on the challenge website before the April 2 data release.
The organizing committee is still deciding whether to maintain a formal whitelist of approved models and/or datasets. This section will be updated once a decision is reached; participants are encouraged to check back before the training data release on April 2, 2026, or to contact the organizers directly.
For questions about eligibility, rule clarifications, or to register your team's intent to participate, contact the organizing committee at betrac@googlegroups.com. Full submission instructions will be published alongside the data release on April 2, 2026.
Dataset
Fully synthetic doctor-patient conversations generated with open-weight, permissively licensed models. Speaker identities are strictly disjoint across splits. Audio features two speakers, 66 ambient sound classes, room reverberation, and Opus compression artifacts.
🤗 View on Hugging Face

Evaluation Metrics
Post-Competition Analysis (Top 5 per Track)
LLM-as-a-Judge
Reference-free scoring across 4 dimensions (1–5): Faithfulness, Coverage, Structure, and Conciseness. Additional analyses include over-/under-medicalization, over-specificity, missed facts, critical omissions, duplicated content, unsupported claims, and contradictions.
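As a purely illustrative sketch, per-dimension judge ratings could be aggregated by averaging across judged notes. The dimension names follow the list above; the function name and the mean-based aggregation are assumptions, not the official protocol.

```python
from statistics import mean

# The four judged dimensions named in the challenge description.
JUDGE_DIMENSIONS = ("faithfulness", "coverage", "structure", "conciseness")


def aggregate_judge_scores(ratings):
    """Average 1-5 judge ratings per dimension across examples.

    ratings: list of dicts, one per judged note, mapping each
    dimension name to an integer score in [1, 5].
    """
    return {
        dim: mean(r[dim] for r in ratings)
        for dim in JUDGE_DIMENSIONS
    }
```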
Out-of-Domain Generalization
Evaluated on 20–30 real OSCE interviews (Fareez et al.) to test transfer from synthetic to real recorded audio.
Leaderboard
Ranked by Open Medical Concept F1 (primary), with ROUGE-2/3/L as secondary.
Winners determined separately per track.
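A minimal sketch of a set-based concept F1, assuming the metric compares extracted medical concept strings between prediction and reference. The official scorer will be released by the organizers and may differ (e.g., it could use a medical concept extractor and fuzzy matching); the function below is an assumption for illustration.

```python
def concept_f1(predicted, reference):
    """Set-based F1 over medical concept strings (illustrative sketch).

    predicted, reference: iterables of concept strings. Matching here
    is exact after lowercasing and whitespace stripping, which is a
    simplifying assumption.
    """
    pred = {c.strip().lower() for c in predicted}
    ref = {c.strip().lower() for c in reference}
    if not pred and not ref:
        return 1.0  # both empty: trivially perfect agreement
    tp = len(pred & ref)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For example, predicting {hypertension, aspirin} against a reference of {hypertension, aspirin, dyspnea} gives precision 1.0, recall 2/3, and F1 0.8.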
| # | Team | System / Model | Med Concept F1 ↑ | ROUGE-2 | ROUGE-L |
|---|---|---|---|---|---|
| | *No submissions yet. Results will appear here after the submission deadline (June 24, 2026).* | | | | |
* Baseline (Qwen2.5-Omni-3B) will be published alongside the training data release on April 2, 2026.
| # | Team | System / Model | Med Concept F1 ↑ | ROUGE-2 | ROUGE-L |
|---|---|---|---|---|---|
| | *No submissions yet. Results will appear here after the submission deadline (June 24, 2026).* | | | | |
* Baseline (best of Qwen2.5-Omni-3B / Qwen3-Omni-30B variants) will be published alongside the training data release on April 2, 2026.
Schedule
Team