IEEE SLT 2026 Challenge

BeTraC

Beyond Transcription Challenge

End-to-end audio models hallucinate nearly every clinical claim they generate. Can we fix that?

  • 🤗 Access the dataset on Hugging Face
  • Baseline code on GitHub
  • Register your team
  • Join the community on Zulip
Dataset at a Glance
  • Total conversations: 8,800
  • Total audio: ~1,329 hrs
  • Avg. duration: ~9 min
  • Competition tracks: 2
  • Ambient sound classes: 66

↗ Listen to a sample conversation from the dataset (2 speakers, ambient noise)


The Challenge

Generating clinical notes directly from audio — skipping transcription — is faster, cheaper, and avoids cascading ASR errors. But today's end-to-end models hallucinate at alarming rates: on the Synth-DoPaCo dataset, 99–100% of their clinical claims are unsupported by the source audio, compared to just 21–23% for traditional transcribe-then-summarize pipelines. BeTraC is a shared evaluation challenge to close this gap — building end-to-end speech models that are actually faithful enough to trust in healthcare.


Competition Tracks

Both tracks require open-weight models only and share the same constraint: no intermediate transcription.

Lightweight Track
≤ 6B (Total) Parameters
  • Open-weight models up to 6B parameters (see rules for how to count!)
  • Direct E2E audio-to-SOAP pipeline
  • No tool use or agentic pipelines
  • No intermediate transcription at any stage
  • Baseline: Qwen2.5-Omni-3B
Heavyweight Track
≤ 36B (Total) Parameters
  • Open-weight models up to 36B parameters (see rules for how to count!)
  • Tool use & agentic architectures allowed
  • Chain-of-thought agents, RAG pipelines
  • No intermediate transcription at any stage
  • Baseline: Best of Qwen2.5-Omni-3B / Qwen3-Omni-30B variants

Rules

Hard Constraints (submissions violating these are ineligible)
  • No intermediate transcription passed between separate models or pipeline components, including as tool outputs. The final model's chain-of-thought is unconstrained in form, but the model must not have been trained to produce a transcript of the audio as a CoT target. See full CoT rule.
  • Open-weight models only. Proprietary API-based models (GPT-4o, Gemini, etc.) are not permitted.
  • Parameter cap enforced per track. Lightweight: ≤ 6B. Heavyweight: ≤ 36B. Total parameters count — not active (MoE) or effective (PLE/MatFormer). See full counting rules.
  • No tool use in Lightweight Track. Agentic architectures are Heavyweight-only.
  • No use of withheld test labels at any stage.
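The "total, not active" parameter rule above can be checked mechanically. A minimal sketch of the counting convention (toy shapes, purely illustrative): for an MoE model, every expert counts toward the cap, regardless of how many are routed per token.

```python
from math import prod

def total_params(shapes):
    """Total parameter count: sum over ALL tensors, including every
    MoE expert, not just the ones active per token."""
    return sum(prod(s) for s in shapes)

# Toy MoE layer: 8 experts of 1e6 params each plus a 1e4-param router.
# Only 2 experts may fire per token, but all 8 count toward the cap.
expert_shapes = [(1000, 1000)] * 8
router_shape = [(1000, 10)]
print(total_params(expert_shapes + router_shape))  # 8010000
```

For an actual checkpoint loaded with `transformers`, the equivalent count is `sum(p.numel() for p in model.parameters())`; see the full rules for track-specific details such as omni-model stripping.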
Allowed & Required
  • Fine-tuning on the provided training split (7,200 conversations) is permitted.
  • Dev set may be used freely for model selection and tuning.
  • System description paper required for all ranked submissions (due July 8, 2026).
  • One submission per team per track. Teams receive test audio after submitting a system description by the deadline.
  • External data is subject to the whitelist policy below. Proposals due May 11, 2026 (extended one week from May 4).

Submission Format
  • Plain-text SOAP note (see reference notes in training data for expected format).
  • Each audio file must be processed independently — no cross-file context.
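The independence requirement can be honored with a per-file loop that shares no state between files. A minimal sketch, where `generate_soap` is a hypothetical placeholder for your end-to-end model call:

```python
from pathlib import Path

def generate_soap(audio_path: Path) -> str:
    """Hypothetical placeholder for an end-to-end model call.
    It must see only this one file: no transcript, no shared cache,
    no context carried over from previously processed files."""
    return f"S: ...\nO: ...\nA: ...\nP: ...  ({audio_path.name})"

def run(audio_dir: str, out_dir: str) -> None:
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for wav in sorted(Path(audio_dir).glob("*.wav")):
        note = generate_soap(wav)                 # fresh call per file
        (out / f"{wav.stem}.txt").write_text(note)  # one plain-text note each
```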
Approved Models & Datasets

The full rules document includes approved model and dataset lists, detailed parameter counting rules (MoE, PLE/MatFormer, omni-model stripping), and submission requirements. Additional models or datasets may be proposed for inclusion by May 11, 2026 (extended one week from the original May 4, 2026 deadline); contact betrac@googlegroups.com.

📋 Team Registration
Ready to participate? Fill in the registration form to officially enroll your team.
Registration is required before submitting results. The form takes about 2 minutes to complete.
Register Your Team →
Questions & Registration

For questions or to register your team, contact betrac@googlegroups.com.


Synth-DoPaCo

Fully synthetic doctor-patient conversations generated with open-weight, permissively licensed models. Speaker identities are strictly disjoint across splits. Audio features two speakers, 66 ambient sound classes, room reverberation, and Opus compression artifacts.

🤗  View on Hugging Face

Evaluation Metrics

  • Primary: Open Medical Concept F1 (MeSH keyword matching + NER via scispaCy; computed by the btc-eval harness)
  • Secondary: ROUGE F1 (R-2, R-3, R-L against reference notes; computed by the btc-eval harness)
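The authoritative implementation is the btc-eval harness. Illustratively, once concepts have been extracted (via MeSH matching and scispaCy NER in the real pipeline, stubbed out as plain strings here), a concept-level F1 is a set comparison:

```python
def concept_f1(pred_concepts, ref_concepts):
    """Set-level precision/recall/F1 over normalized medical concepts.
    Concept extraction itself is out of scope here; the real harness
    uses MeSH keyword matching plus scispaCy NER."""
    pred, ref = set(pred_concepts), set(ref_concepts)
    if not pred or not ref:
        return 0.0, 0.0, 0.0
    tp = len(pred & ref)                      # concepts in both note and reference
    prec = tp / len(pred)
    rec = tp / len(ref)
    f1 = 2 * prec * rec / (prec + rec) if tp else 0.0
    return prec, rec, f1

p, r, f = concept_f1(
    {"hypertension", "lisinopril", "headache"},
    {"hypertension", "headache", "nausea", "dizziness"},
)
# prec = 2/3, rec = 2/4 = 0.5, f1 = 4/7 ≈ 0.571
```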

Post-Competition Analysis (Top 5 per Track)

Top 5 systems per track will undergo additional LLM-as-a-judge evaluation (Faithfulness, Coverage, Structure, Conciseness) and out-of-domain evaluation on real recorded OSCE interviews.


Rankings

Ranked by Open Medical Concept F1 · Lightweight Track (≤ 6B parameters, end-to-end only). Baseline results:

#  System                   Pipeline    Concept F1  C-Prec  C-Recall  ROUGE-2  ROUGE-3  ROUGE-L  Words
1  Qwen2.5-Omni-3B (code)   end-to-end  0.2604      0.2891  0.2450    0.0920   0.0344   0.1797   380

* Baseline evaluated on dev set.

Ranked by Open Medical Concept F1 · Heavyweight Track (≤ 36B parameters). Baseline results:

#  System                              Pipeline    Concept F1  C-Prec  C-Recall  ROUGE-2  ROUGE-3  ROUGE-L  Words
1  Qwen2.5-Omni-3B (code)              end-to-end  0.2604      0.2891  0.2450    0.0920   0.0344   0.1797   380
2  Qwen2.5-Omni-7B (code)              end-to-end  0.2572      0.3070  0.2302    0.0950   0.0343   0.1837   350
3  Qwen3-Omni-30B-A3B-Instruct (code)  end-to-end  0.1879      0.1879  0.1964    0.0558   0.0175   0.1433   351
4  Qwen3-Omni-30B-A3B-Thinking (code)  end-to-end  0.1645      0.1694  0.1675    0.0472   0.0123   0.1314   316

* Baseline evaluated on dev set. System description deadline: June 24, 2026.

⚠ Reference Topline: Cascade Systems (not ranked). Cascade systems are ineligible for competition ranking and serve as an upper-bound reference only.

System                                       Pipeline  Concept F1  C-Prec  C-Recall  ROUGE-2  ROUGE-3  ROUGE-L  Words
Whisper-large-v3 ASR + Qwen3-30B-A3B (code)  cascade   0.2860      0.2964  0.2838    0.1174   0.0425   0.2243   261
Qwen3-ASR-1.7B + Qwen3-30B-A3B (code)        cascade   0.2772      0.2881  0.2741    0.1092   0.0374   0.2169   256

Schedule

April 2, 2026
Training + Dev Data Release
Audio + reference SOAP notes published on Hugging Face
May 11, 2026 (extended one week from May 4, 2026)
Open-Source Proposals Deadline
Deadline for proposing additional open-source models or datasets
June 24, 2026
System Description Deadline
Submit system description to receive test audio
~July 1, 2026
Test SOAP Notes Due
Teams submit generated SOAP notes (~1 week after receiving test audio)
July 8, 2026
Challenge Paper Submission
System description papers due for IEEE SLT 2026 proceedings

Organizers

Assistant Professor of CSE
The Ohio State University
Ph.D. Student
The Ohio State University
Postdoctoral Research Associate
Carnegie Mellon University
Ph.D. Student
Carnegie Mellon University
Ph.D. Candidate
The Ohio State University
Nationwide Children's Hospital
Principal Research Scientist
Solventum / CMU LTI
Senior Applied Scientist
Amazon
Assistant Research Scientist
Johns Hopkins University (CLSP)
Assistant Research Professor
Johns Hopkins University Bloomberg School of Public Health
Contact
For general inquiries, rule clarifications, and team registration, reach the organizing committee at betrac@googlegroups.com.