Beyond Transcription Challenge
End-to-end audio models hallucinate nearly every clinical claim they generate. Can we fix that?
↗ Listen to a sample conversation from the dataset
About
Generating clinical notes directly from audio — skipping transcription — is faster and cheaper, and avoids cascading ASR errors. But today's end-to-end models hallucinate at alarming rates: on the Synth-DoPaCo dataset, 99–100% of their clinical claims are unsupported by the source audio, compared to just 21–23% for traditional transcribe-then-summarize pipelines. BeTraC is a shared evaluation challenge to close this gap — building end-to-end speech models that are actually faithful enough to trust in healthcare.
Tracks
Both tracks require open-weight models only and share the same constraint: no intermediate transcription.
Participation Rules
The full rules document includes approved model and dataset lists, detailed parameter counting rules (MoE, PLE/MatFormer, omni-model stripping), and submission requirements. Additional models or datasets may be proposed for inclusion by May 11, 2026 (extended by one week from May 4, 2026); contact betrac@googlegroups.com.
For questions or to register your team, contact betrac@googlegroups.com.
Dataset
Fully synthetic doctor-patient conversations generated with open-weight, permissively licensed models. Speaker identities are strictly disjoint across splits. Audio features two speakers, 66 ambient sound classes, room reverberation, and Opus compression artifacts.
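The acoustic conditions above (ambient noise mixing and room reverberation) can be simulated roughly as follows. This is a minimal NumPy-only sketch, not the challenge's actual data pipeline: the `augment` function, its SNR parameter, and the use of a plain convolution with a room impulse response are illustrative assumptions, and the Opus compression step is omitted (it would require an external codec library).

```python
import numpy as np

def augment(speech: np.ndarray, noise: np.ndarray, rir: np.ndarray,
            snr_db: float = 15.0) -> np.ndarray:
    """Hypothetical augmentation sketch: convolve speech with a room
    impulse response, then mix ambient noise at the requested SNR.
    All signals are mono float arrays at the same sample rate."""
    # Apply reverberation, truncated back to the original length.
    reverberant = np.convolve(speech, rir)[: len(speech)]
    # Tile/trim the noise to match the speech length.
    noise = np.resize(noise, len(reverberant))
    # Scale noise so the speech-to-noise power ratio equals snr_db.
    speech_power = np.mean(reverberant ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return reverberant + scale * noise
```

An Opus encode/decode pass (e.g. via an external codec binding) would be applied after this step to reproduce the compression artifacts mentioned above.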
🤗 View on Hugging Face
Evaluation Metrics
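The leaderboard below reports concept-level precision, recall, and F1 (Concept F1, C-Prec, C-Recall). As a minimal sketch of how such set-overlap scores are computed — assuming clinical concepts have already been extracted and normalized into sets, which in the real metric would likely involve a medical concept extractor — the arithmetic is:

```python
def concept_prf(predicted: set[str], reference: set[str]) -> tuple[float, float, float]:
    """Illustrative sketch (not the official scorer): precision, recall,
    and F1 between predicted and reference clinical concept sets."""
    if not predicted and not reference:
        return 1.0, 1.0, 1.0
    tp = len(predicted & reference)  # concepts supported by the reference
    prec = tp / len(predicted) if predicted else 0.0
    rec = tp / len(reference) if reference else 0.0
    f1 = 2 * prec * rec / (prec + rec) if (prec + rec) else 0.0
    return prec, rec, f1
```

For example, a note asserting {"fever", "cough"} against a reference of {"fever", "nausea"} scores 0.5 on all three measures; the unsupported-claim rates quoted above correspond to low concept precision.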
Post-Competition Analysis (Top 5 per Track)
Top 5 systems per track will undergo additional LLM-as-a-judge evaluation (Faithfulness, Coverage, Structure, Conciseness) and out-of-domain evaluation on real recorded OSCE interviews.
Leaderboard
| # | System | Pipeline | Concept F1 ↓ | C-Prec | C-Recall | ROUGE-2 | ROUGE-3 | ROUGE-L | Words |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Qwen2.5-Omni-3B ↗ code | end-to-end | 0.2604 | 0.2891 | 0.2450 | 0.0920 | 0.0344 | 0.1797 | 380 |
| 2 | Qwen2.5-Omni-7B ↗ code | end-to-end | 0.2572 | 0.3070 | 0.2302 | 0.0950 | 0.0343 | 0.1837 | 350 |
| 3 | Qwen3-Omni-30B-A3B-Instruct ↗ code | end-to-end | 0.1879 | 0.1879 | 0.1964 | 0.0558 | 0.0175 | 0.1433 | 351 |
| 4 | Qwen3-Omni-30B-A3B-Thinking ↗ code | end-to-end | 0.1645 | 0.1694 | 0.1675 | 0.0472 | 0.0123 | 0.1314 | 316 |
* Baselines evaluated on the dev set. System description deadline: June 24, 2026.
Cascade reference systems (transcribe-then-summarize; shown for comparison, not eligible for the tracks):
| System | Pipeline | Concept F1 | C-Prec | C-Recall | ROUGE-2 | ROUGE-3 | ROUGE-L | Words |
|---|---|---|---|---|---|---|---|---|
| Whisper-large-v3 ASR + Qwen3-30B-A3B ↗ code | cascade | 0.2860 | 0.2964 | 0.2838 | 0.1174 | 0.0425 | 0.2243 | 261 |
| Qwen3-ASR-1.7B + Qwen3-30B-A3B ↗ code | cascade | 0.2772 | 0.2881 | 0.2741 | 0.1092 | 0.0374 | 0.2169 | 256 |
Schedule
Team