TY - GEN
T1 - Towards an Accurate Domain-Specific ASR
T2 - 28th International Conference on Text, Speech, and Dialogue, TSD 2025
AU - Danilevskyi, Mykhailo
AU - Perez-Tellez, Fernando
AU - Vasic, Jelena
N1 - Publisher Copyright:
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2026.
PY - 2026
Y1 - 2026
N2 - A known problem with Automatic Speech Recognition (ASR) systems is their struggle to recognise specific medical terms. Biopsy medical terms are not only rare, but also require knowing how to pronounce them correctly, which is challenging for both medical and non-medical people. Additionally, because of the sensitivity of the content and the preservation of privacy, it is preferable to utilise ASR systems that work on a hospital’s internal infrastructure. This study evaluated state-of-the-art open-source ASR systems using anonymised audio recordings of biopsy examinations conducted by pathologists. We assessed the performance of models suitable for deployment within hospital infrastructure, including various sizes of OpenAI’s Whisper models, and compared them with Meta’s Wav2vec 2.0 model. Additionally, we investigated two approaches for adapting these models under limited data conditions: providing contextual input to Whisper and fine-tuning both Whisper and Wav2vec 2.0. Finally, we examined the models’ ability to recognise medical terminology used in pathology reports, focusing on two categories: anatomical and pathology terms. Our findings indicate that providing contextual information to Whisper models significantly improves both the overall average word error rate (WER) and the term error rate (TER), with reductions ranging from 17% to 48% compared to default and fine-tuned models. The best overall performance was demonstrated by the Whisper large-v2 model, which achieved an average WER of 0.06.
AB - A known problem with Automatic Speech Recognition (ASR) systems is their struggle to recognise specific medical terms. Biopsy medical terms are not only rare, but also require knowing how to pronounce them correctly, which is challenging for both medical and non-medical people. Additionally, because of the sensitivity of the content and the preservation of privacy, it is preferable to utilise ASR systems that work on a hospital’s internal infrastructure. This study evaluated state-of-the-art open-source ASR systems using anonymised audio recordings of biopsy examinations conducted by pathologists. We assessed the performance of models suitable for deployment within hospital infrastructure, including various sizes of OpenAI’s Whisper models, and compared them with Meta’s Wav2vec 2.0 model. Additionally, we investigated two approaches for adapting these models under limited data conditions: providing contextual input to Whisper and fine-tuning both Whisper and Wav2vec 2.0. Finally, we examined the models’ ability to recognise medical terminology used in pathology reports, focusing on two categories: anatomical and pathology terms. Our findings indicate that providing contextual information to Whisper models significantly improves both the overall average word error rate (WER) and the term error rate (TER), with reductions ranging from 17% to 48% compared to default and fine-tuned models. The best overall performance was demonstrated by the Whisper large-v2 model, which achieved an average WER of 0.06.
KW - Automatic Speech Recognition
KW - Medical Terminology
KW - Pathology Reports
KW - Wav2vec
KW - Whisper
UR - https://www.scopus.com/pages/publications/105014352592
U2 - 10.1007/978-3-032-02548-7_26
DO - 10.1007/978-3-032-02548-7_26
M3 - Conference contribution
AN - SCOPUS:105014352592
SN - 9783032025470
T3 - Lecture Notes in Computer Science
SP - 309
EP - 318
BT - Text, Speech, and Dialogue - 28th International Conference, TSD 2025, Proceedings
A2 - Ekštein, Kamil
A2 - Konopík, Miloslav
A2 - Pražák, Ondřej
A2 - Pártl, František
PB - Springer Science and Business Media Deutschland GmbH
Y2 - 25 August 2025 through 28 August 2025
ER -