TY - GEN
T1 - Synthesising Cross-Speaker Data for Low-Resource Pathological Speech Recognition with PEFT
AU - Mokgosi, Kesego
AU - Dadgar, Milad
AU - Ennis, Cathy
AU - Ross, Robert
N1 - Publisher Copyright: © The Author(s), under exclusive license to Springer Nature Switzerland AG 2026.
PY - 2026
Y1 - 2026
N2 - Dysarthric speech recognition is essential for enhancing communication and accessibility for individuals with speech impairments, yet its development is hindered by a scarcity of robust, speaker-specific datasets. This study explores low-resource dysarthric speech recognition through cross-speaker transfer using synthetic data and parameter-efficient fine-tuning (PEFT). We integrate SpeechT5 text-to-speech (TTS) synthesis with x-vector speaker embeddings to generate speaker-specific dysarthric speech, enabling model adaptation while preserving pathological speech characteristics such as prosodic irregularities. Experiments on the TORGO dataset show that mixed cross-synthetic data with LoRA fine-tuning achieves a WER of 0.17, representing a 71.7% improvement over the standard model (0.60 WER) without fine-tuning the TTS model. However, cross-dataset generalisation remains challenging, yielding higher WERs on MINDS-14 (4.69) and AMI (0.96–3.83) datasets. Whilst synthetic data enhances in-domain recognition, further research is needed to improve cross-dataset generalisation and speaker adaptation, particularly for low-resource pathological speech settings.
KW - Cross-Speaker Transfer
KW - Dysarthric Speech Recognition
KW - Parameter-Efficient Fine-Tuning
KW - Synthetic Data Generation
UR - https://www.scopus.com/pages/publications/105014464358
U2 - 10.1007/978-3-032-02548-7_16
DO - 10.1007/978-3-032-02548-7_16
M3 - Conference contribution
AN - SCOPUS:105014464358
SN - 9783032025470
T3 - Lecture Notes in Computer Science
SP - 182
EP - 193
BT - Text, Speech, and Dialogue - 28th International Conference, TSD 2025, Proceedings
A2 - Ekštein, Kamil
A2 - Konopík, Miloslav
A2 - Pražák, Ondrej
A2 - Pártl, František
PB - Springer Science and Business Media Deutschland GmbH
T2 - 28th International Conference on Text, Speech, and Dialogue, TSD 2025
Y2 - 25 August 2025 through 28 August 2025
ER -