Abstract
In this study we propose a modification of a standard text-to-speech alignment approach that adapts it to the alignment of lyrics and singing voice. We model phoneme durations by means of a duration-explicit hidden Markov model (DHMM) phonetic recognizer based on mel-frequency cepstral coefficients (MFCCs). The phoneme durations are set empirically in a probabilistic way, based on prior knowledge about the lyrics structure and metric principles specific to the Beijing opera music tradition. Phoneme models are GMMs trained directly on a small corpus of annotated singing voice. The alignment is evaluated on a cappella material from Beijing opera, which is characterized by particularly long syllable durations. Results show that incorporating music-specific knowledge yields very high alignment accuracy, significantly outperforming a baseline HMM-based approach.
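
The idea of duration-explicit alignment described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `gaussian_log_dur` and `dhmm_align`, the unnormalised Gaussian duration prior, and the toy inputs are all assumptions made for illustration. The sketch takes per-frame phoneme log-likelihoods (in the paper these would come from GMMs over MFCCs) and a per-phoneme duration prior, and finds the left-to-right segmentation that maximises their combined score.

```python
import numpy as np


def gaussian_log_dur(means, std, max_dur):
    """Hypothetical duration prior: unnormalised Gaussian log-score per phoneme.

    Returns an array of shape (N, max_dur + 1); column d holds the log-score
    of lasting d frames. Column 0 is -inf because a phoneme needs >= 1 frame.
    """
    d = np.arange(max_dur + 1, dtype=float)
    means = np.asarray(means, dtype=float)
    log_dur = -0.5 * ((d[None, :] - means[:, None]) / std) ** 2
    log_dur[:, 0] = -np.inf
    return log_dur


def dhmm_align(loglik, log_dur):
    """Duration-explicit Viterbi forced alignment (illustrative sketch).

    loglik:  (T, N) log-likelihood of frame t under phoneme n.
    log_dur: (N, D + 1) log-score of phoneme n lasting d frames.
    Returns a list of (start, end) frame indices per phoneme, end exclusive.
    """
    T, N = loglik.shape
    D = log_dur.shape[1] - 1
    # cum[t, n] = sum of loglik[:t, n], so any segment score is a difference.
    cum = np.vstack([np.zeros((1, N)), np.cumsum(loglik, axis=0)])
    # delta[n, t]: best score with the first n phonemes covering frames [0, t).
    delta = np.full((N + 1, T + 1), -np.inf)
    back = np.zeros((N + 1, T + 1), dtype=int)
    delta[0, 0] = 0.0
    for n in range(1, N + 1):
        for t in range(n, T + 1):
            for d in range(1, min(D, t) + 1):
                segment = cum[t, n - 1] - cum[t - d, n - 1]
                score = delta[n - 1, t - d] + log_dur[n - 1, d] + segment
                if score > delta[n, t]:
                    delta[n, t] = score
                    back[n, t] = d
    if not np.isfinite(delta[N, T]):
        raise ValueError("no valid segmentation for the given durations")
    # Backtrack from the last phoneme ending at the last frame.
    bounds = []
    t = T
    for n in range(N, 0, -1):
        d = int(back[n, t])
        bounds.append((t - d, t))
        t -= d
    return bounds[::-1]


# Toy usage: two phonemes, ten frames, uninformative acoustics,
# so the (assumed) duration prior alone decides the boundary.
toy_loglik = np.zeros((10, 2))
toy_prior = gaussian_log_dur(means=[3, 7], std=1.0, max_dur=10)
toy_bounds = dhmm_align(toy_loglik, toy_prior)
```

In this sketch the duration prior is the only place where music-specific knowledge would enter: expected durations derived from lyrics structure and metric position would replace the toy `means` values.
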
| Original language | English |
|---|---|
| Publication status | Published - 2016 |
| Externally published | Yes |
| Event | 6th International Workshop on Folk Music Analysis - Dublin, Ireland; Duration: 15 Jun 2016 → 17 Jun 2016 |
Conference
| Conference | 6th International Workshop on Folk Music Analysis |
|---|---|
| Country/Territory | Ireland |
| City | Dublin |
| Period | 15/06/16 → 17/06/16 |
Keywords
- text-to-speech alignment
- lyrics and singing voice
- duration-explicit hidden Markov model
- phoneme durations
- Beijing opera
- GMMs
- alignment accuracy