The model transcribes user speech into phoneme sequences, detects errors, scores the pronunciation with the proposed metrics, and generates articulator visualizations and feedback, for example: "relax the tongue and place it against the roof of the mouth, with the tip lightly touching the teeth."
Phonetic error detection, a core subtask of automatic pronunciation assessment, aims to identify pronunciation deviations at the fine-grained phoneme level. However, variability in both speech production and perception, including accents and dysfluencies, presents a significant challenge for phoneme recognition, and current models are unable to capture these discrepancies effectively. In this work, we propose a framework for verbatim phoneme recognition that employs multi-task training with novel phoneme similarity modeling. Unlike most previous studies, which focus on transcribing what the person is supposed to say, our method aims to transcribe what the person actually said. We develop and open-source VCTK-accent, a simulated dataset containing phonetic errors, and propose two novel metrics for assessing pronunciation differences. Our work provides a new benchmark for the phonetic error detection task.
We propose three methods for phoneme similarity modeling. The heuristic-based method derives from phoneme classification and definition. The articulatory and Sylber-based methods are both data-driven: the former emphasizes the acoustic aspect, while the latter focuses more on the syllabic aspect.
The TTS-based simulated dataset proposed in this work extends the VCTK corpus and contains vowel and consonant phonetic errors.
Audio | Ground Truth | Model Transcription | Scores |
---|---|---|---|
Historical | HH IH S T [AO->EY] R IH K AH L | HH IH S T EY R IH K AH L | PER: 0% WPER: 0% AER: 0% |
Suspended | S [AH->IY] S P EH N D IH D | S IY S P EH N D IH D | PER: 0% WPER: 0% AER: 0% |
Suspension | S AH S P EH N SH [AH->IY] N | S AH S P EH N SH IY N | PER: 0% WPER: 0% AER: 0% |
Conflict | K [AA->IH] N F L IH K T | K IH N F L IH K T | PER: 0% WPER: 0% AER: 0% |
Consistent | K AH N S IH S T [AH->IY] N T | K AH N S IH S T IY N T | PER: 0% WPER: 0% AER: 0% |
Rooms | R UW M [Z->DH] | R UW M DH | PER: 0% WPER: 0% AER: 0% |
Graves | [G->P] R EY IH V Z | P R EY IH V Z | PER: 0% WPER: 0% AER: 0% |
Defences | [D->R] IH F EH N S IH Z | R IH F EH N S IH Z | PER: 0% WPER: 0% AER: 0% |
Removed | [R->D] IH M UW V D | D IH M UW V D | PER: 0% WPER: 0% AER: 0% |
Arises | ER R AA IH [Z->DH] IH Z | ER R AA IH DH IH Z | PER: 0% WPER: 0% AER: 0% |
We provide samples from MultiPA and Speechocean762 here.
Audio | Ground Truth | Model Transcription | Scores |
---|---|---|---|
Meaningful | M IY N IH (NG) F AH L | M IY N IY T F AH L | PER: 28.57% WPER: 17.20% AER: 14.59% |
Farmer | F AA (R) M ER | F AA M ER | PER: 0% WPER: 0% AER: 0% |
Likes | L [AY->AE] K S | L AE K Z | PER: 25% WPER: 1.54% AER: 4.32% |
Mountains | M [AW->AA] N T AH N Z | M AA N P AH N S | PER: 28.57% WPER: 8.69% AER: 7.84% |
Pig | P IH [G->K] | P IH K | PER: 0% WPER: 0% AER: 0% |
Welcome | W [EH->IH] L K AH M | W IH L K AH M | PER: 0% WPER: 0% AER: 0% |
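The PER column in the tables above can be reproduced with a short sketch. It rests on two assumptions that the page does not state explicitly. First, in the Ground Truth column, `[A->B]` is read as "A was pronounced as B" and `(A)` as "A was dropped", so the verbatim reference keeps B and omits A. Second, PER is taken to be the standard phoneme error rate: Levenshtein distance between the verbatim reference and the model transcription, divided by the reference length. The function names are hypothetical, and WPER and AER are not covered because their definitions are not given here.

```python
import re


def verbatim_sequence(annotated: str) -> list[str]:
    """Return the phonemes actually spoken, given an annotated ground truth.

    Assumed notation (inferred from the sample tables, not confirmed):
      [A->B]  A was substituted by B  -> keep B
      (A)     A was deleted           -> keep nothing
    """
    phonemes = []
    for token in annotated.split():
        substitution = re.fullmatch(r"\[(\w+)->(\w+)\]", token)
        if substitution:
            phonemes.append(substitution.group(2))
        elif re.fullmatch(r"\(\w+\)", token):
            continue  # deleted phoneme: the speaker never produced it
        else:
            phonemes.append(token)
    return phonemes


def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance with unit cost for each edit operation."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def per(annotated_ground_truth: str, transcription: str) -> float:
    """Phoneme error rate in percent (assumed standard definition)."""
    ref = verbatim_sequence(annotated_ground_truth)
    hyp = transcription.split()
    return 100.0 * edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # Rows taken from the MultiPA / Speechocean762 sample table.
    print(round(per("M IY N IH (NG) F AH L", "M IY N IY T F AH L"), 2))
    print(round(per("L [AY->AE] K S", "L AE K Z"), 2))
```

Under these assumptions the sketch gives 28.57 for "Meaningful" and 25.0 for "Likes", which agrees with the PER values listed in the table.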
Pipeline of phoneme recognition and error detection: in the word "Please", the phonetic error "IY" -> "EY" is detected with a similarity score of 79%, and the articulatory feedback is to move the tongue towards the front of the mouth.