Towards Accurate Phonetic Error Detection
through Phoneme Similarity Modeling

Demo

The model transcribes user speech into phoneme sequences, detects errors, scores the pronunciation with the proposed metrics, and generates articulator visualizations and feedback, for example: "relax the tongue and place it against the roof of the mouth, with the tip lightly touching the teeth."

Abstract

Phonetic error detection, a core subtask of automatic pronunciation assessment, aims to identify pronunciation deviations at the fine-grained phoneme level. However, variability in both speech production and perception, including accents and dysfluencies, presents a significant challenge for phoneme recognition. Current models are unable to capture these discrepancies effectively. In this work, we propose a framework for verbatim phoneme recognition that employs multi-task training with novel phoneme similarity modeling. Unlike most previous studies, which focus on transcribing what the person is supposed to say, our method aims to transcribe what the person actually said. We develop VCTK-accent, an open-source simulated dataset that contains phonetic errors, and propose two novel metrics for assessing pronunciation differences. Our work provides a new benchmark for the phonetic error detection task.

Phoneme Similarity Modeling

We propose three methods for phoneme similarity modeling. The heuristic-based method is derived from phoneme classification and definition. The articulatory and Sylber-based methods are both data-driven: the former emphasizes the acoustic aspect, while the latter focuses more on the syllabic aspect.

Verbatim Phoneme Recognition

VCTK-accent

VCTK-accent is the TTS-based simulated dataset proposed in this work. It extends the VCTK corpus and contains vowel and consonant phonetic errors.

| Audio | Ground Truth | Model Transcription | Scores |
|---|---|---|---|
| Historical | HH IH S T [AO->EY] R IH K AH L | HH IH S T EY R IH K AH L | PER: 0%, WPER: 0%, AER: 0% |
| Suspended | S [AH->IY] S P EH N D IH D | S IY S P EH N D IH D | PER: 0%, WPER: 0%, AER: 0% |
| Suspension | S AH S P EH N SH [AH->IY] N | S AH S P EH N SH IY N | PER: 0%, WPER: 0%, AER: 0% |
| Conflict | K [AA->IH] N F L IH K T | K IH N F L IH K T | PER: 0%, WPER: 0%, AER: 0% |
| Consistent | K AH N S IH S T [AH->IY] N T | K AH N S IH S T IY N T | PER: 0%, WPER: 0%, AER: 0% |
| Rooms | R UW M [Z->DH] | R UW M DH | PER: 0%, WPER: 0%, AER: 0% |
| Graves | [G->P] R EY IH V Z | P R EY IH V Z | PER: 0%, WPER: 0%, AER: 0% |
| Defences | [D->R] IH F EH N S IH Z | R IH F EH N S IH Z | PER: 0%, WPER: 0%, AER: 0% |
| Removed | [R->D] IH M UW V D | D IH M UW V D | PER: 0%, WPER: 0%, AER: 0% |
| Arises | ER R AA IH [Z->DH] IH Z | ER R AA IH DH IH Z | PER: 0%, WPER: 0%, AER: 0% |

Real Speech

We provide real-speech samples from MultiPA and Speechocean762 below.

| Audio | Ground Truth | Model Transcription | Scores |
|---|---|---|---|
| Meaningful | M IY N IH (NG) F AH L | M IY N IY T F AH L | PER: 28.57%, WPER: 17.20%, AER: 14.59% |
| Farmer | F AA (R) M ER | F AA M ER | PER: 0%, WPER: 0%, AER: 0% |
| Likes | L [AY->AE] K S | L AE K Z | PER: 25%, WPER: 1.54%, AER: 4.32% |
| Mountains | M [AW->AA] N T AH N Z | M AA N P AH N S | PER: 28.57%, WPER: 8.69%, AER: 7.84% |
| Pig | P IH [G->K] | P IH K | PER: 0%, WPER: 0%, AER: 0% |
| Welcome | W [EH->IH] L K AH M | W IH L K AH M | PER: 0%, WPER: 0%, AER: 0% |
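
The PER values listed above are consistent with the standard phoneme error rate: the edit distance between the phonemes actually spoken and the model transcription, divided by the number of spoken phonemes. For "Meaningful", that is 2 edits over 7 phonemes, or 28.57%. The following is a minimal sketch of that calculation, not the authors' evaluation code. It assumes that `[A->B]` in the ground truth means A was pronounced as B and that `(X)` means X was omitted; the function names are hypothetical. WPER and AER are the metrics proposed in this work and are not reproduced here.

```python
import re


def realized_phonemes(annotation):
    """Return the phonemes actually spoken, given an annotated ground truth.

    Assumed notation: "[A->B]" = A pronounced as B, "(X)" = X omitted.
    """
    phones = []
    for token in annotation.split():
        sub = re.fullmatch(r"\[(\w+)->(\w+)\]", token)
        if sub:
            phones.append(sub.group(2))   # keep the phoneme that was spoken
        elif re.fullmatch(r"\(\w+\)", token):
            continue                      # omitted phoneme: nothing was spoken
        else:
            phones.append(token)
    return phones


def edit_distance(ref, hyp):
    """Levenshtein distance between two phoneme lists (unit costs)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def per(annotation, transcription):
    """Phoneme error rate in percent, relative to the spoken phonemes."""
    ref = realized_phonemes(annotation)
    hyp = transcription.split()
    return 100.0 * edit_distance(ref, hyp) / len(ref)


print(round(per("M IY N IH (NG) F AH L", "M IY N IY T F AH L"), 2))
print(round(per("L [AY->AE] K S", "L AE K Z"), 2))
```

Under these assumptions the sketch reproduces the PER entries for "Meaningful" and "Likes". A similarity-weighted variant would replace the unit substitution cost with a cost derived from a phoneme similarity model.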

Pipeline

Pipeline of phoneme recognition and error detection. In this example, the phonetic error "IY" -> "EY" is detected in the word "Please" with a similarity score of 79%, and the articulatory feedback is to move the tongue towards the front of the mouth.
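
The error detection step in the pipeline can be illustrated by aligning the canonical phoneme sequence of a word with the recognized sequence and reporting where they differ. The sketch below is an illustration under stated assumptions, not the authors' implementation: it uses `difflib.SequenceMatcher` from the Python standard library as a stand-in aligner (it is not guaranteed to give a minimum-edit-distance alignment), the function name `detect_errors` is hypothetical, and the similarity score and articulatory feedback are outside its scope.

```python
from difflib import SequenceMatcher


def detect_errors(canonical, recognized):
    """List the differences between canonical and recognized phonemes.

    Each entry is (operation, canonical_phonemes, recognized_phonemes),
    where operation is "replace", "delete", or "insert".
    """
    matcher = SequenceMatcher(a=canonical, b=recognized, autojunk=False)
    errors = []
    for op, a_start, a_end, b_start, b_end in matcher.get_opcodes():
        if op != "equal":
            errors.append((op, canonical[a_start:a_end], recognized[b_start:b_end]))
    return errors


# "Please" in ARPAbet, with the vowel error from the pipeline example.
print(detect_errors("P L IY Z".split(), "P L EY Z".split()))
```

For the "Please" example, the only reported difference is the replacement of "IY" by "EY", which is the error a downstream step would then score and turn into feedback.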