SSDM: Scalable Speech Dysfluency Modeling

Abstract. Speech dysfluency modeling is the core module for spoken language learning and speech therapy. However, there are three challenges. First, current state-of-the-art solutions suffer from poor scalability. Second, there is no large-scale dysfluency corpus. Third, there is no effective learning framework. In other words, we are at a LeNet moment. In this paper, we propose SSDM: Scalable Speech Dysfluency Modeling, which (1) adopts articulatory gestures as scalable forced alignment; (2) introduces a connectionist subsequence aligner (CSA) to achieve dysfluency alignment; (3) introduces a large-scale simulated dysfluency corpus called Libri-Dys; and (4) develops an end-to-end system by leveraging the power of large language models (LLMs). We expect SSDM to serve as a standard in the area of dysfluency modeling.

Note that the terms "dysfluency" and "disfluency" are used interchangeably.
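To make the idea of subsequence alignment mentioned in the abstract more concrete, here is a minimal sketch that aligns a reference phoneme sequence to a spoken one with a plain longest-common-subsequence (LCS) dynamic program. This is not the CSA proposed in SSDM; it is a textbook stand-in that only illustrates why aligning the reference as a subsequence of the speech exposes repeated or missing phonemes. The function names and the ARPAbet-style phoneme sequences are hypothetical and chosen for illustration.

```python
from typing import List, Tuple


def lcs_align(reference: List[str], spoken: List[str]) -> List[Tuple[int, int]]:
    """Return index pairs (i, j) with reference[i] == spoken[j] along one LCS."""
    n, m = len(reference), len(spoken)
    # dp[i][j] holds the LCS length of reference[i:] and spoken[j:].
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if reference[i] == spoken[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    # Walk forward through the table to recover one optimal alignment.
    pairs = []
    i = j = 0
    while i < n and j < m:
        if reference[i] == spoken[j]:
            pairs.append((i, j))
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs


def find_mismatches(reference: List[str], spoken: List[str]):
    """Split unaligned tokens into missing (reference only) and extra (speech only)."""
    pairs = lcs_align(reference, spoken)
    matched_ref = {i for i, _ in pairs}
    matched_spk = {j for _, j in pairs}
    missing = [reference[i] for i in range(len(reference)) if i not in matched_ref]
    extra = [(j, spoken[j]) for j in range(len(spoken)) if j not in matched_spk]
    return missing, extra


# Hypothetical phoneme sequences: a repeated "B" at the start of "black".
print(find_mismatches(["B", "L", "AE", "K"], ["B", "B", "L", "AE", "K"]))
# Hypothetical phoneme sequences: a dropped "N" in "picnic".
print(find_mismatches(["P", "IH", "K", "N", "IH", "K"], ["P", "IH", "K", "IH", "K"]))
```

Unaligned speech tokens correspond to candidate insertions or repetitions, and unaligned reference tokens correspond to candidate deletions. A real system would also need timing information and a learned similarity between phonemes and audio frames, which this sketch leaves out.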



PPA Speech

Primary Progressive Aphasia (PPA) is characterized by progressive impairments in speech and language. PPA can be categorized into three main variants, each with distinct clinical features.
(1) Semantic Variant PPA (svPPA). Main Feature: Loss of word meaning.
(2) Logopenic Variant PPA (lvPPA). Main Feature: Impairment in word retrieval and sentence repetition.
(3) Nonfluent/Agrammatic Variant PPA (nfvPPA). Main Feature: Difficulty in forming grammatically correct sentences and impaired speech production.
We provide some examples for nfvPPA.

Audio | LTU-AS 13B | SALMONN 13B | ChatGPT | SSDM

Instruction: What do you think of the pronunciation?
Reference Text (Optional): Well he is nearly 93 years old
LTU-AS 13B: The pronunciation is clear and understandable, with no apparent errors or difficulties in articulating words.
SALMONN 13B: The articulation in this sentence is clear and easy to understand. The speaker is pronouncing each word clearly and distinctly, which makes it easier for the listener to follow along and understand the message being conveyed. Overall, the articulation in this sentence is excellent.
ChatGPT: The pronunciation of “Well, he is nearly 93 years old” sounds clear and easy to understand. There don’t seem to be any issues with it.
SSDM: The person is saying, "Well, he is nearly 93 years old." There is a stutter on the word "nearly" at 1.20 seconds.

Instruction: What do you think of the pronunciation?
Reference Text (Optional): He dresses himself in an old black frock coat
LTU-AS 13B: The pronunciation is clear and understandable, with no apparent errors or difficulties in articulating the words correctly. The speaker's voice quality is also good, which helps to convey the intended meaning effectively.
SALMONN 13B: The pronunciation of the words in the sentence is clear and easy to understand. The articulation of the consonants and vowels is accurate, and there are no mispronunciations. Overall, the pronunciation and articulation of the words in the sentence are good.
ChatGPT: The pronunciation of “He dresses himself in an old black frock coat” is clear and easy to understand. There don't seem to be any issues with it.
SSDM: The pronunciation is almost ok. There is repetition of phoneme "b" on word "black" at 3.30 seconds.

Instruction: What do you think of the pronunciation?
Reference Text (Optional): Usually several buttons missing
LTU-AS 13B: The speech is clear and easy to understand, but it may be difficult for some people with hearing impairments or accents.
SALMONN 13B: I think the pronunciation/articulation is a bit difficult to understand, especially for non-native speakers. The speaker seems to be emphasizing certain syllables and words, which can be distracting and take away from the overall clarity of the message. Additionally, the speaker seems to be speaking at a fast pace, which can make it difficult for listeners to follow along. Overall, I would say that the pronunciation/articulation could be improved to make the message easier to understand.
ChatGPT: It seems the speaker is saying “Usually, several modems may be the same”.
SSDM: There are multiple problems. The speaker should say "usually several buttons missing". There is an insertion of phoneme on the word "usually" at 1.00 seconds and there is a stutter on the word "buttons" at 3.48 seconds.

Instruction: What do you think of the pronunciation?
Reference Text (Optional): A long beard clings to his chin
LTU-AS 13B: The pronunciation is clear and understandable, with no apparent errors or mispronunciations
SALMONN 13B: As an AI language model, I don't have personal opinions or preferences. However, I can tell you that the pronunciation/articulation of the sentence appears to be correct based on the rules of the International Phonetic Alphabet (IPA).
ChatGPT: The pronunciation of “A long beard clings to his chin” is clear and easy to understand. There don’t seem to be any issues with it.
SSDM: The man is saying "A long beard clings to his chin". There is a stutter of phoneme "ch" on the word "chin" at 3.60 seconds.

Instruction: What do you think of the pronunciation?
Reference Text (Optional): Giving those who observe him a pronounced feeling of the utmost respect
LTU-AS 13B: The pronunciation is clear and articulate.
SALMONN 13B: I think the pronunciation/articulation is clear and easy to understand.
ChatGPT: The pronunciation of “giving those who observe him a pronounced feeling of the utmost respect” is clear and easy to understand. There don’t seem to be any issues with it.
SSDM: There are several pronunciation problems. First, there is a repetition of the word "those" at 1.36 seconds. Second, there is a repetition of the word "pronounced" at 6.10 seconds. Try to speak smoothly.

Dyslexia Speech

Dyslexia is a specific learning disability that primarily affects reading and related language-based processing skills. It is characterized by difficulties with accurate and/or fluent word recognition, poor spelling, and poor decoding abilities. These difficulties typically result from a deficit in the phonological component of language, which is often unexpected in relation to other cognitive abilities and the provision of effective classroom instruction. Dyslexia is most commonly observed in children.

Audio | LTU-AS 13B | SALMONN 13B | ChatGPT | SSDM

Instruction: What do you think of the pronunciation?
Reference Text (Optional): We have often urged him
LTU-AS 13B: The pronunciation is clear and understandable, but it could be improved by practicing more.
SALMONN 13B: The pronunciation is correct.
ChatGPT: It sounds like there was a bit of hesitation and the pronunciation of "often" came out as "afton". (Note the understanding is forced to stop due to ChatGPT's silence pause property after "often", which is not reasonable)
SSDM: There is a stutter at time 1.10s to 4.20s for the word "urged". It seems the speaker is saying "agreed" and "argued".

Instruction: What do you think of the pronunciation?
Reference Text (Optional): Giving those who observe him
LTU-AS 13B: The pronunciation is clear and understandable, which suggests that the speaker has good articulation and delivery.
SALMONN 13B: The pronunciation of "giving those who observe him" is correct. The word "giving" is pronounced with a hard "g" sound, and the word "those" is pronounced with a soft "th" sound. The word "observe" is pronounced with a short "ob" sound and a short "serve" sound. Overall, the pronunciation is clear and easy to understand.
ChatGPT: It sounds like the beginning of the phrase is correct, but it's cut off. (Note the understanding is forced to stop due to ChatGPT's silence pause property after "often", which is not reasonable)
SSDM: There is a problem with the pronunciation of the word "observe" from 2.3 seconds to 8.0 seconds. There are stutters and prolongations.

Instruction: What do you think of the pronunciation?
Reference Text (Optional): Usually several buttons missing
LTU-AS 13B: The pronunciation is clear and understandable, but it could be improved by adding more emphasis on certain syllables or using a different tone.
SALMONN 13B: I think the pronunciation of "usually several buttons missing" is clear and easy to understand.
ChatGPT: It sounds like "several" came out as "swavel". Try saying it slowly first.
SSDM: There are some phonetic errors for the word "several" at time 1.06s-1.46s. It seems the speaker is saying "swavel".

Artificial Simulated Speech

Samples from the Libri-Dys test set.

Audio | LTU-AS 13B | SALMONN 13B | ChatGPT | SSDM

Instruction: What do you think of the pronunciation?
Reference Text (Optional): Also, a popular contrivance whereby love making may be suspended but not stopped during the picnic season.
LTU-AS 13B: The speech is clear and easy to understand
SALMONN 13B: I think that the problems for pronunciation and articulation for each phoneme and word are significant. It can be challenging for learners to master the sounds and rhythms of the language, especially if they are not native speakers. However, with practice and exposure to the language, learners can improve their pronunciation and articulation over time.
ChatGPT: Your pronunciation is quite clear overall! The word "contrivance" might benefit from a bit more emphasis on the second syllable: "con-TRI-vance."
SSDM: You are attempting to say, 'Also, a popular contrivance whereby lovemaking may be suspended but not stopped during the picnic season.' There is a missing phoneme in the word 'picnic' at 7.8s to 8.0s.

Instruction: What do you think of the pronunciation?
Reference Text (Optional): Also, a popular contrivance whereby love making may be suspended but not stopped during the picnic season.
LTU-AS 13B: The speech is clear and easy to understand, with no noticeable accent or pronunciation issues. The speaker's articulation is precise.
SALMONN 13B: The pronunciation is clear and easy to understand.
ChatGPT: Your pronunciation is clear and overall flow is natural.
SSDM: Your articulation is almost perfect. Note that there is a prolongation at the word 'season' at 8.0s.

Instruction: What do you think of the pronunciation?
Reference Text (Optional): Also, a draft on futurity, sometimes honored, but generally extended.
LTU-AS 13B: The pronunciation and articulation are clear and understandable in this audio clip.
SALMONN 13B: The pronunciation is perfect.
ChatGPT: Your pronunciation is clear. For "TURDI," make sure to emphasize each syllable distinctly if it's an acronym: "TUR-DI."
SSDM: Your pronunciation is clear. There is a missing syllable in the word 'futurity' at 1.9s to 2.1s.

Instruction: What do you think of the pronunciation?
Reference Text (Optional): This is a LibriVox recording.
LTU-AS 13B: The pronunciation and articulation are clear, with no noticeable errors or hesitations in any of the words
SALMONN 13B: As an AI language model, I don't have personal opinions, but I can tell you that pronunciation and articulation can be challenging for learners of any language.
ChatGPT: Your pronunciation is almost perfect. The name "Leigh" should be pronounced with a long "e" sound, like "Lee."
SSDM: There is a significant prolongation of the word 'LibriVox' at 0.8s to 1.4s.

Instruction: What do you think of the pronunciation?
Reference Text (Optional): You took the thing down?
LTU-AS 13B: The pronunciation and articulation are clear and understandable.
SALMONN 13B: The pronunciation is clear and easy to understand.
ChatGPT: Perfect! The phrase "took the" should flow smoothly, almost like one word: "tookthe."
SSDM: It seems for the word 'you,' you are saying 'who' at 0.5s to 0.8s.