Automatic forced alignment on child speech: Directions for improvement


Phonetic analysis is labor intensive, which limits the amount of data that can be considered. Recently, automated techniques (e.g., forced alignment based on automatic speech recognition, ASR) have emerged, allowing for much larger-scale analyses. For adult speech, forced alignment can be accurate even when the phonetic transcription is automatically generated, which makes large-scale phonetic studies feasible. However, such analyses remain difficult for children’s speech, on which ASR methods perform more poorly. The present study used a trainable forced aligner that performs well on adult speech to examine the effect of four factors on the alignment accuracy of child speech: (1) Corpus: elicited speech (multiple children) versus spontaneous speech (single child); (2) Pronunciation dictionary: standard adult versus customized; (3) Training data: adult lab speech, corpus-specific child speech, all child speech, or a combination of child and adult speech; (4) Segment type: voiceless stops, voiceless sibilants, and vowels. Automatic and manual segmentations were compared. Greater accuracy was observed with (1) elicited speech, (2) customized pronunciations, (3) training on child speech, and (4) stops. Attending to these factors increases the utility of forced alignment for analyzing children’s speech production, potentially allowing researchers to ask questions that would otherwise require weeks or months of manual segmentation.

In Proceedings of Meetings on Acoustics, IEEE.