Abstract
The advancement of information and communication technologies has catalyzed the development of computer-assisted pronunciation training (CAPT) as an active research domain with evolving focal points. To facilitate a comprehensive understanding of how technology has been employed to aid in the teaching and learning of pronunciation, as well as to identify pressing research issues in this domain, the present paper aims to synthesize existing studies on computer-assisted pronunciation. The present study employed a corpus of 403 article abstracts. These abstracts were sourced from a total of 153 peer-reviewed journals, drawing data from both the Scopus and Dimensions databases. The corpus was subsequently analyzed to uncover trends, patterns, and themes pertinent to the research objectives. Various structural and temporal metrics were also scrutinized. The findings reveal that 14 clusters represent the most concentrated areas of research within the CAPT landscape. Themes associated with technology-assisted pronunciation, such as visual strategies, segmental production, and pedagogical approaches to mastering pronunciation, emerged as dominant foci. Additionally, the results spotlighted several influential articles characterized by a high burst in citations, as well as leading journals in the fields of language learning, educational technology, and speech recognition that frequently co-cited these articles.
Introduction
Mastering the pronunciation of a second language can be a formidable challenge for adult learners, primarily due to acquisition decline with age (Neri et al., 2008) and the varying phonetic systems across languages. One effective solution to improve foreign language pronunciation is immersing oneself in target language speaking environments, where learners can engage with native speakers and absorb their accents and intonations through extensive exposure (Sousa, 2010). However, thanks to technological advancements, an alternative approach has emerged. Technology offers a virtual atmosphere that enables learners to interact with native speakers without the need to travel to foreign countries. Through these virtual platforms, learners can simulate conversations with native speakers, gaining valuable insights into native-like pronunciation. Moreover, a plethora of software and apps now exists to assist learners in practicing pronunciation, augmenting learners’ motivation, reducing their anxiety, and receiving automated and individualized feedback—a unique feature not easily attainable through face-to-face interactions in a natural environment (Rogerson-Revell, 2021). The incorporation of technology into language education has fundamentally altered the landscape of pronunciation training. Learners are now able to take advantage of virtual immersion environments and deploy specialized tools to improve their pronunciation. This technological aid serves to narrow the discrepancy between learners’ speech patterns and those of native speakers (Blake, 2013).
An increasing number of studies have reviewed the implementation of CAPT in learning pronunciation. In a meta-analysis of 20 studies, Author 2 and a colleague (2019) compared the outcomes of students learning foreign language (FL) pronunciation using a computer with those of students learning pronunciation through conventional teaching techniques. The meta-analysis revealed a medium effect size of computer-assisted training on FL pronunciation (d = 0.68). They found that CAPT is equally helpful for adult and young learners, that it is more effective for beginner and intermediate learners than for advanced learners, and that college students benefit more from CAPT than do students in schools. Similarly, Tseng et al. (2022) conducted a meta-analysis of the impact of mobile-assisted language learning (MALL) on L2 pronunciation learning. They analyzed 13 studies published between 2009 and 2020 and found a medium effect size (d = 0.66). Additionally, Rogerson-Revell (2021) provided a summary of the current issues and directions of CAPT. The review proposed closer collaboration between pedagogical purposes, technical functioning, and design, and advised testing the efficacy of CAPT tools and systems as well as providing precise and personalized automated feedback on pronunciation for both learning and assessment. In addition, Amrate and Tsai (2024) conducted a systematic review of empirical studies only, exploring the trends of CAPT research based on the pedagogy of second language pronunciation instruction and assessment. The findings revealed a predominant focus on adult English learners, with an emphasis on the production of segmental features rather than suprasegmental features. The training methods used in these studies were mainly traditional drilling methods through listen-and-repeat and read-aloud activities.
In addition to these reviews, Authors (2024) conducted a meta-analysis of 18 experimental studies examining the effect of High-Variability Phonetic Training (HVPT), a CAPT technique, on L2 pronunciation. The results revealed that HVPT had a medium overall effect size on L2 pronunciation. For the segmental aspects of pronunciation, the effect was large when utilized by advanced learners, while beginners showed medium effect sizes.
The above-mentioned reviews have covered many aspects of CAPT. However, they are limited in two respects. First, they have not tracked the developmental trends of CAPT issues. Second, most reviews analyzed a relatively small sample of articles. As a result, these reviews do not provide a comprehensive analysis of the general CAPT field, and a large-scale analysis using thematic mapping is required. The vast CAPT literature has rarely been synthesized from a scientometric perspective. Scientometrics is a field of study that analyzes the literature of a particular field or sub-field to uncover patterns, trends, and relationships within the scientific literature, aiming to provide insights into the growth and development of scientific disciplines (Correia et al., 2018). Therefore, the aim of this paper is to synthesize the research conducted on CAPT in order to identify the emerging research topics related to the technology employed and the specific aspects of pronunciation addressed in these studies.
Computer-assisted pronunciation training (CAPT)
Technology has facilitated greater access to educational materials, both physically and in cognitive and psychological dimensions (Pennington and Rogerson-Revell, 2019). Since their introduction to educational settings in the 1960s, computers have undergone several notable shifts in their application to teaching and learning (Warschauer and Healey, 1998). The initial software programs specifically designed for English pronunciation emerged in the early 1970s (Kalikow and Swets, 1972). However, these systems experienced a qualitative leap with the advent of multimedia technologies. Various educational programs have been developed over the years, premised on the understanding that language acquisition is a longitudinal process. These tools are designed to address different language skills at diverse proficiency levels using multiple pedagogical approaches.
The incorporation of computers in language education serves to enhance learner autonomy and elevate self-esteem. CAPT, in particular, fosters an environment devoid of stress, allowing students to engage with learning materials at their individual pace, a benefit emphasized by Neri et al. (2002). For example, Chen (2014) examined the effectiveness of MyET, an online pronunciation software, among Taiwanese college students. The study revealed not only improvements in pronunciation but also a growing preference among students for software-assisted instruction. Similarly, Gorjian et al. (2013) demonstrated the superiority of CALL approaches over traditional methods in teaching prosodic features of English, such as stress and intonation, using Praat software. This was corroborated by AbuSeileek (2007), who found an improvement in students’ abilities to perceive and produce stress patterns through the use of Mouton Interactive Introduction to Phonetics and Phonology software. However, it should be noted that the efficacy of CALL in pronunciation training is not universally affirmed. Liu (2008) found negligible differences in pronunciation improvement between traditional methods and Pronunciation Power 2, a software tool, among ESL students in the United States. This study replicated Seferoğlu (2005), who examined the same software but arrived at different conclusions regarding its effectiveness. CAPT has been examined based on a wide range of topics. These topics cover the tools used in teaching and learning pronunciation, such as web-based resources and ASR. These topics also include visual feedback, segmental aspects of pronunciation, and teaching pronunciation for foreign learners.
Visual feedback
Visual feedback is one of the advantages that CAPT exercises bring to language learning (Garcia et al., 2018). In pronunciation instruction, visual feedback is essential for improving learners’ understanding and refinement of speech patterns. To close the gap between auditory perception and articulatory precision, visual feedback shows learners their speech production in graphical form. Visual feedback in pronunciation training serves a dual purpose: it not only offers learners a graphical depiction of their phonetic output but also enables a comparison with a native speaker’s model (Olson, 2022). It thus gives learners a multisensory method for improving their spoken language competency and contributes substantially to the development of precise pronunciation abilities. Several studies (e.g., Hirata, 2004; Motohashi-Saigo and Hardison, 2009; Olson, 2022; Patten and Edmonds, 2015) found that visual feedback is a promising method for pronunciation training.
Automatic speech recognition
One form of CAPT systems that allows students to freely practice on any topic is automatic speech recognition (ASR) technology. ASR evaluates speech input by contrasting it with a native speaker model that was built from a database containing a sizable amount of native speaker speech samples. ASR systems employ advanced algorithms to convert spoken words into written text, facilitating seamless voice commands, transcription, and language processing. ASR is frequently used to help students with their pronunciation. ASR increases learners’ pronunciation quality, accuracy of spoken grammar structures, and command of speech acts, according to studies (Dai and Wu, 2023).
One recent development of ASR is the use of mobile-based dictation ASR. Different from computer-based ASR systems, mobile-based dictation ASR does not generate an assessment score or highlight mispronounced syllables (Dai and Wu, 2023). Studies (e.g., Liakin et al., 2015; Mroz, 2018) found that learners benefited from using mobile-based ASR in improving pronunciation. Similarly, ASR-based websites are reported to successfully help learners enhance their pronunciation (Bashori et al., 2022). Although ASR has benefits for learning pronunciation, the accuracy and adaptability of ASR are influenced by factors such as accent, background noise, and linguistic diversity. In addition, ASR has some restrictions and drawbacks, including the difficulty in use experienced by learners and the requirement for more precise dictation transcripts for non-native speech (McCrocklin, 2012).
Segmental pronunciation
Pronunciation encompasses various elements that collectively contribute to effective communication. Two broad categories that encapsulate these elements are segmental and suprasegmental aspects. Segmental pronunciation pertains to the accurate production of individual speech sounds, which include vowels and consonants. The mastery of segmental features is crucial for conveying meaning and differentiating words in a language. This involves understanding and properly articulating the specific sound patterns that distinguish words from one another. In contrast, suprasegmental elements encompass features that extend beyond individual speech sounds and involve broader patterns of speech rhythm, stress, tone, and intonation. These suprasegmental aspects influence the overall prosody of speech.
The debate over whether to prioritize segmental or suprasegmental elements when teaching pronunciation has drawn attention from scholars and educators. Some researchers, such as Pennington and Richards (1986) and Wong (1987), argued in favor of giving more prominence to suprasegmental aspects during pronunciation instruction. The reasoning behind this is that suprasegmental features greatly influence the naturalness of language use. However, there is a recognized challenge associated with teaching and acquiring suprasegmental elements, particularly for second language learners. Learning these features involves becoming attuned to the complex patterns of stress, intonation, and rhythm that are unique to the target language. Since these patterns might not align with those of the learners’ native language, recognition and comprehension can be difficult. Furthermore, earlier studies (e.g., Anderson‐Hsieh et al., 1992; Chun, 2002) have predominantly focused on suprasegmental aspects, with an emphasis on intonation contours. Research has indicated that improving L2 suprasegmental pronunciation, especially through CAPT, can yield positive results in terms of intelligibility and communication enhancement. While there has been substantial research on the benefits of CAPT for improving suprasegmental pronunciation, the emphasis on segmental instruction has sometimes been comparatively underexplored (Olson, 2014). To achieve comprehensive and effective language articulation, it is important to strike a balance between both segmental and suprasegmental elements.
Foreign language pronunciation
Teaching pronunciation to foreign language learners is a fundamental and crucial aspect of language learning. The ability to pronounce words accurately not only impacts effective communication but also influences how learners are perceived by native speakers. However, the approach to teaching pronunciation has evolved significantly over the last two decades. Traditionally, there was a strong emphasis on achieving native-like accents in language learners. The belief was that a flawless native accent indicated language mastery. However, this perspective has shifted as pronunciation scholars and educators have recognized that aiming for native-like accents can be unrealistic for many learners and can lead to unnecessary frustration (Derwing and Munro, 2015; Levis, 2005). They propose that the primary objectives of teaching pronunciation should be intelligibility and comprehensibility. A notable development in this area is the notion that achieving a perfect native accent is not the only measure of success in language learning. Instead, intelligibility and comprehensibility are the benchmarks for effective communication.
Scientometrics
In recent years, there has been a growing trend across various disciplines to synthesize existing literature in order to identify key research hotspots and emerging trends over time. This synthesis aims to provide a comprehensive overview for both academia and practitioners, facilitating a nuanced comprehension of the evolving landscape within the CAPT domain, its collaborative network, and its primary research goals. The findings of this synthesis distinctly highlight dominant research themes and existing gaps present within the corpus of literature, delineating crucial focal points that hold significance for forthcoming CAPT endeavors, both in terms of practice and research. These insights bear the potential to serve as guiding touchstones for scholars and researchers when delineating the specific arenas warranting heightened investigation in future initiatives. Moreover, the synthesized overview equips scholars with a valuable reference, enabling them to ascertain optimal trajectories for their scholarly pursuits.
The literature review conducted in this study lays the foundation for addressing the following research questions:
RQ1: What are the emerging themes and clusters of CAPT research?
RQ2: What are the most influential research articles, co-cited references, and co-cited sources of publication in CAPT research?
Methods
Data source
One limitation of the previous bibliometric studies is the shortage of resources. To address that and to gain a fine-grained concept of CAPT in the SLA literature, we use the Scopus database to cover a broader relevant CAPT literature because Scopus covers thousands of peer-reviewed journals (Lim and Aryadoust, 2021; Author 1 and 3, 2024; Authors 1 and 3, 2023). To get more retrieved data, we also use the Dimensions database to cover research articles in newly indexed journals that are not covered by Scopus. For example, some time spans for non-core CALL journals such as JALT CALL Journal (2007–2014), International Journal of Computer Assisted Language Learning and Teaching (2011–2016), and Journal of Second Language Pronunciation (2015–2018) were not covered by Scopus. Another reason is that Scopus and Dimensions provide downloadable data in formats (RIS and CSV) that are compatible with software VOSviewer, CiteSpace, and BibExcel used to analyze the retrieved data.
Key terms
The following key terms were placed in Scopus based on a recent bibliometric analysis and a meta-analysis study with slight modifications: TITLE-ABS-KEY(Computer Assisted Pronunciation Training or CAPT) OR TITLE-ABS-KEY(pronunciation AND technology) OR TITLE-ABS-KEY(automatic speech recognition or ASR) OR TITLE-ABS-KEY(High Variability Phonetic Training or HVPT) OR TITLE-ABS-KEY(computer AND pronunciation) OR TITLE-ABS-KEY(speech-to-text) OR TITLE-ABS-KEY(mobile assisted pronunciation training) OR TITLE-ABS-KEY(Web-based pronunciation) OR TITLE-ABS-KEY(multimedia AND pronunciation) OR TITLE-ABS-KEY(games AND pronunciation) AND (LIMIT-TO(SRCTYPE,“j”)) AND (LIMIT-TO(DOCTYPE,“ar”)) AND (LIMIT-TO(SUBJAREA,“SOCI”) OR LIMIT-TO(SUBJAREA,“ARTS”) OR LIMIT-TO(SUBJAREA,“PSYC”) OR LIMIT-TO(SUBJAREA,“MULT”)) AND (LIMIT-TO(LANGUAGE,“English”))
The search was conducted on the “Document” type, which searches titles, yielding 26,669 documents. The search was filtered through the document type of “article” published in English from 1977 to 2022. Our justification is that our focus is on the trending issues being investigated, which could be collected from research articles. The articles were also refined to cover disciplines that publish target journals whose scope is related to our investigated issues. Therefore, the results were further refined to cover fields of Arts and Humanities, Social Sciences, Psychology, and Multidisciplinary, resulting in 2153 articles. Then, these obtained articles were filtered to match the research objectives.
Data filtering
The abstracts of the retrieved articles were screened to determine whether the studies integrated technology to aid pronunciation learning and teaching. Three raters checked the scope of each article and coded it: CAPT-specific articles were coded 1, and articles that were not CAPT-specific were coded 0 and excluded. Where an abstract did not state explicitly whether CAPT was used, the full article was checked. If doubts persisted, the raters discussed the case until they reached a decision. Inter-rater reliability was high (kappa = 0.95). This process resulted in the exclusion of 1750 articles that did not meet our inclusion criteria, leaving 403 articles eligible for analysis. These articles were published across 153 different journals, indicating the breadth of research in the field of CAPT.
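As an illustration of how agreement among three raters assigning binary codes can be quantified, the following minimal sketch computes Fleiss' kappa. The text does not state which kappa variant was used, so the choice of Fleiss' kappa is an assumption, and the function name and toy ratings are hypothetical rather than the study's data.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each rated by the same number of raters.

    `ratings` is a list of tuples such as (1, 1, 0): one code per rater.
    Assumption: this is only one possible agreement statistic; the source
    does not specify which kappa variant was computed.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    categories = sorted({code for row in ratings for code in row})
    counts = [Counter(row) for row in ratings]

    # Observed agreement per item, then averaged over items.
    per_item = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement from the overall share of each category.
    shares = [
        sum(c[cat] for c in counts) / (n_items * n_raters) for cat in categories
    ]
    p_expected = sum(s * s for s in shares)

    if p_expected == 1.0:
        # Every rating falls in one category; kappa is undefined, return 1.0
        # by convention because the raters never disagreed.
        return 1.0
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical toy data: 1 = CAPT-specific, 0 = not CAPT-specific.
toy_ratings = [(1, 1, 1), (0, 0, 0), (1, 1, 0), (0, 0, 0)]
print(round(fleiss_kappa(toy_ratings), 3))  # 0.657
```

A single disagreement among four items already lowers kappa considerably in this toy case, which is why a value of 0.95 on a sample of more than two thousand abstracts indicates very consistent coding.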
Screening spelling discrepancies
To ensure the consistency and validity of the data, several verification and standardization steps were taken. CiteSpace was utilized to retrieve data and identify misspelled authors’ names, study titles, publication sources, and issue and volume numbers (see Supplementary Materials A). The data obtained from CiteSpace was then organized by authors’ names and imported into an Excel sheet so that errors could be easily identified and corrected in the original CSV and RIS files. Additionally, we employed VOSviewer to check for inconsistencies and acronyms in authors’ keywords. Word co-occurrences based on authors’ keywords were analyzed, and the data from VOSviewer was exported to a CSV file. During this process, it was observed that certain keywords were counted separately despite being equivalent, such as asr, automatic speech recognition, automatic speech recognition (asr), and automatic speech recognition technology. These terms were unified and renamed as “automatic speech recognition.” Singular and plural forms, such as educational technology and educational technologies, were merged into one term. Synonymous terms, such as assessment and evaluation, were also merged. Finally, the data was converted to XLSX format for further analysis. The sources of publication were refined by reviewing the initial data extracted from CiteSpace, ensuring consistency in both the publication source titles and the references section for each study (refer to Supplementary Materials B). For instance, the journal Speech Communication was cited in three different ways: “SPEECH COMM,” “SPEECH COMMUN,” and “SPEECH COMMUNICATION.”
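The keyword unification described above can be sketched as a lookup table applied before counting. This is a minimal illustration, not the authors' procedure (which relied on VOSviewer and manual correction). The table entries follow the examples in the text, except that the choice of "assessment" as the canonical form for the assessment/evaluation pair is an assumption, and all function names are hypothetical.

```python
from collections import Counter

# Variant -> canonical form, following the examples given in the text.
CANONICAL = {
    "asr": "automatic speech recognition",
    "automatic speech recognition (asr)": "automatic speech recognition",
    "automatic speech recognition technology": "automatic speech recognition",
    "educational technologies": "educational technology",
    # Assumption: the text merges these two terms but does not say which
    # one was kept as the label.
    "evaluation": "assessment",
}


def normalize(keyword):
    """Lower-case, collapse whitespace, and map known variants to one label."""
    key = " ".join(keyword.lower().split())
    return CANONICAL.get(key, key)


def count_keywords(records):
    """Count in how many records each normalized keyword occurs.

    `records` is a list of keyword lists, one list per article. A set is
    used per record so that two variants of the same keyword in one article
    are counted once.
    """
    counts = Counter()
    for keywords in records:
        counts.update({normalize(k) for k in keywords})
    return counts


# Hypothetical toy records.
toy_records = [
    ["ASR", "Pronunciation"],
    ["Automatic Speech Recognition (ASR)", "educational technologies"],
    ["automatic speech recognition technology", "Educational Technology", "asr"],
]
print(count_keywords(toy_records).most_common(3))
```

Without the lookup table, the three spellings of automatic speech recognition would be tallied as separate keywords and the term would appear less prominent than it is.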
Data analysis
Three software applications were employed for the analysis. CiteSpace was utilized for document co-citation analysis, VOSviewer was used for source co-citation analysis, and BibExcel was employed to calculate the journal h-index. It is important to note that CiteSpace utilized both structural and temporal metrics in the analysis. Temporal metrics consisted of burstness, which signifies the surge of citations, and sigma value, which quantifies the scientific originality of a citation, indicating its novelty within the scientific literature (Aryadoust, 2020). On the other hand, structural metrics included Modularity Q effects, which assess the level of interconnectedness in a network and reveal the clarity of the network structure at the cluster level (Chen and Song, 2019). The silhouette score was employed to gauge the uniformity and similarity of members within a cluster, with higher scores (closer to +1) indicating greater homogeneity (Solmi et al., 2022). Furthermore, betweenness centrality was used to determine the significance of nodes in a network, particularly their positioning between correlated clusters.
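For readers unfamiliar with the h-index, the following minimal sketch shows the standard definition: the largest h such that h publications have at least h citations each. How BibExcel computes the index internally is not described in the text, so this is an illustration of the definition only, and the citation counts are hypothetical.

```python
def h_index(citations):
    """Return the largest h such that h items have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h


# Hypothetical citation counts for the articles of one journal or author.
print(h_index([10, 8, 5, 4, 3]))  # 4
```

Because the index is capped by the number of publications, a source with only three articles cannot exceed an h-index of 3 no matter how often those articles are cited.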
Results
This section presents the findings of the scientometric analysis conducted to explore the research trends and evolution over time in the field of CAPT. The analysis utilized CiteSpace, VOSviewer, and BibExcel and covered the timeframe from 1977 to 2022. The section is organized into four subsections: Clusters, Citing and Co-cited References, Citing Authors, and Citing and Co-cited Journals. Each subsection provides insights into the patterns, influential figures, and knowledge flow within the field of CAPT.
Clusters
We employed CiteSpace to perform a document co-citation analysis (DCA), configuring the timeframe as 1977–2022. The link-retaining factor and look-back years options were set to −1 (unlimited), enabling us to capture all potential outcomes within the specified period. From a total of 403 records, we identified 895 nodes and 4814 links, resulting in the formation of 127 distinct clusters. To narrow the clusters to meet our objectives, we opted to extract the major clusters, resulting in 14. We then ignored small clusters that contained fewer than 5 studies (Authors 1 and 3, 2024), yielding 10 clusters. The Modularity Q value was 0.8317, indicating a significant level of clustering, while the average silhouette value of 0.9269 suggests a high degree of coherence within the clusters. To facilitate visualization, we focused on displaying the largest 10 clusters, which represent the forefront of research in CAPT spanning the past 45 years (see Fig. 1). Further details on the clusters can be found in Table 1.
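To make the Modularity Q value more concrete, the sketch below computes Newman's modularity for an unweighted, undirected toy network with a given cluster assignment. This is an assumption-laden illustration: CiteSpace works on weighted co-citation networks and its exact computation may differ, and the node names and cluster labels are hypothetical.

```python
def modularity(edges, community_of):
    """Newman's modularity Q for an undirected, unweighted graph.

    Q = sum over communities of (intra_edges / m) - (degree_sum / (2 * m)) ** 2,
    where m is the total number of edges.
    """
    m = len(edges)
    intra = {}    # number of edges with both ends in the same community
    degree = {}   # sum of node degrees per community
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree[cu] = degree.get(cu, 0) + 1
        degree[cv] = degree.get(cv, 0) + 1
        if cu == cv:
            intra[cu] = intra.get(cu, 0) + 1
    return sum(
        intra.get(c, 0) / m - (d / (2 * m)) ** 2 for c, d in degree.items()
    )


# Hypothetical toy network: two triangles joined by a single bridge edge.
toy_edges = [
    ("a", "b"), ("b", "c"), ("a", "c"),
    ("d", "e"), ("e", "f"), ("d", "f"),
    ("c", "d"),
]
toy_clusters = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
print(round(modularity(toy_edges, toy_clusters), 3))  # 0.357
```

Values close to 1 mean that almost all links fall inside clusters rather than between them, which is the sense in which a Q of 0.8317 indicates a clearly partitioned network.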
Fig. 1 Clusters mapping.
Table 1 Clusters information.
The largest cluster, “Visual Feedback” (Cluster #0), with 142 members, examines the role of visual feedback and ASR in pronunciation instruction, particularly for foreign language learners. Key studies include and Olson (2014), which explore the effects of visual feedback on pronunciation accuracy. The second-largest cluster, “Segmented Production,” contains 72 members, has a silhouette value of around 0.9, and focuses on the production of pronunciation at the segmental level. Cluster #2, “Training Japanese Listener” (70 members), focuses on training Japanese listeners to distinguish English sounds, with as a major citing article. Cluster #3, “Task-Specific Application” (43 members), highlights the use of ASR for nurse training, with as the leading study. Cluster #6, “Foreign Language Pronunciation” (25 members), examines automatic speech processing for pronunciation tutoring, featuring. Cluster #10, “Large Vocabulary Speech Recognition” (21 members), explores acoustic model improvements for speech recognition, led by. Cluster #12, “Computer” (14 members), discusses the role of visual feedback in computer-assisted pronunciation instruction, with Anderson‐Hsieh et al. (1992) as the key work. Cluster #29, “Use” (7 members), focuses on ASR for oral proficiency assessment, led by. Cluster #32, “Young EFL Learner” (6 members), examines virtual reality-assisted pronunciation training for young learners, with as the major citing article. Finally, Cluster #42, “Virtual Language Teacher” (5 members), explores feedback from virtual teachers in language instruction, with as the key study.
Co-cited references and citing authors
Co-cited references
Co-cited references are references that are frequently cited together in academic publications, indicating a connection between their work. Analyzing co-citation patterns helps identify influential figures and understand the development of ideas in a field (Chen et al., 2010; Small, 1973). In our analysis, we identified many references that made significant contributions to the field of CAPT (see Fig. 2). Among these references, Neri et al. (2002) was distinguished as the most frequently co-cited references, with frequencies of 25 and 18, respectively. Their research has garnered extensive citations and has had a profound impact on the field of CAPT.
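The idea of co-citation can be illustrated with a short sketch: two references are co-cited whenever they appear together in the reference list of the same citing article. This is a simplified illustration rather than CiteSpace's implementation, and the reference keys and toy reference lists are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(reference_lists):
    """Count how often each pair of references is cited by the same article.

    `reference_lists` holds one list of reference keys per citing article.
    Pairs are stored in alphabetical order so that (A, B) and (B, A) are
    treated as the same pair.
    """
    pair_counts = Counter()
    for refs in reference_lists:
        for first, second in combinations(sorted(set(refs)), 2):
            pair_counts[(first, second)] += 1
    return pair_counts


# Hypothetical reference lists of three citing articles.
toy_lists = [
    ["Neri2002", "Derwing2005", "Chun1998"],
    ["Neri2002", "Derwing2005"],
    ["Chun1998", "Neri2002"],
]
for pair, count in cocitation_counts(toy_lists).most_common():
    print(pair, count)
```

The resulting pair counts form the weighted links of a co-citation network, on which clustering and centrality measures are then computed.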
Fig. 2 Co-cited references mapping.
Derwing and Munro (2005) closely follow with 17 co-citations, indicating the substantial influence of their research contributions. Likewise, Golonka et al. (2014) and Chiu et al. (2007) have been co-cited 16 and 13 times, respectively, highlighting their significant presence in the field. Table 2 lists the top 11 highly co-cited references, offering an overview of their contributions and influence within the field.
Table 2 Co-cited references by frequency.
Citation bursts
Centrality analysis provides insights into the most influential references within a research network. We identified the top 10 references that received the highest centrality scores. Neri et al.’s (2002) study obtained the highest centrality score of 0.09, emphasizing the significant impact of their work. Chun’s (1998) publication secured the second position with a centrality score of 0.08, followed by Derwing and Munro’s (2005) research with a score of 0.07. Table 3 includes the remaining notable references that have contributed significantly to the field.
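Betweenness centrality measures how often a node lies on the shortest paths between other nodes. The sketch below implements Brandes' algorithm for an unweighted, undirected network and normalizes scores to the range 0 to 1. It is offered as an illustration under these assumptions; CiteSpace's own computation on the co-citation network may differ in detail, and the toy network is hypothetical.

```python
from collections import deque


def betweenness_centrality(adjacency):
    """Normalized betweenness centrality (Brandes' algorithm).

    `adjacency` maps every node to a list of its neighbours and must be
    symmetric, i.e. describe an undirected, unweighted graph.
    """
    nodes = list(adjacency)
    score = {v: 0.0 for v in nodes}
    for source in nodes:
        # Breadth-first search counting shortest paths from `source`.
        order = []
        preds = {v: [] for v in nodes}
        n_paths = {v: 0 for v in nodes}
        dist = {v: -1 for v in nodes}
        n_paths[source] = 1
        dist[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adjacency[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    n_paths[w] += n_paths[v]
                    preds[w].append(v)
        # Accumulate dependencies, farthest nodes first.
        dependency = {v: 0.0 for v in nodes}
        while order:
            w = order.pop()
            for v in preds[w]:
                dependency[v] += n_paths[v] / n_paths[w] * (1 + dependency[w])
            if w != source:
                score[w] += dependency[w]
    n = len(nodes)
    # Each unordered pair was visited from both ends, so dividing by
    # (n - 1) * (n - 2) yields the usual 0-to-1 normalization.
    scale = 1.0 / ((n - 1) * (n - 2)) if n > 2 else 1.0
    return {v: score[v] * scale for v in nodes}


# Hypothetical toy network: a path a - b - c, where b bridges a and c.
toy_graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
print(betweenness_centrality(toy_graph))
```

In a large, strongly clustered network most shortest paths stay inside clusters, so even the most central references tend to receive small normalized scores, consistent with the maximum of 0.09 reported here.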
Table 3 Top 10 references with the highest (betweenness) centrality.
The burst detection analysis revealed references that experienced sudden surges in attention within specific periods. Liakin et al.’s (2015) reference emerged as the most prominent, with a burstness score of 5.67 between 2019 and 2022. Neri et al.’s (2008) work closely followed with a burstness score of 5.38. Figure 3 provides an overview of the top 10 co-cited references with the highest burstness. These references are marked by red tree rings around their nodes in Figs. 4 and 5, which signal their significance in the network.
Fig. 3 Top 10 references with strongest citation bursts.
Fig. 4 Timeline view of clusters with authors.
Fig. 5 Co-cited journals mapping.
Citing authors
We identified and analyzed a total of 989 unique authors who contributed to the pool of studies in the field of CAPT. The publications of these authors were tallied and examined to calculate their respective h-index scores.
Among the most influential authors in the pool of studies, Cucchiarini, C. and Strik, H. both have an h-index of 8, with Cucchiarini accumulating 366 citations across 10 articles and Strik 368 citations across 11 articles. Fouz-González, J. follows with an h-index of 4 and 108 citations. Next is Li, H., with an h-index of 3 and 1170 citations across 3 articles, a citation count that reflects a substantial mark on the field. Eskenazi, M., also with an h-index of 3, has garnered 304 citations across 3 articles. Table 4 presents a list of authors with the highest h-index scores.
Table 4 Citing authors metrics by h-index.
Citing and co-cited journals
Citing journals
As previously mentioned, 153 journals have published research on CAPT. The most influential journals, based on the number of articles published and citations received, are presented in Table 5. This table lists the top 12 journals with the highest h-indices, reflecting both their productivity and citation impact. Among the analyzed journals, Speech Communication emerged as the most productive journal, with 66 articles meeting the inclusion criteria. Moreover, it boasted the highest h-index of 26, indicating its significant impact within the field. The Journal of the Acoustical Society of America and Computer Assisted Language Learning tied for second place with an h-index of 14. Nevertheless, the former had 23 articles meeting the inclusion criteria, while the latter had 24 articles. For further details about the journal titles that published CAPT research and their metrics, please see the Supplementary Materials C.
Table 5 Citing journals metrics by h-index.
Co-cited journals
Co-cited journals are academic publications frequently cited together, suggesting a strong relationship or shared subject areas. Analyzing these citation patterns helps researchers grasp a field’s intellectual structure and identify influential journals (Chen et al., 2010; Small, 1973).
Table 6 shows the highly co-cited journals within the domain of CAPT, indicating their significant contributions and influence on the research in this area. Journal of the Acoustical Society of America tops the list with a frequency of 522 co-citations, followed by Computer Assisted Language Learning with 390 co-citations. Speech Communication follows closely with 307 co-citations. Other prominent journals in the field include Language Learning & Technology, CALICO Journal, TESOL Quarterly, System, Language Learning, The Journal of the Acoustical Society of America, and ReCALL. These highly co-cited journals serve as valuable resources for researchers in the field and contribute to the advancement of CAPT. Figure 6 provides a visualization of the co-cited journals.
Table 6 Co-cited journals by frequency.
Fig. 6 Author keywords mapping.
Author keywords
Author keywords are chosen by research paper authors to emphasize the primary subjects or ideas addressed in their work. By examining the frequency of these terms, one can uncover the core issues within the field being studied and explore the connections between topics and keywords (Li and Mingyu, 2018).
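As a rough illustration of keyword frequency analysis, the Python sketch below counts how many articles list each author keyword after a simple normalization (lowercasing, and treating hyphens and repeated spaces as single spaces). The function name `keyword_frequencies`, the normalization rules, and the sample keyword lists are all assumptions made for illustration. They do not reproduce the settings of the tool used in this study.

```python
from collections import Counter


def keyword_frequencies(keyword_lists):
    """Count the number of articles in which each normalized keyword appears."""
    counts = Counter()
    for keywords in keyword_lists:
        # Normalize spelling variants and count each keyword once per article.
        normalized = {
            " ".join(k.lower().replace("-", " ").split()) for k in keywords
        }
        counts.update(normalized)
    return counts


# Invented author-keyword lists for three articles (not data from the corpus).
toy_keywords = [
    ["Pronunciation", "Automatic Speech Recognition"],
    ["pronunciation", "Computer-Assisted Language Learning"],
    ["Computer assisted language learning", "Pronunciation ", "pronunciation"],
]

freq = keyword_frequencies(toy_keywords)
print(freq.most_common(2))
```

How variants are merged affects the resulting ranking. For example, whether "Automatic Speech Recognition" and "Speech Recognition" are treated as one keyword or two changes their reported frequencies.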
Among the top 11 keywords in the field of CAPT, Computer Assisted Language Learning is the most prominent, with 63 occurrences. This indicates the central role of computers in language learning. Pronunciation follows closely behind with 57 occurrences, highlighting the focus on this aspect within the field. Another important keyword is Automatic Speech Recognition, mentioned 49 times, showing its relevance and use in CAPT. Speech Recognition is referenced 29 times, further emphasizing its importance to the field. Mobile Assisted Language Learning appears 14 times, reflecting growing interest in incorporating mobile technology into pronunciation training. Table 7 summarizes the top 11 keywords used by authors.
Table 7 Author keywords by occurrences.
Discussion
This scientometric analysis seeks to consolidate research on CAPT by using quantitative metrics to identify key research trends and examine how these topics have evolved over time. The results designate “Visual Feedback” as the predominant cluster, suggesting a concentrated research effort in exploring the efficacy of visual feedback for pronunciation learning.
The emerging themes of CAPT research
CAPT research centers primarily on segmental aspects of pronunciation, such as consonants and vowels, as corroborated by systematic reviews and meta-analyses (Amrate and Tsai, 2024; Mahdi and Al Khateeb, 2019; Tseng et al., 2022). This focus suggests a potential gap in research on suprasegmental features, which are crucial for overall communicative competence. CAPT has undergone significant transformations, including advancements in ASR for improved feedback accuracy, visualizations such as spectrograms and waveforms, mobile applications that make training more accessible (Alghazo et al., 2023), and Natural Language Processing (NLP) for more precise error identification and correction suggestions. These findings are in line with Rogerson-Revell’s (2021) observation that CAPT research highlights the need for greater synergy between technological design and functionality on the one hand and pedagogic purpose on the other, with a focus on accurate and individualized automated feedback for pronunciation. The temporal evolution of CAPT research reveals several trends: a shift from technology-centric to pedagogy-integrated approaches (Clymer et al., 2020), an increasing emphasis on learner autonomy through mobile applications and self-study tools, growing attention to cultural and linguistic diversity in pronunciation models (Alghazo and Zidan, 2019), and emerging multimodal approaches that combine visual, auditory, and kinesthetic feedback (Alghazo, 2015). These trends collectively demonstrate the field’s progression towards more comprehensive, inclusive, and pedagogically sound CAPT systems that cater to diverse learner needs and contexts.
The emerging clusters of CAPT research
The emerging clusters of CAPT research highlight the significant role of visual feedback in enhancing language learners’ pronunciation skills. Visual feedback, which includes cues and representations, provides learners with a clearer understanding of their pronunciation errors and areas for improvement. Studies indicate that CAPT utilizing visual feedback, such as ultrasound imaging, leads to better pronunciation accuracy compared to traditional auditory feedback (Bryfonski, 2023). The development of multimodal CAPT environments allows for tailored visual feedback across different languages, addressing specific phonetic challenges and improving learner engagement (Blake et al., 2024). Research shows that using visual speech cues significantly aids learners in distinguishing difficult sounds, especially in distracting auditory environments (Grabowska-Chenczke et al., 2023). While visual feedback is beneficial, some argue that it may not replace the necessity of auditory feedback entirely, as both modalities can complement each other in a comprehensive CAPT framework.
The most influential research articles in CAPT research
The substantial volume of publications across educational technology and language teaching journals corroborates the growing interest in CAPT as a critical tool for foreign language acquisition. A thematic analysis of authors’ keywords, carried out with VOSviewer, serves as a barometer for gauging the evolving contours of CAPT research. Key terms such as “Computer Assisted Language Learning,” “Pronunciation,” “Automatic Speech Recognition,” “Speech Recognition,” and “Mobile Assisted Language Learning” all converge on the central theme of leveraging technology to enhance second/foreign language pronunciation skills. The recurrent themes indicate not just the vitality of the field but also the multidimensional approaches being employed to tackle complex issues related to language acquisition and pronunciation.
The most co-cited references in CAPT research
The top three highly co-cited references were Neri et al. (2002), Levis (2007), and Derwing and Munro (2005). These studies focused on three crucial areas where computer technology and pronunciation interact: (1) appropriate educational goals and the evaluation of progress; (2) the capability of CAPT to provide helpful, automatic feedback; and (3) the application of technology in the identification of pronunciation problems. They also examined how pedagogy and technology interact in CAPT courses, and discussed how to select the best teaching strategies and set the right pedagogical priorities for the classroom.
The most co-cited sources of publication in CAPT research
It is noteworthy that journals devoted to Computer-Assisted Language Learning (CALL) held a dominant position in terms of co-cited references, the number of articles published, and the Hirsch index (h-index), a measure that combines productivity and citation impact. Other publication venues have also shown strong metrics in this field. For instance, Speech Communication and the Journal of the Acoustical Society of America have published widely on work that combines technological innovation with language acquisition and teaching approaches. These journals promote the investigation of new technologies for language learning and instruction by exploring how those technologies connect with pedagogical difficulties in language teaching. The h-index of authors who have contributed to CAPT studies shows a similar pattern. The highest author h-index observed in this dataset is 8. Cucchiarini, C., Strik, H., and Fouz-González, J. are the three authors with the highest h-index scores, which demonstrates their significant impact on CAPT research.
Conclusion
This scientometric analysis serves as a comprehensive overview of the prevailing research landscape focused on CAPT as an innovative modality for L2 pronunciation improvement. The study employs a range of quantitative metrics to identify emergent clusters, serving as research hotspots within the CAPT domain. Although the current analysis serves as an initial foray that necessitates further in-depth exploration, the findings illuminate critical dimensions underpinning the developmental trajectory of CAPT. The collective aim of the analyzed body of literature is to uncover efficacious techniques for optimizing L2 pronunciation through the use of CAPT. These insights pave the way for targeted future research initiatives, contributing to the ongoing advancement of the field.
This scientometric study has limitations that provide avenues for subsequent research. First, the current analysis may not encompass all relevant terminology, leaving scope for future work to extend it and achieve a more exhaustive understanding of the subject. Second, the research is confined to a scientometric approach and does not venture into other bibliometric dimensions such as collaboration metrics, geographical or institutional productivity, and word co-occurrence analyses. Future investigations should examine these additional bibliometric facets for a more comprehensive and nuanced understanding of the field.
Data availability
The datasets generated during and/or analyzed during the current study are available from the corresponding author upon reasonable request.
References
AbuSeileek A (2007) Cooperative vs. individual learning of oral skills in a CALL environment. Comput Assist Lang Learn 20:493–514. https://doi.org/10.1080/09588220701746054
Alghazo S, Jarrah M, Al Salem MN (2023) The efficacy of the type of instruction on second language pronunciation acquisition. Front Educ 8(1):1182285. https://doi.org/10.3389/feduc.2023.1182285
Alghazo SM (2015) The role of curriculum design and teaching materials in pronunciation learning. Res Lang 13(3):316–333
Alghazo SM, Zidan MN (2019) Native-speakerism and professional teacher identity in L2 pronunciation learning. Indonesian J Appl Linguist 9(1):241–251
Amrate M, Tsai PH (2024) Computer-assisted pronunciation training: a systematic review. ReCALL 1–21. https://doi.org/10.1017/S0958344024000181
Anderson‐Hsieh J, Johnson R, Koehler K (1992) The relationship between native speaker judgments of nonnative pronunciation and deviance in segmentals, prosody, and syllable structure. Lang Learn 42(4):529–555
Aryadoust V (2020) A review of comprehension subskills: a scientometrics perspective. System 88:102180
Bashori M, van Hout R, Strik H, Cucchiarini C (2022) Web-based language learning and speaking anxiety. Comput Assist Lang Learn 35(5-6):1058–1089
Blake J, Bogach N, Kusakari A, Lezhenin I, Khaustova V, Xuan SL, Pyshkin E (2024) An open CAPT system for prosody practice: practical steps towards multilingual setup. Languages 9(1):27. https://doi.org/10.3390/languages9010027
Blake RJ (2013) Brave new digital classroom: technology and foreign language learning. Georgetown University Press
Bryfonski L (2023) Is seeing believing? The role of ultrasound tongue imaging and oral corrective feedback in L2 pronunciation development. J Second Lang Pronunc 9(1):103–129. https://doi.org/10.1075/jslp.22051.bry
Chen AH (2014) The effects of employing MYET on college students’ pronunciation skills training in Taiwan. IJBRITISH 1(3):43–53
Chen C, Song M (2019) Visualizing a field of research: a methodology of systematic scientometric reviews. PLoS ONE 14(10):e0223994
Chen C, Ibekwe-SanJuan F, Hou J (2010) The structure and dynamics of cocitation clusters: a multiple-perspective co-citation analysis. J Am Soc Inf Sci Technol 61(7):1386–1409. https://doi.org/10.1002/asi.21309
Chiu T-L, Liou H-C, Yeh Y (2007) A study of web-based oral activities enhanced by Automatic Speech Recognition for EFL college learning. Comput Assist Lang Learn 20(3):209–233. https://doi.org/10.1080/09588220701489374
Chun D (2002) Discourse Intonation in L2: from theory and research to practice. University of California, Santa Barbara
Chun D (1998) Signal analysis software for teaching discourse intonation. Lang Learn Technol 2(1):74–93. http://llt.msu.edu/vol2num1/article4/
Clymer E, Alghazo S, Naimi T, Zidan M (2020) CALL, native-speakerism/culturism, and neoliberalism. Interchange 51(3):209–237
Correia A, Paredes H, Fonseca B (2018) Scientometric analysis of scientific publications in CSCW. Scientometrics 114:31–89
Dai Y, Wu Z (2023) Mobile-assisted pronunciation learning with feedback from peers and/or automatic speech recognition: a mixed-methods study. Comput Assist Lang Learn 36(5-6):861–884. https://doi.org/10.1080/09588221.2021.1952272
Derwing TM, Munro MJ (2005) Second language accent and pronunciation teaching: a research-based approach. TESOL Q 39(3):379. https://doi.org/10.2307/3588486
Derwing TM, Munro MJ (2015) Pronunciation fundamentals: evidence-based perspectives for L2 teaching and research, vol. 42. John Benjamins Publishing Company
Garcia C, Kolat M, Morgan TA (2018) Self-correction of second-language pronunciation via online, real-time, visual feedback. Pronunc Second Lang Learn Teach Proc 9(1):54–65
Golonka EM, Bowles AR, Frank VM, Richardson DL, Freynik S (2014) Technologies for foreign language learning: a review of technology types and their effectiveness. Comput Assist Lang Learn 27(1):70–105. https://doi.org/10.1080/09588221.2012.700315
Gorjian B, Hayati A, Pourkhoni P (2013) Using Praat software in teaching prosodic features to EFL learners. Procedia Soc Behav Sci 84:34–40. https://doi.org/10.1016/j.sbspro.2013.06.505
Grabowska-Chenczke O, Francuz P, Bałaj B (2023) Role of visual speech cues (Cued Speech) in foreign language learning by hearing school-age children. Roczn Psychol 26(3):215–240
Hincks R (2003) Speech technologies for pronunciation feedback and evaluation. ReCALL 15(1):3–20. https://doi.org/10.1017/S0958344003000211
Hirata Y (2004) Computer assisted pronunciation training for native English speakers learning Japanese pitch and durational contrasts. Comput Assist Lang Learn 17(3-4):357–376. https://doi.org/10.1080/0958822042000319629
Kalikow D, Swets J (1972) Experiments with computer-controlled displays in second-language learning. IEEE Trans Audio Electroacoust 20(1):23–28
Kim I-S (2006) Automatic speech recognition: reliability and pedagogical implications for teaching pronunciation. J Educ Technol Soc 9(1):322–334
Lee J, Jang J, Plonsky L (2015) The effectiveness of second language pronunciation instruction: a meta-analysis. Appl Linguist 36(3):345–366. https://doi.org/10.1093/applin/amu040
Levis J (2007) Computer technology in teaching and researching pronunciation. Annu Rev Appl Linguist 27:184–202. https://doi.org/10.1017/S0267190508070098
Levis J, Pickering L (2004) Teaching intonation in discourse using speech visualization technology. System 32(4):505–524. https://doi.org/10.1016/j.system.2004.09.009
Levis JM (2005) Changing contexts and shifting paradigms in pronunciation teaching. TESOL Q 39(3):369–377
Li L, Mingyu G (2018) Developments and prospects of legal English: a scientometric analysis. J Zhejiang Gongshang Univ 32(4):66–77
Liakin D, Cardoso W, Liakina N (2015) Learning L2 pronunciation with a mobile speech recognizer: French /y/. CALICO J 32(1):1–25. https://doi.org/10.1558/cj.v32i1.25962
Lim MH, Aryadoust V (2021) A scientometric review of research trends in computer-assisted language learning (1977–2020). Comput Assist Lang Learn. https://doi.org/10.1080/09588221.2021.1892768
Liu Y (2008) The effectiveness of integrating commercial pronunciation software into an ESL pronunciation class. Master of Arts, Iowa State University
Mahdi HS, Al Khateeb AA (2019) The effectiveness of computer‐assisted pronunciation training: A meta‐analysis. Rev Educ 7(3):733–753
McCrocklin S (2012) Effect of audio vs. video on aural discrimination of vowels. TESL-EJ 16(2):1–16
Mohsen MA, Althebi S, Alsagour R, Alsalem A, Almudawi A, Alshahrani A (2024) Forty-two years of computer-assisted language learning research: a scientometric study of hotspot research and trending issues. ReCALL 36(2):230–249
Mohsen MA, Ho YS (2024) Thirty years of educational research in Saudi Arabia: a bibliometric study. Interact Learn Environ 32(5):1763–1778
Mohsen MA, Althebi S, Albahooth M (2023) A scientometric study of three decades of machine translation research: trending issues, hotspot research, and co-citation analysis. Cogent Arts & Humanities 10(1):2242620
Motohashi-Saigo M, Hardison D (2009) Acquisition of L2 Japanese geminates training with waveform displays. Lang Learn Technol 13(2):29–47
Mroz A (2018) Seeing how people hear you: French learners experiencing intelligibility through automatic speech recognition. Foreign Lang Ann 51(3):617–637. https://doi.org/10.1111/flan.12348
Neri A, Cucchiarini C, Strik H, Boves L (2002) The pedagogy-technology interface in computer assisted pronunciation training. Comput Assist Lang Learn 15(5):441–467. https://doi.org/10.1076/call.15.5.441.13473
Neri A, Mich O, Gerosa M, Giuliani D (2008) The effectiveness of computer-assisted pronunciation training for foreign language learning by children. Comput Assist Lang Learn 21(5):393–408. https://doi.org/10.1080/09588220802447651
Olson DJ (2014) Benefits of visual feedback on segmental production in the L2 classroom. Lang Learn Technol 18(3):173–192
Olson DJ (2022) Visual feedback and relative vowel duration in L2 pronunciation: the curious case of stressed and unstressed vowels. In: Levis J, Guskaroska A (eds) Proceedings of the 12th pronunciation in second language learning and teaching conference, held virtually at Brock University, St. Catharines, ON, June 2022
Patten I, Edmonds LA (2015) Effect of training Japanese L1 speakers in the production of American English /r/ using spectrographic visual feedback. Comput Assist Lang Learn 28(3):241–259. https://doi.org/10.1080/09588221.2013.839570
Pennington MC, Richards JC (1986) Pronunciation revisited. TESOL Q 20(2):207–225. https://doi.org/10.2307/3586541
Pennington MC, Rogerson-Revell P (2019) Using technology for pronunciation teaching, learning, and assessment. In: Pennington MC, Rogerson-Revell P (eds) English pronunciation teaching and research. Palgrave Macmillan, London, pp 235–286
Rogerson-Revell PM (2021) Computer-assisted pronunciation training (CAPT): current issues and future directions. RELC J 52(1):189–205. https://doi.org/10.1177/0033688220977406
Seferoğlu G (2005) Improving students’ pronunciation through accent reduction software. Br J Educ Technol 36(2):303–316. https://doi.org/10.1111/j.1467-8535.2005.00459.x
Small H (1973) Co-citation in the scientific literature: a new measure of the relationship between two documents. J Am Soc Inf Sci 24(4):265–269. https://doi.org/10.1002/asi.4630240406
Solmi M, Daure CC, Buot A, Ljuslin M, Verroust V, Mallet L, Khazaal Y, Rothen S, Thorens G (2022) A century of research on psychedelics: a scientometric analysis on trends and knowledge maps of hallucinogens, entactogens, entheogens and dissociative drugs. Eur Neuropsychopharmacol 64:44–60
Sousa DA (2010) How the ELL brain learns. Corwin Press
Tseng WT, Chen S, Wang SP, Cheng HF, Yang PS, Gao XA (2022) The effects of MALL on L2 pronunciation learning: a meta-analysis. J Educ Comput Res 60(5):1220–1252. https://doi.org/10.1177/07356331211058662
Warschauer M, Healey D (1998) Computers and language learning: an overview. Lang Teach 31(2):57–71. https://doi.org/10.1017/S0261444800012970
Wong R (1987) Teaching pronunciation: focus on English rhythm and intonation. Prentice-Hall, London, UK
Acknowledgements
The authors would like to thank the Deanship of Scientific Research and Graduate Studies at Najran University for funding this project under the grant No. (NU/GP/SEHRC/13/186).
Author information
Authors and Affiliations
College of Languages and Translation, Najran University, Najran, Saudi Arabia
Mohammed Ali Mohsen, Sultan Hassan AlThebi & Abdulaziz Alsharani
Department of English, Arab Open University, Dammam, Saudi Arabia
Hassan Saleh Mahdi
Department of English, Taif University, Taif, Saudi Arabia
Reem Alkhammash
Contributions
Mohammed Mohsen: project administration, visualization, data curation, writing first draft, methodology. Hassan Saleh Mahdi: conceptualization, methodology, data curation, writing. Sultan Althebi: data curation, validation, sources, visualization. Reem Alkhammash: data curation, methodology, writing. Abdulaziz Alshahrani: validation, funding acquisition, data curation.
Corresponding author
Correspondence to Mohammed Ali Mohsen.
Ethics declarations
Competing interests
The authors declare no competing interests.
Ethical approval
This article does not contain any studies with human participants performed by any of the authors.
Informed consent
Ethical approval was not required as the study did not involve human participants.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Supplementary Materials
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Mohsen, M.A., Mahdi, H.S., AlThebi, S.H. et al. A scientometric study of computer-assisted pronunciation training in second language acquisition: technological affordances and research trends. Humanit Soc Sci Commun 12, 438 (2025). https://doi.org/10.1057/s41599-025-04474-y
Received: 09 July 2024
Accepted: 29 January 2025
Published: 27 March 2025
DOI: https://doi.org/10.1057/s41599-025-04474-y