In Content and Language Integrated Learning (CLIL), an educational approach characterized by Coyle (2007, p. 545) as one ‘where both language and content are conceptualized on a continuum without an implied preference for either’, subject content is taught through a foreign language as the medium of instruction (Coyle et al., 2010; Lasagabaster & Sierra, 2010). The last two decades have witnessed an increasing number of CLIL programmes across Europe and beyond, from primary and secondary education to universities, in response to the demand for improved foreign language teaching (Dafouz et al., 2014; Merino & Lasagabaster, 2015; Wei & Feng, 2015; Yang, 2015). Alongside this boom in CLIL practice, many studies of its outcomes have been conducted. Unfortunately, such studies appear to have been outpaced by the rapid expansion of CLIL (Merino & Lasagabaster, 2015), and the inconsistent results emerging from them have not been fully examined. Some scholars (e.g., Dalton-Puffer, 2011; Merino & Lasagabaster, 2015) suggest that CLIL programmes benefit learners, but the results are not always optimistic (Yang, 2015). Inconsistent findings concerning the effectiveness of CLIL have been reported in both linguistic (e.g., listening, writing) and non-linguistic (e.g., motivation, content achievement) areas.
Against this background, the present article critically reviews CLIL outcomes in previous quantitative research from the perspective of methodological rigour, which advocates fuller use of effect sizes (ESs) and less reliance upon statistical significance (viz. the p value). In doing so, it attempts to contribute to a better understanding of the inconsistent findings concerning the outcomes of CLIL. In quantitative research, attention should centre on the magnitude of a difference or the strength of a correlation, which is reflected by the ES rather than by the p value (Cohen, 1990). The importance of ES over p has also been highlighted in recent years. For example, the use of p resulting from the null hypothesis significance testing procedure has been banned by the editors of Basic and Applied Social Psychology, who ‘require strong descriptive statistics, including effect sizes’ (Trafimow & Marks, 2015, p. 1). However, the importance of ESs has not been widely recognized or put into practice in applied linguistics in general (Kong & Wei, 2019; Wei et al., 2019) and in CLIL research in particular. Accordingly, the present article illustrates the importance of ESs through typical studies of CLIL outcomes that fail to make fuller use of ESs and/or rely too heavily on p. Such examination of methodological rigour matters for CLIL research because, as the review below will show, misunderstanding of ESs and p values can lead to misinterpretation of CLIL outcomes. The article also aims to provide pedagogical suggestions for EAP practitioners. The following review of CLIL outcomes is structured around two themes: encouraging fuller use (i.e., reporting and interpreting) of ESs, and encouraging less reliance upon p.
Research on CLIL outcomes and ES
In connection with reporting, ESs should be reported both when results are statistically significant and when they are not (Larson-Hall & Plonsky, 2015; Sun et al., 2010). It is not uncommon to see scholars reporting only p values and omitting ESs in CLIL research (e.g., Nieto Moreno de Diezmas & Matthew, 2019). Likewise, some researchers (e.g., Pérez Cañado & Lancaster, 2017) report ESs only when the results are statistically significant.
In terms of ES interpretation, four points merit attention. First, scholars should always specify which ES benchmark system they draw upon to avoid ambiguity. Studies on CLIL outcomes which fail to specify the benchmark system include Prieto-Arranz et al. (2015) and Roquet et al. (2016).
Second, it is more advisable to draw on a topic-specific ES benchmark system. Cohen’s (1988) benchmarks (e.g., for Pearson’s r, .10 as small, .30 medium, and .50 large), used ‘by too many researchers as iron-clad criteria’ (Wei et al., 2019, p. 2), are based on the ESs typically found in studies in the behavioural sciences, so they ‘do not have absolute meaning and are only relative to typical findings in these areas’ (Leech et al., 2005, p. 56). An ES benchmark most relevant to the focal research topic should therefore be prioritized. For example, Martínez Agudo (2021) examined the strength of association between motivational variables and learning competence in natural science among 156 CLIL learners and their 162 non-CLIL counterparts. Motivational variables were measured with Pelechano’s (1994) motivation instrument, and performance in natural science with final grades obtained from the schools. Based on the correlational strength (r = .222 and -.212 respectively), the researcher claimed to find a ‘weak degree of correlation’ (p. 236) between two of the motivational variables (i.e., self-demand and lack of interest) and CLIL students’ content achievement. Considering that this is a survey-based study, Wei and Hu’s (2019) ES interpretation system is recommended (Wei et al., 2019). In this more topic-specific system, the small, medium, large, and very large cut-offs for the ES index R² are .005, .01, .02, and .09 respectively; when R² is un-squared (r), the corresponding cut-offs are (roughly) .07, .10, .14, and .30. According to this system, r = .222 and r = -.212 from Martínez Agudo (2021) both fall between the large and very large benchmarks. The correlation between the two motivational variables and content-subject learning performance can therefore be reinterpreted as large, which differs markedly from the study’s conclusions, such as ‘a low correlation between the motivational variables scrutinized and content achievement levels in CLIL settings’ (p. 237).
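To make the contrast between the two benchmark systems concrete, the following minimal Python sketch classifies a Pearson’s r against both sets of cut-offs quoted above (the function and constant names are ours, for illustration only):

```python
# Classify a Pearson's r against the two benchmark systems discussed above.
# The cut-off values come from the text; the code itself is illustrative.

COHEN_1988 = [(0.50, "large"), (0.30, "medium"), (0.10, "small")]
WEI_HU_2019 = [(0.30, "very large"), (0.14, "large"), (0.10, "medium"), (0.07, "small")]

def classify(r, benchmarks):
    """Return the label of the highest benchmark that |r| reaches."""
    magnitude = abs(r)
    for cutoff, label in benchmarks:  # benchmarks sorted from largest cut-off down
        if magnitude >= cutoff:
            return label
    return "below small"

for r in (0.222, -0.212):
    print(f"r = {r:+.3f}: Cohen (1988) -> {classify(r, COHEN_1988)}, "
          f"Wei & Hu (2019) -> {classify(r, WEI_HU_2019)}")
# r = +0.222: Cohen (1988) -> small, Wei & Hu (2019) -> large
# r = -0.212: Cohen (1988) -> small, Wei & Hu (2019) -> large
```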
Third, scholars should value ESs over ps when both are reported. Apart from judging whether the result is statistically significant according to p, researchers should rely more on ES benchmark systems for interpretation, because the ES is much more important than p (Larson-Hall, 2016; Wei et al., 2019). Hence, researchers’ failure to interpret ESs can potentially diminish the importance of their findings (Wei & Hu, 2019). For instance, comparing the effect of CLIL on overall English proficiency between a CLIL- group (208 high school students who attended 3.4 CLIL sessions weekly on average) and a CLIL+ group (108 high school students who attended 8.4 CLIL sessions weekly on average) after one year, Merino and Lasagabaster (2017) concluded that ‘the evolution of CLIL+ students was significantly higher than that of their CLIL- counterparts (with a small r effect size: r = 0.27)’ (p. 8) (p < .001) and that ‘the CLIL- participants progressed less than the CLIL+ participants’ (p. 9). It appears that the authors relied only on p to reach this conclusion. If the ‘small’ effect size is taken into consideration, a different interpretation emerges: more exposure to CLIL resulted in only a small amount of additional progress, which might be negligible in practice.
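For readers wishing to trace where such an r index comes from, one common convention (whether the original study used it is an assumption here) derives r from a standardized test statistic as r = Z/sqrt(N). A minimal back-calculation from the values reported above:

```python
import math

# Illustrative back-calculation, assuming the common convention r = Z / sqrt(N).
# Whether Merino & Lasagabaster (2017) used this formula is an assumption;
# only r = 0.27 and the group sizes are taken from the text.

n_total = 208 + 108          # CLIL- plus CLIL+ participants
r = 0.27                     # the reported effect size
z = r * math.sqrt(n_total)   # the Z statistic implied by this r
print(f"implied Z = {z:.2f} for r = {r} with N = {n_total}")
# implied Z ≈ 4.80: comfortably 'statistically significant', yet the effect
# itself sits below Cohen's (1988) medium benchmark of r = .30.
```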
Similarly, Pavón Vázquez (2018), without drawing upon the reported ES (d = -.978), claimed that there were no statistically significant differences (p = .051) in English pronunciation between thirty rural students and five urban students in the sixth grade of Spanish primary education who received CLIL. In the words of two renowned statisticians, ‘surely, God loves the .06 nearly as much as the .05’ (Rosnow & Rosenthal, 1989, p. 1277). Researchers should not simply dismiss a difference of this magnitude because p was slightly higher than the conventional cut-off (.05). A more reliable approach is to interpret the results based on ESs. If Cohen’s (1988) system is adopted, the above inter-group difference can be regarded as ‘large’ because the magnitude of d (.978) exceeds Cohen’s ‘large’ benchmark (.80). Pavón Vázquez (2018) argued that pronunciation was one of the very few linguistic sub-aspects in which the urban and rural participants levelled out; the rather large d, however, supports a different reading: the urban students outperformed their rural counterparts even in pronunciation, suggesting that their overall foreign language competence was better than that of the rural cohort.
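The arithmetic behind this point can be reproduced with summary statistics. In the sketch below, the means and standard deviations are invented so that d and p land near the values discussed above; only the group sizes mirror the study:

```python
import math
from scipy import stats

# Hypothetical summary statistics (NOT from Pavón Vázquez, 2018): means and SDs
# are chosen so that d and p land near the reported values (d ≈ -.98, p ≈ .05).
n1, m1, sd1 = 30, 5.00, 1.0   # rural group (invented scores)
n2, m2, sd2 = 5, 5.97, 1.0    # urban group (invented scores)

pooled_sd = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
d = (m1 - m2) / pooled_sd                      # Cohen's d with pooled SD
t, p = stats.ttest_ind_from_stats(m1, sd1, n1, m2, sd2, n2)
print(f"d = {d:.3f}, p = {p:.3f}")             # d = -0.970, p ≈ .053
# A difference of nearly one full standard deviation fails to reach p < .05
# simply because one of the groups contains only five students.
```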
Fourth, whenever possible, researchers should adequately interpret ESs. One fruitful way is to compare the magnitudes of their ESs with those from previous research (Larson-Hall & Plonsky, 2015). When interpreting the difference between the CLIL and non-CLIL groups in oral comprehension at the second post-test phase, Pérez Cañado and Lancaster (2017) claimed to find no inter-group difference based on p (.069). However, this large p was probably attributable to the very small sample size (only twelve students per group). The recalculated ES (d = .78) indicates that the CLIL cohort outperformed its counterpart in oral comprehension; when interpreted according to Cohen’s (1988) system, this d reveals a large effect because, after rounding, it reaches the ‘large’ benchmark (.80). For adequate interpretation, the authors could have compared this ES to those in other studies examining differences in listening competence between CLIL and non-CLIL students (e.g., d = 1.09 after recalculation in Lasagabaster (2008), one of the studies reviewed in Pérez Cañado & Lancaster (2017)), which may trigger questions about which factors cause the differences in CLIL outcomes across related studies. Such comparison is good practice for interpreting ESs, as it helps inform their practical meaning (Sun et al., 2010).
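Even a very simple tabulation can make such cross-study contextualization explicit; a minimal sketch using the two d values mentioned above (the labels are ours):

```python
# Minimal sketch: lining up d values from studies of the same construct
# (CLIL vs non-CLIL listening/oral comprehension). Both values come from
# the discussion above.
COHENS_LARGE_D = 0.80

recalculated_ds = {
    "Pérez Cañado & Lancaster (2017)": 0.78,
    "Lasagabaster (2008)": 1.09,
}

for study, d in sorted(recalculated_ds.items(), key=lambda kv: kv[1]):
    verdict = "reaches" if round(d, 1) >= COHENS_LARGE_D else "falls below"
    print(f"{study}: d = {d:.2f} ({verdict} Cohen's 'large' benchmark after rounding)")
```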
Research on CLIL outcomes and p
In addition to making fuller use of ESs, less reliance on p should be highlighted. First, researchers are expected to report exact p values rather than ‘p > .05’ and the like, because the latter reporting practice subscribes to the dichotomous thinking that a result is either statistically significant or not (Larson-Hall & Plonsky, 2015; Sun et al., 2010). Dichotomous thinking compromises ‘the richness of information’ (Sun et al., 2010, p. 1001) in p: the smaller the p value, the less probable it is that a sample at least as extreme as the one observed would arise under the null hypothesis. Furthermore, this thinking can result in underestimating the importance of ESs (Larson-Hall & Plonsky, 2015). The p value is also easily influenced by sample size: it is common to see large (statistically non-significant) p values associated with small samples and small (statistically significant) p values with large samples. For instance, in Merino and Lasagabaster (2015), the difference in reading comprehension between twenty-four CLIL students and thirty-two non-CLIL students was not statistically significant (p > .05, see their Table 8); this could simply be attributable to the small sample size, in addition to the reasons listed by the authors. Conversely, in Várkuti’s (2010) study involving 816 CLIL students and 631 non-CLIL students, it is not surprising that all the reported p values were ‘p = 0.000’, as the sample size was rather large. Since p is highly dependent on sample size, sounder interpretation needs to rest more on ESs and less on ps.
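The sample-size dependence of p can be demonstrated in a few lines of code. In the sketch below, the standardized mean difference is held fixed at a hypothetical d = 0.4 while only the group size n varies:

```python
from scipy import stats

# Illustration: the same standardized mean difference (d = 0.4) evaluated at
# different sample sizes. Only n changes; the underlying effect does not.
# The numbers are hypothetical and unrelated to the studies discussed above.
d = 0.4
for n in (25, 50, 100, 400):
    # Two groups whose means differ by d pooled SDs (both SDs set to 1).
    t, p = stats.ttest_ind_from_stats(0.0, 1.0, n, d, 1.0, n)
    print(f"n = {n:>3} per group: p = {p:.4f}")
# Output: p falls from about .16 at n = 25 to below .0001 at n = 400,
# although the effect size is identical throughout.
```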
The second suggestion for reducing reliance upon p is that the word ‘statistically’ should be used to modify ‘(non-)significant(ly)’. ‘[T]he tendency to drop the word statistically and use significant difference instead of statistically significant difference in research reports’ can reflect the most common misconception, namely that ‘statistically significant means important’ (Nassaji, 2012, p. 95). Some scholars have even proposed that ‘the term significance should be removed from the statistics vocabulary’ (Nassaji, 2012, p. 96). For instance, Kline (2004) (also cited in Larson-Hall, 2016) suggested using ‘statistical’ to denote a ‘statistically significant’ result and reserving ‘significant’ for its original meaning of ‘important’. There are often cases in which researchers mistake ‘(non-)significant’ for ‘practically (un)important’ (e.g., Nieto Moreno de Diezmas, 2018; Nieto Moreno de Diezmas & Matthew, 2019; Prieto-Arranz et al., 2015; Várkuti, 2010). For example, Nieto Moreno de Diezmas (2018) reported that CLIL participants’ overall score for digital competence and their scores on most of its sub-learning standards were ‘significantly higher’ (p. 81) ‘since p = .000’ (p. 79), and concluded that ‘the CLIL programme was more productive (than the mainstream programme) for learning digital competence’ (p. 81), a claim about practical importance that p alone cannot support.
Conclusion and implications
From the perspective of methodological rigour, this article has revealed several problems in existing research on CLIL outcomes and proposed suggestions for future CLIL research. Owing to inadequate use of ESs and/or over-reliance upon p, some research outcomes (e.g., differences between CLIL and non-CLIL students) have unfortunately been misinterpreted. As CLIL continues to burgeon across Europe and beyond, more methodologically rigorous research needs to be carried out in order to paint a more holistic and reliable picture of the effectiveness of CLIL.
This article also offers some practical suggestions for EAP practitioners, who are well placed to contribute to improving research practices. EAP practitioners often serve as trainers (e.g., EAP teachers teaching publication skills; see Li & Cargill, 2021) of (novice) researchers aspiring to publish their work in peer-reviewed publications. Given the importance of ESs in quantitative research discussed above, EAP practitioners can engage their postgraduate students and trainees (e.g., the above-mentioned researchers) in making fuller use of ESs, so as to enhance those students’ and trainees’ awareness of methodological rigour and improve their reporting practices. Improved reporting practices will in turn increase their chances of publishing quantitative research in peer-reviewed outlets. Furthermore, explicit guidelines for reporting and interpreting ESs can be included in EAP teaching materials. The development of such materials may take time, however; in their absence, EAP practitioners should consider emphasizing ESs when teaching their postgraduate students and novice researchers.