Visible Learning by John Hattie (2009) is well-described by its subtitle: “A synthesis of over 800 meta-analyses relating to achievement”. It is an attempt to summarise a huge amount of educational research about what works and what doesn’t into a single volume. This post is my review/analysis of the book.
The concept of the book sounded excellent: a meta-meta-analysis. To unpack that: a single study might look at some students and the effect that a particular change (e.g. requiring students to have an iPad) had on their achievement. But a single study can be biased by having a small sample, or various confounds (e.g. students’ familiarity with iPads). A meta-analysis looks across many studies of a particular change, to try to get a more reliable sense of the effect. Hattie’s Visible Learning aimed to be a meta-meta-analysis: a look at a huge array of meta-analyses to get a good picture of what works in education and what doesn’t.
The effort is massive, and commendable. However, I believe that the methodology is frequently flawed. Hattie’s book relies on averaging and comparing a statistical measure called “effect size” across meta-analyses. I am not a statistics expert, but my best understanding — explained, and supported with examples below — is that Hattie repeatedly averages and compares effect sizes where it is inappropriate to do so. Because of this, I am not confident in the conclusions that the book draws. Read on for more details of what effect sizes are, how you should handle them, and the problems with how Hattie processes them.
Visible Learning is centred around a statistical measure called “effect size”, and so are my criticisms. There are several measures of effect size — Hattie uses one of the most popular: Cohen’s d. Simply, it’s a measure of how much two groups differ on a particular measurement. It’s defined as:

d = (mean of group 2 − mean of group 1) / standard deviation
You can read this as a measure of how many standard deviations the mean was shifted. For example, imagine you have a class learning French, and you take them to France on a two week trip. Beforehand, their average score on a French test is 5.5 out of 10; after the trip it is 7 out of 10; the standard deviation is 2 marks in both cases. This corresponds to an effect size of 0.75 ((7 − 5.5) / 2 = 0.75): the students’ score has increased by three quarters of a standard deviation.
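The arithmetic of that example, as a minimal Python sketch (the figures are the illustrative ones from the text, not real data):

```python
def cohens_d(mean_after, mean_before, sd):
    """Cohen's d: the difference in means, in units of the standard deviation."""
    return (mean_after - mean_before) / sd

# The French-trip example: mean 5.5/10 before, 7/10 after, SD of 2 in both cases.
print(cohens_d(7.0, 5.5, 2.0))  # 0.75
```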
Inappropriate averaging and comparison
A meta-analysis is a way to rigorously and quantitatively summarise a bunch of research studies, usually by way of effect sizes. Typically, you find a set of studies measuring a similar outcome, with a similar manipulation, calculate their effect sizes, and average them to get a mean effect size. Crucially, there is an onus on the authors not to just average every study they find. Here’s Coe (2002) discussing effect sizes in meta-analyses (emphasis mine):
One final caveat should be made here about the danger of combining incommensurable results. Given two (or more) numbers, one can always calculate an average. However, if they are effect sizes from experiments that differ significantly in terms of the outcome measures used, then the result may be totally meaningless. It can be very tempting, once effect sizes have been calculated, to treat them as all the same and lose sight of their origins… In comparing (or combining) effect sizes, one should therefore consider carefully whether they relate to the same outcomes… one should also consider whether those outcome measures are derived from the same (or sufficiently similar) instruments and the same (or sufficiently similar) populations.
These are important points, and from reading Visible Learning I am not confident that Hattie has followed this advice. My criticisms in the rest of the review relate to inappropriate averaging and comparison of effect sizes across quite different studies and interventions. This is not a minor issue: the whole book is built on the idea that we can look across these meta-analyses to find the most effective interventions by comparing average effect sizes.
Averaging measures of different outcomes
One oft-debated question is: does gender matter to achievement? Hattie concludes that gender makes no difference: “The average effect size across 2745 effect sizes is 0.12 (in favour of males).” But what studies have been used to make the average? One of the gender papers is “How large are cognitive gender differences?” by Hyde (1981). It has two findings from its meta-analysis: girls have higher verbal ability with effect size 0.24 (so boys “ahead” by -0.24), but boys have higher quantitative ability with effect size 0.43. Hattie puts these values as -0.24 and 0.43 into his effect size average — if it were these two alone we would get an effect size of 0.1, and agree with Hattie that there is no gender difference. But the effect sizes come from tests of different abilities; averaging them loses vital detail. The correct thing to do (as Hyde did) is to calculate two separate effect sizes, one for each test.
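The averaging step is worth spelling out, because it shows exactly how two real effects cancel into an apparent null result:

```python
# Hyde (1981): girls ahead on verbal ability (so -0.24 "in favour of males"),
# boys ahead on quantitative ability (0.43 in favour of males).
verbal, quantitative = -0.24, 0.43

# Averaging collapses two real, opposite-direction effects into "no difference":
average = (verbal + quantitative) / 2
print(average)  # ~0.095
```

The ~0.1 average is not evidence of no gender effect; it is an artefact of averaging effects that point in opposite directions on different tests.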
Invalid effect size comparisons
Hattie sets out at the start of the book to decide which effect sizes are worth paying attention to. He chooses 0.4 as his benchmark value for two main reasons. The first reason is that 0.4 is the average effect size of all the meta-analyses included in the book, so being above this threshold means the effects are in the top half of those covered in the book. That relies on the assumption that all the effect sizes can be compared to each other, which is problematic, as we will see below. (And Hattie does make the assumption: the book ends with a giant table ranking all the effect sizes against each other.)
The second reason that Hattie uses 0.4 as his benchmark is that the average improvement year-on-year among students is an effect size of 0.2 to 0.4. Hattie argues that an effect size of, say, 0.1 would be a decrease on the normal academic progression, and we should aim to beat the 0.4 mark. Which might be fine for studies that are one year long and measure improvement in students across all subjects during one school year.
But let’s take the summer school statistics as an outcome. The difference between students before and after a summer school has an effect size of 0.23. If we are directly comparing effect sizes then we are left with the message that summer schools produce a similar gain in achievement to the rest of the year’s schooling. Really? If this logic were correct, we could surely shut the schools for nine months, send the students to a single summer school each year, and not reduce their academic achievement! This example shows that Hattie’s assumption that we can compare effect sizes across all these different studies produces some nonsensical implications. Clearly, you can’t compare the effect size for a summer school to a full year’s schooling to decide if they are worthwhile.
In this case, I believe that a large part of the problem here is that the measure of academic achievement between school years is based on a widespread battery of tests, whereas the summer-school result tends to be based on a fairly specialised outcome, which is therefore easier to make gains on. In fact, looking at the Cooper et al. (2000) summer-school analyses that make up two of the three summer school meta-analyses in Visible Learning, one of them measures primarily reading and writing, while the other measures a mix of things including maths, but also self-esteem and attitudes to achievement. Here, I think the fault lies more in the original meta-analysis (combining different measures), but Hattie’s result is still suspect if it is based on bad data.
As another example, Hattie discusses the effect on achievement of the 3-month summer break in school in the USA: an effect size of -0.09 on student achievement. Hattie states that “Compared to all other effects, these are minor indeed.” This doesn’t seem to tally. If the effect size for a standard year’s schooling (above) is 0.2 to 0.4, then doesn’t this -0.09 effect size roughly mean that students are regressing between 25% and 50% of their school year’s achievement gains during the summer break? This doesn’t seem minor at all! Either Hattie dismisses this problem too easily, or the problem is that we can’t compare the effect size for the school year to the regression over the summer.
Comparing X/Y tests to pre-/post-tests
Another problem is that Hattie is mixing and matching all sorts of different studies. Some compare the pre-scores of students to post-scores: for example, before and after a summer school. Other studies compare students in treatment X (e.g. selective schools) to treatment Y (e.g. mixed ability schools). We would have different expectations of an effect in each case.
As in the diagram above, if we looked at the pre-/post- effect of a year of home schooling, we would be expecting at least 0.4 (since a normal year of schooling produces an effect size of 0.4). But if we compare students who have been home schooled against students who attended state schools, an effect size of 0.1 may well be worth it. But Hattie does not distinguish between these two origins of effect sizes when he ranks the effect sizes of different studies against each other. He is again comparing apples and oranges — a criticism covered in more detail here.
Increasing effect size by averaging split samples
Let’s return to the summer school example, and those two meta-analyses: one concerns remedial summer schools and one concerns “acceleration” summer schools for the gifted. This restriction of the samples is important — I want to explain the principle here, even though I’ve already pointed out that the latter meta-analysis measured something other than achievement.
If you look at an intervention for all students, they might have a standard deviation of test scores of, say, 15 percentage points. With a mean improvement of 5 percentage points on the test, you’ll get an effect size of 0.33 (5/15). If you look just at your most gifted students, they are likely to have a smaller standard deviation of test scores, say 5 percentage points, because they’re all at the top end with similar abilities. Now, a mean improvement of 5 percentage points produces an effect size of 1 (5/5). Similarly, a remedial summer school for your lowest achieving students will see them tightly clustered at the low end, so a standard deviation of, say, 10 percentage points with an improvement of 5 would score an effect size of 0.5 (5/10). So by using meta-analyses which have split samples, the effect sizes have gone up from 0.33 to an average of 0.75. But, as with the aforementioned gender paper, no consideration is given to this by Hattie: the two summer school meta-analyses get averaged together, into what is likely to be an inflated effect size. This is not a problem with the original meta-analyses per se, it is a problem with Hattie’s inconsiderate averaging of their results.
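The inflation from splitting the sample can be seen directly by running the illustrative numbers above:

```python
def effect_size(gain, sd):
    # Cohen's d for a pre/post gain: mean improvement over standard deviation.
    return gain / sd

# The same 5-percentage-point gain, with the spreads used in the text:
all_students = effect_size(5, 15)   # ~0.33: full ability range, SD 15
gifted       = effect_size(5, 5)    # 1.0: narrow top slice, SD 5
remedial     = effect_size(5, 10)   # 0.5: narrow bottom slice, SD 10

# Averaging the two restricted-sample results inflates the apparent effect:
print((gifted + remedial) / 2)      # 0.75, versus ~0.33 for the unrestricted sample
```

Nothing about the intervention changed between these scenarios; only the spread of the sample did, and the headline effect size more than doubled.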
Effect sizes from different measures
The section “self-reported grades” reports an effect size of 1.44 — the highest effect size in the book. Worth paying attention to — but it took me a while to decipher what, precisely, the effect was referring to. I looked at one cited paper, “The Validity of Self-Reported Grade Point Averages, Class Ranks, and Test Scores: A Meta-Analysis and Review of the Literature” by Kuncel, Credé and Thomas (2005), which was listed as an effect size of 3.10 (huge!). Kuncel et al. don’t use effect sizes directly; they measure correlations between a student’s actual grade and the self-reported grade. There is a conversion from correlation to effect size, but this is not a measure of the difference between two groups (like most of the rest of the book), it is a measure of matching. So the highest effect size in the book is not some miraculous intervention, it’s a measure of people’s honesty (who would have thought!), yet Hattie ranks it in a table alongside things like the effects of summer schools and the difference between single- and mixed-sex schooling. Once again, the comparison makes no sense.
Effect sizes are vulnerable to the changes in standard deviation which occur in schools
Effect size is the mean difference divided by the standard deviation. Therefore, a change in standard deviation will affect the effect size. One complication in education is that standard deviation of ability tends to increase as you look at older students. As David Weston points out, effect sizes will therefore tend to get smaller as you look at later student years. So interventions at a very early age will appear to have a much larger effect (in terms of effect size) than interventions at a late age. It is not a problem specific to Hattie, but it yet again calls into question the practice of averaging and comparing effect sizes across interventions which may involve different age groups.
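The mechanism is purely arithmetic; with some illustrative numbers (not from the book), the same raw gain shrinks as the spread widens:

```python
def effect_size(gain, sd):
    # Cohen's d: mean improvement over standard deviation.
    return gain / sd

# The same 5-point gain looks like a smaller effect among older students,
# simply because their scores are more spread out (illustrative SDs):
print(effect_size(5, 10))  # 0.5  (younger students, SD 10)
print(effect_size(5, 18))  # ~0.28 (older students, SD 18)
```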
The noticeable error
There is additionally a noticeable error that crops up early in the book, on page 9. To help in understanding effect sizes more easily, Hattie added common language effect sizes (CLE) to the book. To re-use his example: to summarise the height advantage of adult males over adult females, the effect size is 2.0, but the CLE re-expresses this to say that there is a 92% chance that a randomly picked adult male is taller than a randomly picked adult female. Nice idea.
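For two normal distributions with equal variance, the CLE is Φ(d/√2), where Φ is the standard normal CDF. Since it is a probability, it must lie between 0 and 1 for any d. A minimal sketch that reproduces the height example:

```python
import math

def cle(d):
    # Common language effect size: the probability that a random member of
    # the higher-scoring group outscores a random member of the other group,
    # assuming both groups are normal with equal variance.
    # Phi(d / sqrt(2)) written via the error function: 0.5 * (1 + erf(d / 2)).
    return 0.5 * (1 + math.erf(d / 2))

print(cle(2.0))  # ~0.92: matches the male/female height example
```

No input to this function can produce a value below 0 or above 1, which is why tables of CLEs outside that range are an immediate red flag.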
Unfortunately, that example is the only CLE in the book that is correct. All CLEs in tables throughout the book are wrong. CLE is a probability, yet Hattie’s tables contain CLEs below 0% and above 100%. This post spotted the same problem and traces the cause to the algorithm Hattie used to calculate the CLE. I’m a programmer, and I can see how this might happen when processing a large data set automatically, but there are also several CLEs described in the text that are wrong (pages 9, 42 and 220), which is less forgivable. It’s more of a side issue than the problems discussed above, but at the time it wasn’t an inspiring start to the book.
The book is difficult to read from cover to cover, but I could see that it would be useful as a reference. You want to know about a particular intervention, say mixed-sex versus single-sex schools, and you flick to the appropriate section to see the conclusion: no major effect. You look up tracking (aka setting or streaming by ability): no major effect. However, given all the methodological issues above, I do not feel that I can trust any of the averaged effect size results in this book without digging further into the original meta-analyses to check that they have been combined appropriately.
These are not side issues: they are core problems with the approach of the book. The averaging of effect sizes across meta-analyses (and then comparing these averages) is the key technique by which Hattie judges what works and what doesn’t, and thus forms his narrative about what is important in education. For example, on page 243, Hattie compares the average effect sizes for the “teacher as activator” techniques he has analysed against those for “teacher as facilitator”. On the basis that the former are higher, he concludes “These results show that active and guided instruction is much more effective than unguided, facilitative instruction”. He’s not necessarily wrong, but if we cannot trust the average effect sizes he gives as evidence, and cannot sensibly compare them, we cannot draw that conclusion from this data. In which case, the book is not much use as an argument or a summary of the data, just as an impressive catalogue of the original meta-analyses.
Please do add comments if you disagree with any of this. As I say, I am not a statistics expert, but I’ve tried to explain my logic above so that you can follow my thinking for yourself.