# Book Review: Visible Learning

Visible Learning by John Hattie (2009) is well-described by its subtitle: “A synthesis of over 800 meta-analyses relating to achievement”. It is an attempt to summarise a huge amount of educational research about what works and what doesn’t into a single volume. This post is my review/analysis of the book.

The concept of the book sounded excellent: a meta-meta-analysis. To unpack that: a single study might look at some students and the effect that a particular change (e.g. requiring students to have an iPad) had on their achievement. But a single study can be biased by having a small sample, or various confounds (e.g. students’ familiarity with iPads). A meta-analysis looks across many studies of a particular change, to try to get a more reliable sense of the effect. Hattie’s Visible Learning aimed to be a meta-meta-analysis: a look at a huge array of meta-analyses to get a good picture of what works in education and what doesn’t.

The effort is massive, and commendable. However: I believe that the methodology is frequently flawed. Hattie’s book relies on averaging and comparing a statistical measure called “effect size” across meta-analyses. I am not a statistics expert, but my best understanding — explained, and supported with examples below — is that Hattie repeatedly averages and compares effect sizes where it is inappropriate to do so. Because of this, I am not confident in the conclusions that the book draws. Read on for more details of what effect sizes are, how you should handle them, and the problems with how Hattie processes them.

### Effect Sizes

Visible Learning is centred around a statistical measure called “effect size”, and so are my criticisms. There are several measures of effect size — Hattie uses one of the most popular: Cohen’s d. Simply, it’s a measure of how much two groups differ on a particular measurement. It’s defined as:

$\displaystyle d = \frac{\text{The difference in means}}{\text{The standard deviation}}$

You can read this as a measure of how many standard deviations the mean was shifted. For example, imagine you have a class learning French, and you take them to France on a two week trip. Beforehand, their average score on a French test is 5.5 out of 10, after the trip it’s 7 out of 10, and the standard deviation is 2 marks in both cases. This corresponds to an effect size of 0.75 ($(7-5.5)/2$): the students’ score has increased by three quarters of a standard deviation.

### Inappropriate averaging and comparison

A meta-analysis is a way to rigorously and quantitatively summarise a bunch of research studies, usually by way of effect sizes. Typically, you find a set of studies measuring a similar outcome, with a similar manipulation, find their effect sizes and average all the effect sizes to get a mean effect size. Crucially, there is an onus on the authors to not just average every study they find. Here’s Coe (2002) discussing effect sizes in meta-analyses (emphasis mine):

One final caveat should be made here about the danger of combining incommensurable results. Given two (or more) numbers, one can always calculate an average. However, if they are effect sizes from experiments that differ significantly in terms of the outcome measures used, then the result may be totally meaningless. It can be very tempting, once effect sizes have been calculated, to treat them as all the same and lose sight of their origins… In comparing (or combining) effect sizes, one should therefore consider carefully whether they relate to the same outcomes… one should also consider whether those outcome measures are derived from the same (or sufficiently similar) instruments and the same (or sufficiently similar) populations.

These are important points, and from reading Visible Learning I am not confident that Hattie has followed this advice. My criticisms in the rest of the review relate to inappropriate averaging and comparison of effect sizes across quite different studies and interventions. This is not a minor issue: the whole book is built on the idea that we can look across these meta-analyses to find the most effective interventions by comparing average effect sizes.

### Averaging measures of different outcomes

One oft-debated question is: does gender matter to achievement? Hattie concludes that gender makes no difference: “The average effect size across 2745 effect sizes is 0.12 (in favour of males).” But what studies have been used to make the average? One of the gender papers is “How large are cognitive gender differences?” by Hyde (1981). It has two findings from its meta-analysis: girls have higher verbal ability with effect size 0.24 (so boys “ahead” by -0.24), but boys have higher quantitative ability with effect size 0.43. Hattie puts these values as -0.24 and 0.43 into his effect size average — if it was these two alone we would get 0.1 effect size, and agree with Hattie that there is no gender difference. But the effect sizes are tests of different ability; averaging them loses vital detail. The correct thing to do (as Hyde did) is to calculate two separate effect sizes, one for each test.

### Invalid effect size comparisons

Hattie sets out at the start of the book to decide which effect sizes are worth paying attention to. He chooses 0.4 as his benchmark value for two main reasons. The first reason is that 0.4 is the average effect size of all the meta-analyses included in the book, so being above this threshold means the effects are in the top half of those covered in the book. That relies on the assumption that all the effect sizes can be compared to each other, which is problematic, as we will see below. (And Hattie does make the assumption: the book ends with a giant table ranking all the effect sizes against each other.)

The second reason that Hattie uses 0.4 as his benchmark is that the average improvement year-on-year among students is an effect size of 0.2 to 0.4. Hattie argues that an effect size of, say, 0.1 would be a decrease on the normal academic progression, and we should aim to beat the 0.4 mark. Which might be fine, for studies that are one year long and measure improvement in students across all subjects during one school year.

But let’s take the summer school statistics as an outcome. The difference between students before and after a summer school has an effect size of 0.23. If we are directly comparing effect sizes then we are left with the message that summer schools produce a similar gain in achievement to the rest of the year’s schooling. Really? If this logic were correct, we could surely shut the schools for nine months, send the students to a single summer school each year, and not reduce their academic achievement! This example shows that Hattie’s assumption that we can compare effect sizes across all these different studies produces some nonsensical implications. Clearly, you can’t compare the effect size for a summer school to a full year’s schooling to decide if they are worthwhile.

In this case, I believe that a large part of the problem there is that the measure of academic achievement between school years is based on a widespread battery of tests, but this summer schools result tends to be based on a fairly specialised outcome, which is therefore easier to make gains on. In fact, looking at the Cooper et al. (2000) summer-school analyses that make up two of the three summer school meta-analyses in Visible Learning, one of them measures primarily reading and writing, while the other measures a mix of things including maths, but also self-esteem and attitudes to achievement. Here, I think the fault lies more in the original meta-analysis (combining different measures), but Hattie’s result is still suspect if it is based on bad data.

As another example, Hattie discusses the effect on achievement of the 3-month summer break in school in the USA: an effect size of -0.09 on student achievement. Hattie states that “Compared to all other effects, these are minor indeed.” This doesn’t seem to tally. If the effect size for a standard year’s schooling (above) is 0.2 to 0.4, then doesn’t this -0.09 effect size roughly mean that students are regressing between 25% and 50% of their school year’s achievement gains during the summer break? This doesn’t seem minor at all! Either Hattie dismisses this problem too easily, or the problem is that we can’t compare the effect size for the school year to the regression over the summer.

### Comparing X/Y tests to pre-/post-tests

Another problem is that Hattie is mixing and matching all sorts of different studies. Some compare the pre-scores of students to post-scores: for example, before and after a summer school. Other studies compare students in treatment X (e.g. selective schools) to treatment Y (e.g. mixed ability schools). We would have different expectations of an effect in each case.

In pre-/post- tests (over time, horizontally) you would expect to see a decent effect size: 0.4 for the year as Hattie advises. But in X/Y tests where you are comparing two different groups at the end of a period (vertically, above), looking for 0.4 because it is the effect of a year’s schooling doesn’t make sense.

As in the diagram above, if we looked at the pre-/post- effect of a year of home schooling, we would be expecting at least 0.4 (since a normal year of schooling produces an effect size of 0.4). But if we compare students who have been home schooled against students who attended state schools, an effect size of 0.1 may well be worth it. But Hattie does not distinguish between these two origins of effect sizes when he ranks the effect sizes of different studies against each other. He is again comparing apples and oranges — a criticism covered in more detail here.

### Increasing effect size by averaging split samples

Let’s return to the summer school example, and those two meta-analyses: one concerns remedial summer schools and one concerns “acceleration” summer schools for the gifted. This restriction of the samples is important — I want to explain the principle here, even though I’ve already pointed out that the latter meta-analysis measured something other than achievement.

If you look at an intervention for all students, they might have a standard deviation of test scores of, say, 15 percentage points. With a mean improvement of 5 percentage points on the test, you’ll get an effect size of 0.33 (5/15). If you look just at your most gifted students, they are likely to have a smaller standard deviation of test scores, say 5 percentage points, because they’re all at the top end with similar abilities. Now, a mean improvement of 5 percentage points produces an effect size of 1 (5/5). Similarly, a remedial summer school for your lowest achieving students will see them tightly clustered at the low end, so a standard deviation of, say, 10 percentage points with an improvement of 5 would score an effect size of 0.5 (5/10). So by using meta-analyses which have split samples, the effect sizes have gone up from 0.33 to an average of 0.75. But, as with the aforementioned gender paper, no consideration is given to this by Hattie: the two summer school meta-analyses get averaged together, into what is likely to be an inflated effect size. This is not a problem with the original meta-analyses per se, it is a problem with Hattie’s inconsiderate averaging of their results.

### Effect sizes from different measures

The section “self-reported grades” reports an effect size of 1.44 — the highest effect size in the book. Worth paying attention to — but it took me a while to decipher what, precisely, the effect was referring to. I looked at one cited paper, “The Validity of Self-Reported Grade Point Averages, Class Ranks, and Test Scores: A Meta-Analysis and Review of the Literature” by Kuncel, Credé and Thomas (2005), which was listed as an effect size of 3.10 (huge!). Kuncel et al don’t use effect sizes directly; they are measuring correlations between a student’s actual grade and the self-reported grade. There is a conversion from correlation to effect size, but this is not a measure of the difference between two groups (like most of the rest of the book), it is a measure of matching. So the highest effect size in the book is not some miraculous intervention, it’s a measure of people’s honesty (who would have thought!), but Hattie ranks this in a table with things like effects of summer schools and difference between single- and mixed-sex schooling. Once again, the comparison makes no sense.

### Effect sizes are vulnerable to the changes in standard deviation which occur in schools

Effect size is the mean difference divided by the standard deviation. Therefore, a change in standard deviation will affect the effect size. One complication in education is that standard deviation of ability tends to increase as you look at older students. As David Weston points out, effect sizes will therefore tend to get smaller as you look at later student years. So interventions at a very early age will appear to have a much larger effect (in terms of effect size) than interventions at a late age. It is not a problem specific to Hattie, but it yet again calls into question the practice of averaging and comparing effect sizes across interventions which may involve different age groups.

### The noticeable error

There is additionally a noticeable error that crops up early in the book, on page 9. To help in understanding effect sizes more easily, Hattie added common language effect sizes (CLE) to the book. To re-use his example: to summarise the height advantage of adult males over adult females, the effect size is 2.0, but the CLE re-expresses this to say that there is a 92% chance that a randomly picked adult male is taller than a randomly picked adult female. Nice idea.

Unfortunately, that example is the only CLE in the book that is correct. All CLEs in tables throughout the book are wrong. CLE is a probability; Hattie’s tables have CLEs that are below 0% or higher than 100%. This post spotted the same problem and says the cause lay in Hattie’s algorithm that was calculating the CLE. I’m a programmer, I can see how this might happen when processing a large data set automatically, but there are also several CLEs described in text that are wrong (pages 9, 42 and 220), which is less forgiveable. It’s more of a side issue than all the problems discussed above, but at the time it wasn’t an inspiring start to the book.

### Summary

The book is difficult to read from cover to cover, but I could see that it would be useful as a reference. You want to know about a particular intervention, say mixed-sex versus single-sex schools, and you flick to the appropriate section to see the conclusion: no major effect. You look up tracking (aka setting or streaming by ability): no major effect. However, given all above the methodological issues, I do not feel that I can trust any of the averaged effect size results in this book without digging further into the original meta analyses to check that they have been combined appropriately.

These are not side issues: they are core problems with the approach of the book. The averaging of effect sizes across meta analyses (and then comparing these averages) is the key technique by which Hattie judges what works and what doesn’t, and thus forms his narrative about what is important in education. For example, on page 243, Hattie compares the average effect sizes for the “teacher as activator” techniques he has analysed against those for “teacher as facilitator”. On the basis that the former are higher, he concludes “These results show that active and guided instruction is much more effective than unguided, facilitative instruction”. He’s not necessarily wrong, but if we cannot trust the average effect sizes he gives as evidence, and cannot sensibly compare them, we cannot make that conclusion from this data. In which case, the book is not much use as an argument or a useful summary of the data, just as an impressive catalogue of the original meta-analyses.

Please do add comments if you disagree with any of this. As I say, I am not a statistics expert, but I’ve tried to explain my logic above so that you can follow my thinking for yourself.

Filed under Uncategorized

### 12 Responses to Book Review: Visible Learning

1. Reblogged this on The Echo Chamber.

2. I would just like to add a couple of points that have bothered me. I have by no means completed a detailed analysis in the way that you have but these might potentially add something.

1. As Old Andrew has pointed out, there is an oddity the tracking data. Those programs where high ability students are extracted into a group that follows a different curriculum are not included. I don’t really understand Hattie’s logic for that.

2. I wonder about effect size and experimental design. For example, in the classic approach, one group gets a particular intervention e.g. problem-based learning and the other gets a control. However, these are not, and cannot be, blinded. So, the teacher and students may suffer from all sorts of placebo / Hawthorn effects; they expect the intervention to have an effect and it does. So far, so normal. This is one of the reasons for setting the bar at 0.4. But then these studies are compared with studies such as those on the worked example effect. In worked example studies, human mediation is limited; students are either given worked examples to study or problems to solve. They may have views about this, but it’s not as obvious what is ‘supposed’ to happen. If well designed, the enthusiasms of a teacher will have a minimal effect. Yet, we are comparing the former studies with the latter. I suspect the effect of worked examples is far more significant than this comparison and a d=0.57 effect size allows.

Thanks for an excellent post.

Harry

3. The key concern you raise seems to be: “My criticisms in the rest of the review relate to inappropriate averaging and comparison of effect sizes across quite different studies and interventions.”

You have not pointed out that Hattie addresses this issue himself on page 10: “A common criticism is that [meta-analysis] combines ‘apples with oranges’ and such combining of many seemingly disparate studies is fraught with difficulties. It is the case, however, that in the study of fruit nothing else is sensible.”

I agree that the question of comparability needs to be considered carefully when making interpretations and applying these findings to the complexities of the classroom; but so does Hattie, on the same page: ” . . . classrooms are places where complexities abound and all participants try constantly to interpret, engage or disengage, and make meaning out of this variegated landscape.” In other words, these findings are a starting point, not a definitive conclusion. See, for example, the discussion of meta-analyses of whole language approaches to reading on p.137: Hattie considers the phonics components of three “whole language” studies and what happens to the results if these phonics-linked programmes are removed. (Not good if you are a whole-language proponent.) Another example is where he considers “learning styles” meta-analyses on p.195. You cannot dismiss Hattie’s work as unreliable or lacking in interpretation or balance. Hattie claims to have read all the meta-analyses and many of the primary studies, so his consideration of the quality of studies and “how they vary across the factors we conceive as important” is something we should not dismiss lightly. This is not a superficial exercise, though it is a very large one.

Your comment about age groups and influence on effect sizes is worth bearing in mind when interpreting the results; as teachers we ought to be considered and analytical, and your reservations are appropriate. For example, Piagetian programmes with an effect size of 1.28? What does that mean? You are quite correct that we need to look much closer. That, however, is a different matter from the data itself. This study is a summary of data, not a prescription for school actions.

While it is sensible to be cautious about interpreting the results of a meta-meta-analysis, it is unfair to Hattie to suggest he expects otherwise. He has always been aware of the need for rigour, and that his work will be closely scrutinised by academics around the world. He is explicit that the results are averages, overviews, historical, and requisite of interpretation, especially when considering classroom practices. That said, he also argues the case robustly in his opening chapters for the course he has chosen. I think your piece would have been stronger had it considered his arguments.

I do think you make a very useful point about the duration of interventions when considering effect size. Hattie makes a similar point when he suggests that interventions with quite small effect sizes may be well worth implementing if they are efficient uses of resources. That is exactly how schools should look at this. Holding students back for a year, for example, or watching TV, both have very low (negative) effect sizes. Not holding students back costs nothing; however, one also has an array of possibilities as to what other interventions might be more useful. Discouraging TV in favour of homework has to be sensible. Neither of these options is expensive, but they could have sizeable benefits. On a more costly note, phonics is clearly more effective than whole language, but it may cost more – in the short term. In the long term the gains should easily pay the school back with more progress, less failure and less last-minute intervention. The data needs to be interpreted, weighed and employed judiciously to make sound strategic decisions for the benefit of students. So:

It can be disappointing for teachers that something we hoped was authoritative turns out not to be; but that is the world of research. Everything is up for re-examination. However, (in my humble opinion) nothing else comes close to the breadth and considered interpretation of Visible Learning when summing up the vast and seemingly chaotic body of educational research that has been published over the last 30 – 40 years.

• Hi Horatio,

Thanks for your long and thoughtful comment. Hattie does mention some of the issues early in the book, and you’re right to pick me up on this. But I am not sure that he deals with them adequately. In particular, in chapter 2 he does not discuss the problem of what the averages mean (and whether comparison is valid) if the studies are measuring different outcomes. He says that effect sizes can measure school achievement, but taking the Cooper et al paper as an example, that was measuring self-esteem and attitudes to school, not achievement. For how many other papers is this also true? For example, several meta-analyses in table 6.2 are not measuring achievement. I don’t think Hattie has properly covered the implications of averaging these with achievement.

And again, even if they both measure achievement, if one measures achievement across a full-curriculum test (e.g. SATs) but another measures a specific reading test, the standard deviation of the latter will be lower, so the effect size will be inflated. That’s not a problem while we are only comparing different reading interventions to each other, but we can’t compare a reading intervention to a curriculum-wide intervention. For example, in table 11.1, whole language (a reading intervention) is combined with smaller class sizes (curriculum-wide intervention). My worry is that a reader who does not appreciate some of the subtleties about where the effect sizes can and can’t be averaged and compared is encouraged to take the wrong message from the data.

I am happy with many of Hattie’s individual text summaries of the meta-analyses and I do believe that the book is a very useful catalogue of meta-analyses. Most of my review focuses on the bits I am unhappy with, which is the conclusion built on top of comparing those meta analyses, and a tendency to reduce the data to the point where too much detail is lost. I agree with your observation that: the data needs to be interpreted, weighed and employed judiciously to make sound strategic decisions for the benefit of students.

• Bullshifter

I am interested to have ended up here via links from an article mentioning Ollie Orange, in that I asked the following directly to that worthy gentleman, via his blog, and got no response; I am genuinely interested in the issue that I suppose is being looked at, and which receives some attention in these comments and your review: how exactly are we going to measure teacher effectiveness? I am not arguing against anything at this stage, I just would like to know how far off I am with this:

“Thanks for this rather interesting article. I too asked John about these things recently, and I must admit that I am still scratching my head. [...] I wonder about the ES issue in terms of what I understand Hattie to have been saying: was it not that the ES calculation allows us to determine [to what degree? we, naturally, ask] that element of the calculated *size effect* [the differences in the "scores"] that may be attributable to the educational intervention?

“Now we go to calculate the “Effect size” and we find that their test scores are really close together and they have a standard deviation of 0.5%.”

I assume that you have taken the mean of the SDs for each set of scores. Would the value that you were quoting result in a large ES because, with so little variability across the data sets, the attribution to the teaching process undertaken of having made that amount of difference would be indeed rather definite and complete?” [Or could it be that the group has collectively taken a fairly homogeneous collection of inputs that have resulted in skills that the teacher themselves is not particularly responsible for... as per the "gifted students" example.]

Or am I just a complete ignoramus? (Is this the “sigma scores/root n” issue?) If, as has been suggested in some of the critiques, ES is simply not to be used, how do we tease out the effectiveness of the teacher? Somewhere above, there was mention of the effect of Summer School and the range of foci; if we have interactions, it is surely going to get messy. So ditch the lot? Hattie (and Horatio) make(s) many statements, particularly in “Visible Learning for Teachers”, that the data is problematic, and the results/conclusions non-prescriptive. Hattie also points to the research *design* requirements, yet to be fulfilled, and, as I have read it, and having spoken with the man about it, we are therefore left with meta-analyses. So is this really so, so bad that we must pay no heed or not spend any time performing such calculations, etc? I have seen first hand the result of little or no attempt to evaluate teacher/programme effectiveness at the school level, and the use of resources, human and financial, is quite anxiety provoking, if I may put it mildly.

And one thing about Educational Research – how do the researchers go about making it readily available for Teachers? You, EC and the guy from Norway would help us all greatly if you published your research in a manner that nailed down how we conclusively get past the simple pre/post scores method.

Thanks,

B-)

• On the general matter of effect size: I think I agree with Coe’s description that it is a slightly crude, but often necessary, way of collecting different outcomes. Let’s say I have studies from 20 states in the US, each using their own state-specific test of maths ability, but using the same intervention (let’s say, banning vs using calculators). I can’t compare the difference in test scores directly, but if I believe that the tests are measuring roughly the same things, I can use a transformation to effect size to compare the difference in test scores. There are potential issues there: if the tests are too dissimilar, we’ve lost vital information, and if the standard deviations differ for other reasons (e.g. some states allow schools to only enter their best pupils for the test) then this can upset the whole thing.

In a way, the issue is unrelated to the effect size measure itself. If I have one measure that says a film critic gave a film 4 stars (of 5), and another that says 14 of 21 people surveyed would recommend the film to a friend, I can’t directly combine those measures. I can convert them both to a percentage (80% for film critic, 66% for public) but that doesn’t mean I can sensibly compare the two. The problem is not with percentages, the problem is in comparing two things that are not sufficiently similar. Meta analyses can be a sensible tool, but there is still a matter of correct judgement as to what studies can and cannot be combined.

If we want to measure teacher effectiveness for specific teachers, the best option (but still not perfect) would be an attempt to hold all other variables constant (e.g. school, curriculum, method of assessment, incoming pupil ability) and test the learning gains. But there are some variables that are near impossible to constrain. Imagine two teachers, Hardknuckle and Softly. Hardknuckle is effective at disciplining his classes, but slightly weaker at imparting knowledge, whereas Softly is weak on discipline but better on knowledge. Obviously, if you give Softly a class of students that is inclined to be well-behaved, he will look like the better teacher, but with a tougher class Hardknuckle looks better. It’s hard to get enough data for specific teachers to rise above all the other factors. And this problem still strikes looking at a higher-level: if the within-teacher variance (the difference in the same teacher’s performance with different classes) is very high, it is hard to examine the between-teacher variance (the difference between different teachers). Naively, the answer is: more data. But I think you need to assess this sort of thing in a single large study, not by Hattie’s indirect method of comparing between-teacher effect sizes to other effect sizes when that wasn’t what the studies set out to look at.

As for researcher dissemination: academics are generally bad at disseminating their research to the public. But, as I’ve written here: http://academiccomputing.wordpress.com/2013/05/02/accessing-computing-education-research/, I think disseminating individual research studies too strongly can be harmful as a whole, as they each seem to conflict with each other, and the public get very turned off (“Researchers seem to change their minds every five minutes!”). I think where researchers really should push out their results is when they perform meta-analyses and literature surveys, which are more neatly packaged, wider summaries of a research area. These summary studies are probably the best way for teachers to consume education research, especially if they only have time to read part of research.

That’s quite a long comment, I hope it’s gone some way to address your questions.

4. I am very, very happy to finally have some more company in showing that Visible Learning is deeply flawed. It is amazing that more people haven’t written about these issues in the 4 years since the book was published (if I recall correctly, it’s you, me, and a guy from Norway), and that of the three people who have written about them, none of us is, as far as I can tell, working directly in the field of educational research. What does this say about the state of that field?!

As for Horatio’s comment, I would disagree or quibble with nearly every sentence, but I am especially amazed at his assertion that Hattie’s work will be “closely scrutinized by academics around the world.” It has been four years! If this is, as Horatio suggests, the best that Ed research has to offer, then the field is starting at less than zero.

Anyway, thanks for the thoughtful post.

• Hi EC,

I was reading the book, got to page 9 and thought “Hang on, that common language effect size doesn’t make sense. What am I missing?” I googled and found your blog post about it. “I’m not missing anything!” So I was pleased for your original blog post confirming my thoughts.

There’s no guarantee that anyone in education research has read the book. As you say, I’m not in straight education research (I’m heavy on the computing in computing education); I’d never heard of Hattie or the book until I saw it mentioned by some teachers, which prompted me to read it. To pick on one particular issue: by my understanding you cannot sensibly average effect sizes across different outcome measures (e.g. achievement vs self-esteem). But it’s not just Hattie that does it; several of the meta-analyses I saw also seemed to make this mistake.

That does leave one skeptical about education research, but there will of course be good papers and bad in any field. The difficulty is equipping the non-researcher (and, to be honest, a lot of the researchers) to tell the difference between good and bad! Tom Bennett’s “Teacher Proof” book showed that teachers get ludicrously bad research foisted on them, but there may come an unfortunate point where teachers instead get subtly bad research pushed to them, and they do not have the knowledge to understand its limitations. Researchers should certainly be very careful to criticise teachers en masse, when researchers en masse are not necessarily any better.

• You say “there’s no guarantee that anyone in ed research has read the book”, but that’s easily checked, and in fact lots of people in ed research have read the book. There been many reviews of the book and it is the most prominent meta-meta-analysis of education research (a guy named Marzano has also done work in this area, but his work seems to be slightly less famous). And the fact that Hattie seemed to be surprised by the negative probability issue when a norwegian mathemetician wrote about it two years after hattie’s book had come out suggests that ed research is not a field that is able to do its own quality control. Yes, it is great if teachers can learn to distinguish between good and bad research, but that is really not our job, and most of us simply don’t have the time or expertise. A respectable field (like physics or car repair) will be self-policing. Ed research is clearly not doing that, so I feel pretty comfortable criticizing them en masse. It’s not that any given ed researcher is incompetent (I know some lovely people who work in the field), it’s that the field is just not working well enough that we can take even the most apparently authoritative people (John Hattie, Tim Shanahan, etc.) seriously without looking into it ourselves.

In fact, I would say we are already at the “unfortunate point” where we are having bad research pushed on us. It happened on a massive scale in the US with the 2000 National REading Panel report and the associated multi-billion dollar effort (“Reading First”) to put its findings into practice. Eight years later, it was determined that the effort has been a total failure. We continue to get one cockeyed idea after another (students must tread 70% non-fiction, etc.) pushed at us under the cover of this or that totally unconvincing “research.” It is very important that teachers realize that MOST ed. research is very far from definitive, and that in most cases we would be best off ignoring it. I write that with sadness, as I am someone who really thinks science can be tremendously useful, but as I’ve looked into the research over the past couple of years I have been horrified at what I’ve seen–not so much that the research itself has been terrible, but that the researchers have not hesitated to make the wildest logical leaps in their interpretation of what the results mean for classroom practice. Anyway, thanks again for your post, and I hope you write more about this kind of thing.

5. Jen

Came across this via twitter, wanted to comment on the issue some other comments have brought up about ‘education research foistering dodgy approaches on, and harshly criticising, teachers’ – in my experience (a education PhD student who has worked extensively in schools) it is not the researchers THEMSELVES who are doing this, often as this post demonstrates, they are dealing with vastly complicated studies where the knowledge and ‘truth’ to be gained is far from simple or clear – the best of them are therefore rightly tentative in their claims. However, the mass media and policy makers are very often, not so. Particularly in the UK, policy-making bodies will trawl research for support for whatever hair-brained idea the latest minister has dreamed up, snatch that particular piece of research whilst often showing staggering ignorance of its context, the methodological issues, or trends in the field and use it as a stick to beat the teaching professon with.
And so we researchers sit in offices and weep.
This is a much larger issue, in my humble opinion, of the relationship between research, the media and policy-making. There have been various drives in various fields (particularly education) to ‘evidence-based-policy’, yet nearly all the research (ha!) in this area shows we are a long way from this ideal…

I agree wholeheartedly with Tom Bennett’s overall assertion that teachers should be active critics, as well as producers, of education research. There’s a common rhetoric in education research that we should aim for the ideal of medical research in terms of validity and professional autonomy. I’ve got all kinds of epistemological issues on why we can’t treat education interventions the same way we treat the injection of a drug, but I think if researchers and the teaching profession were able to forge a strong, mutually respectful and collaborative relationship, we could command the same (relative) repect and freedom from constant policy tweaking that medical care enjoys.

Hope to see some of you at ResearchEd13 to start making this dream a reality!

• I’d like to think it’s not the researchers themselves who are making dodgy claims, and I do think that journal articles tend to be much more circumspect in their claims, but all too often it IS the researchers themselves, often in articles for the popular press, who make the overstated claims. In the case of John Hattie’s Visible Learning, the errors are not just overstating and making logical leaps, but calculation and comparison errors that make me suspicious of Hattie’s mathematical competence and of his entire project.