Class Size, Test Scores and Earnings: Are Smaller Class Sizes Good Public Policy?

This debate over class size raises some important questions. For instance, what is the impact of class size on the educational outcomes of Australian children? How do class size reductions compare with other educational reforms in terms of cost effectiveness? It turns out we know very little about how class size influences student learning in Australia, so in this post I’ll take a look at what some of the better research from the US has to say about the effectiveness of reductions in class size.

The main challenge researchers face in trying to figure out whether small class sizes have a positive impact on educational outcomes is that the children we observe in small classes may not be representative of all children. Recent work by Louise Watson and Chris Ryan suggests that the student-teacher ratio (which, as it turns out, is a different thing to class size) in Independent schools has fallen relative to Government schools over the last 20 years in Australia. They also show that the socio-economic status of Government school students has declined over this period relative to children in Independent schools. If children enrolled at well funded Independent schools have parents with greater means and/or a greater willingness to invest in their children’s education inside the home, we may find these children have better educational outcomes for reasons that have nothing to do with class size. The same might be true if the higher socio-economic status of a child’s classmates conferred some positive peer effect on their learning, or if Independent schools had other characteristics (higher quality teachers, better facilities) that improved child outcomes independent of class size.

This wouldn’t necessarily be a problem if educational researchers had data on all of these influences and were able to use statistical methods to control for them. The trouble is, this usually isn’t the case. The only way we can untangle the causal effect of class size is to rely on policy experiments such as those conducted in the US. To date no such experimental evidence exists for Australia.

How can policy makers learn about the impact of class size on student learning?

One of the most cited studies on the effectiveness of class size is Krueger (1999). Krueger used data from a policy experiment branded Project STAR (the Student/Teacher Achievement Ratio experiment), conducted between 1985 and 1989 in the US state of Tennessee. Project STAR involved the random assignment of 11,600 students in 80 public schools to three different class types:

Small classes: 13-17 students per teacher
Regular classes: 22-25 students per teacher
Regular/aide classes: 22-25 students per teacher plus a full-time teacher’s aide

The experiment did not cover all public schools: a school had to be large enough to accommodate one of each of these class types, but eligible schools were spread across urban and regional areas. Children were randomly assigned in Kindergarten (the year before Year 1 in Australia), with the experiment continuing until third grade. Most of the students entered the study in 1985 when they started Kindergarten; an additional 2,200 students joined in 1st grade, since Kindergarten was not mandatory in Tennessee. Students who repeated a grade left the study. Teachers in participating schools were also randomly assigned a class type.

STAR wasn’t a perfect experiment. Obviously running an experiment in the real world is going to be a touch more difficult than running one in a lab. Parents who complained could get their children re-randomised between the two types of regular classes (p. 520) and some children with behavioural problems were re-assigned to small classes. There is also the possibility that students who left the study were not representative of the student population. We might also be concerned that some parents moved their children to private schools upon learning of their child’s initial assignment to a regular class.

That being said, the number of students who transitioned between class types was not overly large (p. 507), and it doesn’t appear that many parents moved their children out of the experiment in response to their child’s initial class assignment. Krueger was able to obtain enrollment forms for 18 of the 80 schools, from which we learn that 10.4% of children assigned to small classes left the experiment, compared to 14.3% of those assigned to regular classes and 12.2% of those assigned to regular classes with an aide (p. 516). The percentage of children in small classes who were held back at some point between Kindergarten and 3rd grade was somewhat lower at 19.8%, compared to 27.4% of children in regular classes (p. 505). Overall, different types of students were evenly spread across the three class types (p. 504).

Student learning was measured using the Stanford Achievement Test (SAT), a test of reading, word recognition and math. To aid comparisons between children in small and regular classes, children in regular classes were assigned their SAT score percentile, while children in small classes were assigned the regular-class percentile that matched their raw SAT score. A summary measure was constructed by taking the average of the reading, word recognition and math percentiles.
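As a rough sketch of how this mapping works (the function, scores and scale below are hypothetical illustrations, not Krueger’s actual procedure or data):

```python
import numpy as np

def regular_class_percentiles(regular_scores, small_scores):
    """Assign every child the percentile their raw score would earn
    in the regular-class score distribution.

    Illustrative sketch only -- the raw scores fed in are made up,
    not actual Stanford Achievement Test data.
    """
    regular = np.sort(np.asarray(regular_scores, dtype=float))
    n = regular.size

    def pct(score):
        # Share of regular-class scores at or below this raw score.
        return 100.0 * np.searchsorted(regular, score, side="right") / n

    return [pct(s) for s in regular_scores], [pct(s) for s in small_scores]
```

A small-class child whose raw score matches, say, the regular-class median is assigned the 50th percentile, so both groups end up measured against a single (regular-class) distribution.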

Is there a causal effect of class size (in Tennessee)?

Focusing on Krueger’s results that control for student, teacher and school characteristics, the average impact of a small class, and of a regular class with an aide, on average SAT percentile scores relative to regular non-aide classes in each grade is:

                    K      1      2      3
Small               5.37   7.4    5.79   5
Regular with aide   0.31   1.78   1.58  -0.75

Source: Krueger 1999 – Column 5 of Table V, pp. 512-13.

The above results use observed class type rather than assigned class type, but this appears to make little difference to the results (compare columns 1-4 with columns 5-8). As we would expect, if class type were approximately random then including student, teacher and class characteristics in the modelling should make little difference to the estimated class size effect (compare columns 2 and 4, 5 and 8), and this is what Krueger finds.

As Krueger shows in Table III, there was a little overlap in actual class size among the three class types. The average small class had 15.7 students, compared to 22.7 for the regular non-aide classes and 23.4 for the aide classes. For this reason he also presents results that model the exact class size each student found themselves in. After adjusting appropriately, the results are very similar to the above (Table VIII, p. 518).
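My understanding is that the adjustment works by using the initial random assignment as an instrument for actual class size (two-stage least squares). A toy sketch of the idea, where every number, variable name and effect size is invented for illustration and nothing is taken from the STAR data:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Instrument: random assignment to a small class (1) or regular class (0).
assigned_small = rng.integers(0, 2, size=n)

# Actual class size drifts around the assigned size (some non-compliance).
noise = rng.normal(0, 1, size=n)
class_size = 22 - 7 * assigned_small + noise

# Outcome: a made-up true effect of -0.7 percentiles per extra student,
# with an error term correlated with class size so plain OLS would be biased.
score = 80 - 0.7 * class_size + 2 * noise + rng.normal(0, 1, size=n)

# Stage 1: predict class size from assignment.
X1 = np.column_stack([np.ones(n), assigned_small])
fitted_size = X1 @ np.linalg.lstsq(X1, class_size, rcond=None)[0]

# Stage 2: regress the score on the predicted class size; the slope
# recovers the causal class-size effect despite the endogeneity.
X2 = np.column_stack([np.ones(n), fitted_size])
beta = np.linalg.lstsq(X2, score, rcond=None)[0]
print(round(beta[1], 2))  # close to the invented true effect of -0.7
```

Because assignment was random, it shifts class size without being related to anything else that drives scores, which is what makes the second-stage slope interpretable as causal.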

Krueger also examines whether the benefits of smaller class size accumulate over the course of the experiment or are confined to the initial impact of finding oneself in a small class. This is interesting because many students allocated to small classes entered school in 1st grade rather than Kindergarten, so not all students experienced the same number of years in their assigned class type.

His preferred estimates (Table IX, column 3) put the initial impact of small class assignment at just under a 3 percentile increase compared to a regular class, with an additional 0.65 percentile increase for each year in a small class. These results are not, however, statistically significant. For non-stats-geeks:

“When the same students are tracked over time…students in small classes [gain] about one percentile rank per year relative to students in regular classes…students appear to benefit particularly from attending a small class the first year they attend one, whether that is Kindergarten, first, second, or third grade…” pp. 523-4
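Taking those point estimates at face value (despite their imprecision), the implied cumulative gain is simple arithmetic — the numbers below are the approximate point estimates quoted above, nothing more:

```python
def small_class_effect(years_in_small_class, initial=3.0, per_year=0.65):
    """Cumulative percentile gain implied by the point estimates:
    an initial bump on first entering a small class, plus a per-year
    gain. Both defaults are rough point estimates, not precise values."""
    return initial + per_year * years_in_small_class

# A child in small classes from Kindergarten through 3rd grade (4 years):
print(round(small_class_effect(4), 2))  # 5.6
```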

Are some students impacted differently?

The numbers above are Krueger’s best attempts to estimate the impact of small class size on the average student. Obviously not all students are average, and it may be that larger class sizes are more adverse for certain groups of students. Krueger finds that boys, students from low-income families (as measured by eligibility for free school lunches), students of African-American background and students from the inner city all receive greater test score increases from small classes. There is also some evidence that small class sizes matter more for reading than for math (Table XI, p. 528).

Boys        4.18   Girls           1.28
Free lunch  3.14   No free lunch   2.85
Black       3.84   White           2.58
Inner-City  3.74   Metropolitan    3.09

Source: Krueger 1999 – Table X, p. 525.

Is the effect of class size large or small?

This is a difficult question to answer. While there is a vast literature suggesting that performance on cognitive tests predicts later life outcomes, it is difficult to know how a 3-6 percentile increase in SAT percentiles, on average, translates into outcomes that matter, such as school completion, university participation and labour productivity. Cognitive tests are not an end in themselves; they are merely one indicator of the private and social returns to investments in education. In Krueger’s words…

“Is the impact of attending a small class big or small? Unfortunately, it is unclear how percentile scores on these tests map into tangible outcomes. Nevertheless, a couple of comparisons are informative…one could compare the estimated class-size effects with the effects of other student characteristics. For example, in kindergarten the impact of being assigned to a small class is about 64 percent as large as the white-black test score gap, and in third grade it is 82 percent as large. By both metrics, the magnitudes are sizable.” p. 514

Recent work by Chetty and a cast of co-authors has sought to shed some light on the private returns to small class size. Chetty et al. (2011) (NBER working paper version) match STAR students to their tax records between 1999 and 2007, when the students were aged between 19 and 27. They are able to match 95% of STAR students, and from these tax records they calculate each student’s average earnings between 2005 and 2007. Where parents listed STAR children as dependents in their tax returns, the authors are also able to obtain average adjusted gross household income for 1996-98. This provides a measure of the financial resources of the household in which the child lived when they were aged 16-18, an important variable missing from the original STAR data (this has to be better than free-lunch eligibility). The authors are able to match 86% of children to their parents. Somewhat surprisingly, they can also infer college attendance from the tax data: because tuition payments attract tax credits, “Title IV” tertiary education institutions in the US have to report tuition payments and scholarships received by all students.

As in the Krueger paper, the authors are careful to make sure that pre-labour-market student characteristics, and those of their parents, are not associated with assignment to a small class (p. 11; see also column 2 of Table II). This matters because the original STAR data didn’t contain any information on parental characteristics, so these checks add to the credibility of STAR as a pretty good approximation to a true experiment.

They also make sure that students who were matched are no different from those who were not, at least based on what we observe of the students. After controlling for which school students attended, there appears to be very little difference in match rates between small and regular classes. While schools in disadvantaged areas might produce children who are less likely to work and/or file tax returns, it doesn’t appear that these children were more or less likely to be allocated to a small class within schools (p. 12; columns 1 and 2 of Table III).

What were the “real-world” outcomes of small class size for students in the STAR project?

Chetty et al. use the same statistical method that Krueger used to form the estimates in my earlier table. They find a similar impact of small class size on test scores to that found by Krueger (for the initial year of assignment to a small class), even after controlling for parental characteristics that Krueger could not. While Chetty et al.’s results suggest that students assigned to small classes have a 2% greater probability of entering university by age 20, this result is not statistically significant at the 5% level of significance. This is stats-geek talk for: we cannot be confident that small class sizes have any impact on university participation. The results for wages are even less certain. In the authors’ own words:

“the average student assigned to a small class spent 2.27 years in a small class, while those assigned to a large class spent 0.13 years in a small class. On average, large classes had 22.6 students while small classes had 15.1 students. Hence, the impacts on adult outcomes below should be interpreted as effects of attending a class that is 33% smaller for 2.14 years [p. 16 of Working Paper version]…With controls for demographic characteristics, the point estimate of the earnings impact becomes -$124 [per year] (with a standard error of $336). Though the point estimate is negative, the upper bound of the 95% confidence interval is an earnings gain of $535 (3.4%) per year.” p. 18 of Working Paper version.
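The quoted upper bound is just the standard 95% confidence interval arithmetic (point estimate plus 1.96 standard errors), which we can check:

```python
point_estimate = -124.0   # estimated earnings impact, dollars per year
standard_error = 336.0

# 95% confidence interval: point estimate +/- 1.96 standard errors.
upper = point_estimate + 1.96 * standard_error
lower = point_estimate - 1.96 * standard_error

print(round(upper))  # 535 -- matches the quoted upper bound
print(round(lower))  # -783
```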

Put simply, they are unable to find any convincing evidence of an impact of small class sizes on wage income later in life.

What are the policy implications of the findings from the STAR project for Australia?

As I suggested earlier we don’t know for certain that these findings would be replicated if we had run a STAR-like randomised policy trial in Australia.

It might be that class size has an impact on outcomes unrelated to cognitive test scores, such as self-confidence, interpersonal skills and other personality traits (non-cognitive skills) that the Nobel Laureate James Heckman and his co-authors find are important. While these may have their own benefits, it doesn’t appear that any such benefits are reflected in early labour market earnings. This is not to say there are no social returns to small class sizes, but it does cast some doubt on whether smaller class sizes would pay for themselves in taxes collected later in life. The case for smaller class sizes would be strengthened if research could show a causal link with quite costly adverse outcomes such as crime, or risky behaviours such as drug use.

The main implication of the US research for Australia is that there is a need for a randomised policy experiment similar to STAR. In addition to randomly assigning children to different class sizes, such a trial could also randomly assign students to teachers shown to have different abilities to increase NAPLAN scores, one potential measure of teacher quality. Such a policy trial would give State and Commonwealth Governments a better idea of which of class size and teacher quality (albeit narrowly defined) yields the greater return for a given level of the other. This is the sort of information required to ensure that the education dollar is spent in the most efficient way for the greatest effect. It would also provide an evidence base that would strengthen the political position of a Government attempting to implement such reforms in the future.

This policy trial should be configured to provide precise estimates of the impact on children from disadvantaged backgrounds (Indigenous students, students with disabilities and students from low socio-economic backgrounds). As Krueger’s results indicate, it may be that children who enter school with lower levels of parental investment require a different combination of teacher quality and class size to students from more advantaged backgrounds.

Given the long lead time between implementing the trial and being able to assess later life outcomes such as university participation and earnings, some may find my view that Governments should fund such policy experiments politically naive. I would argue that this highly politicised debate has gone on in this country for decades and has occurred in an evidence vacuum. Most State and Federal Governments get at least two terms in office. Six years is the time it takes for a child to go from Preparatory/Kindergarten/Reception/Pre-primary/Transition to Year 5, or from Year 6 to school completion. Once the data is in the public domain it’s there forever. Were an incoming government to terminate the experiment, there would still be the potential to link the data to tax records and learn about the long-term outcomes of the programme in the same way that Chetty et al. did with STAR. Building consensus for controversial reforms might take more time than most Governments have, but the benefits of good policy, once implemented, are reaped in perpetuity.

To conclude, it is widely accepted that class size is not the be-all and end-all of increasing student performance. The political issue is that class size is a quantifiable and tangible policy lever that avoids the complex questions a policy maker confronts in defining and measuring something like teacher quality. In my opinion, Australian students and the Australian economy would be best served by teachers, parents, politicians and, dare I say, students coming to a consensus about what teacher quality means and how best to measure it. Such a consensus, combined with a randomised policy trial that sheds some light on the optimal combination of resources for different schools, could unlock the private and social benefits of expenditure on education in Australia.


Chetty, R., Friedman, J. N., Hilger, N., Saez, E., Schanzenbach, D. W. and Yagan, D. (2011). How Does Your Kindergarten Classroom Affect Your Earnings? Evidence from Project STAR. The Quarterly Journal of Economics, Vol. 126, No. 4, pp. 1593-1660.

Krueger, A. B. (1999). Experimental Estimates of Education Production Functions. The Quarterly Journal of Economics, Vol. 114, No. 2, pp. 497-532.

Watson, L. and Ryan, C. (2010). Choosers and Losers: The Impact of Government Subsidies on Australian Secondary Schools. Australian Journal of Education, Vol. 54, No. 1, pp. 86-107.

Further reading

Angrist, J. D. and Lavy, V. (1999). Using Maimonides’ Rule to Estimate the Effect of Class Size on Scholastic Achievement. The Quarterly Journal of Economics, Vol. 114, No. 2, pp. 533-575.

6 thoughts on “Class Size, Test Scores and Earnings: Are Smaller Class Sizes Good Public Policy?”

    • I think Judith makes a good point. It’s not as if the increases in education funding in recent times have brought with them better student outcomes, so it’s probably time we looked closely at the productivity of the sector. My view of the research is that there is an impact of class size, but the impact is fairly modest once you get below the mid 20s. At that point you probably need to look at teacher quality and principal autonomy if you want to improve outcomes. Perhaps there is an argument for increased funding, but we need to know that any increase will be directed to policies that are proven to work. Unfortunately we don’t have what I would call a comprehensive evidence base for policy formulation in Australia.

  1. A few questions:
    How are these percentile scores assigned re SAT?
    Do you think the discussion should focus perhaps on raising ATAR scores to get into Teaching degrees at uni? They currently accept students with some of the lowest ATARs (Deakin’s acceptance is at 55%).
    Under your paragraph “Are some students impacted differently?” you talk about how the effects might be different for different types of students and single out African-American backgrounds and low-income status. In creating the estimates of impact on low SES, was ethnic background controlled for, and vice versa?
    How and at what points does re-randomisation take place?
    Don’t you think your recommendation re basing teacher’s performance on a test, say NAPLAN, would create perverse incentives?
    And, this is trivial, but WHY SO LONG?

    • Regarding percentile scores, my understanding is that Krueger took the sub-sample of normal-class-size students and calculated percentiles from the raw scores. I have no idea what the units of the SAT test were, but say a raw score of 45 placed a child in the 85th percentile and a score of 50 placed a child in the 90th percentile. He takes this mapping of raw scores to percentiles for the normal-class kids and assigns to each small-class kid the percentile that corresponds to that child’s raw score.
      The reason he does this is that the purpose of the score is to compare normal-class kids and small-class kids, so there are two points to note. First, test scores (e.g. a score of 45 or 50) don’t have a natural interpretation; converting them to percentiles allows us to make relative (but not absolute) comparisons between individual students or groups of students. If the impact of small class size were a five point increase, we wouldn’t really know what that means. Knowing that this 5 point increase moves a child from the 85th to the 90th percentile has a clearer interpretation: it moves you 5 percentiles up the SAT score distribution. That being said, it’s still not clear whether a 5 percentile increase is “large” or “small”.
      Secondly, the reason the small-class kids are assigned the percentile associated with the raw score of a normal-class kid is that these are the groups we want to compare. If Krueger included all of the children in calculating this mapping, the fact that the small-class kids got better results would wash out the very impact he’s attempting to estimate. The estimates tell us how far up the normal-class test score distribution a small-class child is moved by virtue of being in a small class, on average, all else equal. This mapping would not be necessary if Krueger were only interested in raw scores, but like I said, the impact on the raw score is difficult to interpret.
      With regard to ATARs, I think this debate kinda misses the point. ATARs are determined by the supply of places in a course and students’ demand for those courses. The reason ATARs are low must be either an oversupply of places or a lack of demand for teaching degrees. Artificially increasing the ATAR, which I can only assume means restricting places, will do little to increase the average quality of applicants for teaching degrees.
      Regarding the heterogeneity of the small-class impact, these estimates would have used what are called “interaction terms”. These variables estimate the impact of being, for example, black and in a small class, controlling for the independent effect of being black and the independent effect of being in a small class. If you don’t control for the independent, or main, effects you can’t really interpret these interaction terms.
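For the curious, a toy sketch of what an interaction term looks like in a regression design matrix — all data, coefficients and variable names below are invented for illustration, not Krueger’s:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

small = rng.integers(0, 2, size=n)   # indicator: in a small class?
black = rng.integers(0, 2, size=n)   # indicator for the subgroup

# Invented true model: two main effects plus an extra interaction effect
# for black students who are also in small classes.
score = (50 + 2.6 * small - 5.0 * black
         + 1.2 * small * black + rng.normal(0, 5, size=n))

# Design matrix: intercept, both main effects, and the interaction column.
X = np.column_stack([np.ones(n), small, black, small * black])
coef = np.linalg.lstsq(X, score, rcond=None)[0]

# coef[3] estimates the extra small-class effect for black students,
# over and above the two main effects (close to the invented 1.2 here).
```

Dropping the main-effect columns would force the interaction coefficient to absorb them, which is why it can’t be interpreted on its own.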
      I can only assume that re-randomisation of students whose parents complained occurred at the point where the parents complained. I don’t know for certain, but I assume most parents would complain when their child first entered the experiment, which would have been Kindergarten or first grade.
      I’ll leave my views on NAPLAN and perverse incentives for a later post on teacher quality.

  2. I’m a pre-school teacher and I agree with Gab. I don’t think NAPLAN accurately measures what my kids can do. They learn through play-based learning but don’t test well on NAPLAN because of its narrow scope. Different types of pedagogy aren’t usually seen in NAPLAN results. And whilst we may not be directly funded by NAPLAN (yet), we are indirectly affected by the choices parents make as a result of the MySchool website.
    Principal autonomy is definitely not a good idea, as it puts a lot of faith in the integrity of principals. Most of them are much older and quite averse to the new ideas my colleagues and I bring to the school. I face this at my school.
    I have a few problem children in my class who could do with extra hours of instruction, but because of the relatively large class size I cannot give them the attention they sorely deserve. Small class sizes, especially in the formative years, are of extreme importance. Whilst the data you cited, from a country quite different to ours with a different educational system, may not show this, I can tell you from my own experience that if I had a smaller class to teach, I would be better able to teach them what they need to know.

    • Thank you for your comments. I don’t think there’s been enough participation in this debate from teachers. Most of the opining has been from politicians, wonks and the AEU. You raise a good point about play-based learning. When you’re talking about children below Year 3, which is when NAPLAN starts, I would imagine it’s difficult to construct psychometric instruments to measure learning but then I can’t claim to be an authority on cognitive testing. This presents a problem in using test scores to assess the performance of teachers who have chosen to specialise in the early years of school. I’ll return to these issues in a later post on teacher quality.
      The counterfactual for principal autonomy is centralised bureaucracy. I would have thought principals are in a better position to assess which teachers are a good fit for their school than a public servant in a state education authority. That being said, your comment has made me ponder why the debate has focused on principal autonomy and teacher quality without any focus on principal quality. If principals are going to be given greater autonomy, it would be good if there were an evidence base to suggest that this improves student outcomes.
      Your entirely legitimate concerns regarding your students who could use more one-on-one time sound more like an argument for funding qualified teachers’ aides. This is an important aspect of the debate that has been largely neglected. The class size debate is really a student-teacher ratio debate, so you’d think that teachers’ aides would feature more prominently. It makes me think that teachers’ aides should be included in my likely-never-to-be-realised randomised policy trial.
      I’ve been pretty up front about stating that Krueger’s results, which do support a class size effect, are not necessarily generalisable to Australia. This is why I think we need our own Project STAR. I’m inclined to think that the results may not be markedly different but I also think it’s difficult to gain political support for reforms based on data from a policy experiment conducted in one state in the US a quarter of a century ago. When I listen to the commentary coming out of the AEU and the Grattan Institute I begin to feel like I’m in the minority in that respect.
