Exam grade process 2020: eight areas for enquiry
Campaign news policy / August 13, 2020
The fiasco of the 2020 exam results has only just begun. As soon as Scotland’s results came out, it was clear that a review of UK exam processes is needed.
Ofqual’s guide to the 2020 AS and A level results in England, published today, includes a technical report with further details of the standardisation model.
GCSE results will not be published until August 20.
We agree with the Royal Statistical Society’s call for an investigation, and that the question must be addressed: “If the underlying process is flawed then possibly the whole thing is flawed.”
Ofqual, speaking on the 2020 results, has said it will not accept appeals where a school disagrees with the standardisation model. Nobody expects this Ofqual imposition. It must allow independent scrutiny not only of this year’s process, but of all routine modelling.
Whilst comparable outcomes may matter to support the perceived integrity of the qualifications system, it is vital that individual students’ lives and future prospects are not unfairly harmed or limited in any way, whether due to the exceptional circumstances this year or to any part of the system design at any time.
In summary: these outcomes were expected. A key question that must be asked is who was responsible for deciding that the known effects of standardisation, unfair in this particular way to these particular students rather than in another way to others, were acceptable. We want to know whether this explanation was given to Ministers, who was told that this was the expected ‘best model’ and what its outcomes would be, and who made the decision that this would be the chosen model.
It is vital that this process is understood as a human one: accountability does not sit within an algorithm, but with people in paid positions with responsibility to serve the public.
Only after that should the process itself be taken apart and examined: its construction, its built-in bad data and bad design, and the decisions that went into it and delivered bad outcomes.
This needs to be addressed as soon as possible: to bring clarity on results, appeals and ‘resits’ for this year’s candidates; for planning the 2021 exams; and to assess the shared questions over what to do about the accountability and exam systems, where cohort data for 2020 will not exist for future data modelling. As we suggested in our 2019 proposals [download .pdf 2.2MB], such an audit of algorithmic accountability and its effects in practice is needed across the public sector, particularly in education, where outcomes may affect children for the rest of their lives.
At that point, what kinds of questions should any enquiry into the standardisation system ask? We suggest eight areas to start with.
- Data model inputs: Are the repurposed data feeds used in the standardisation process suitable as input data at all?
- Discrimination in the input data: Assess historic cohort data for accuracy, fairness and discrimination.
- Aims: Are the purposes of the standardisation models with a focus on system integrity in conflict with individual candidates’ best interests?
- Outcomes: Does the system that standardisation is designed to protect work for the people it is designed to serve, especially where a third will fail by design?
- Automation bias: How does any automated part of the system account for reasonable human change, which it views as an unacceptable anomaly?
- Discrimination in the output data: Assess the output of the standardisation for accuracy, fairness and discrimination.
- Fairness by design: Who benefits from the system and who loses out? Are child rights, equality and ethics assessment working? Those who lose out need systemic change.
- Explainability: What tools are available to explain exam results to students? Where results are not simply based on a pre-determined ‘X marks equals Y grade’ mapping, but subject to standardisation, a standard printable report should be offered.
Historic cohort data
The process this year looks different without exam marks, but the correlation, explained variability and strength of the predictive historical profiles are factors used to set grade boundaries and ranges every year in England. This year the cohort data from 2018 and 2019, and the relationship of value-added progress measures, carry weight that should be examined not as a stand-alone model for 2020, but as part of what happens every year.
The assumption is that pupil performance across a cohort is similar to the year before.
The matching between the 2020 cohort’s group-level prior attainment (the mean GCSE score for A level; SATs for GCSE), the expected grade distribution at Centre level (an exam Centre being the school or similar organisation that submitted the candidates), and the rank order across all candidates appears to have carried undue weight, with lower weighting given to the centre assessment grades (CAGs), the teacher’s judgement of the mark each child was expected to get. In very small groups, the CAGs were used rather than historic data. In a normal year, somewhat simplified, the exam marks provide the rank order, but standardisation of grade boundaries and range happens anyway.
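The mechanics described above can be sketched in code. This is a deliberately simplified illustration, not Ofqual’s actual model: the real weightings, thresholds and data are not public in this form, and the cut-off for “very small groups” and the fixed grade ladder below are our assumptions for illustration only.

```python
# Illustrative sketch only: slot a centre's rank-ordered 2020 candidates into
# the grade distribution the centre achieved historically, ignoring CAGs,
# unless the cohort is too small, in which case CAGs are used directly.

SMALL_COHORT_THRESHOLD = 5  # assumed cut-off; the real rules were more graduated

def standardise(rank_order, cags, historic_distribution):
    """rank_order: candidate ids, best first.
    cags: candidate id -> teacher-assessed grade (the CAG).
    historic_distribution: grade -> proportion of the centre's past
    cohorts awarded that grade (proportions sum to 1)."""
    if len(rank_order) < SMALL_COHORT_THRESHOLD:
        # Small cohort: fall back to the teacher CAGs.
        return {cand: cags[cand] for cand in rank_order}

    n = len(rank_order)
    grades = {}
    i = 0
    # Walk the grades from the top down, filling each grade's historic
    # share of places with the next candidates in rank order.
    for grade in ["A*", "A", "B", "C", "D", "E", "U"]:
        share = historic_distribution.get(grade, 0.0)
        count = round(share * n)
        for cand in rank_order[i:i + count]:
            grades[cand] = grade
        i += count
    # Any candidates left over by rounding take the lowest grade.
    for cand in rank_order[i:]:
        grades[cand] = "U"
    return grades
```

Note what the sketch makes visible: once the cohort is large enough, a candidate’s grade depends only on their rank position and the centre’s past results; the CAG drops out of the calculation entirely.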
For GCSEs this standardisation usually includes use of cohort data from children’s Standard Assessment Tests (SATs), taken at age 10 or 11. These play an opaque role in GCSE standardisation, and one must ask whether it is suitable to repurpose these tests for this at all. The use of SATs scores to support system integrity at GCSE is inextricably linked to the role of SATs in the accountability system all through secondary school. Yet families are told not to prep children for SATs. They are told SATs are not a measure of attainment and do not matter. The Schools Minister, as recently as May 27th, told the Education Select Committee that SATs are an accountability measure for the school system, and denied that they have any significant effect on young people for the rest of their lives. But they do. SATs scores are used to profile state-educated children with progress measures, predetermine a child’s educational path, and predict GCSE grades, as well as being used in the results standardisation. Here’s an Ofqual 2017 video of how it is done. Some families buy pre-SATs tutoring. If some children score better, it may be affluence, not ability, that weights the prediction of their GCSE results. Privately educated children do not sit them at all, so what is the effect on results weighting of their missing data?
Most importantly, even if Ofqual was adequately conscious of bias and discrimination in its own modelling this year and made adjustments, it appears never to take any account of the bias and error in the historical data that feed its models. Garbage in, garbage out. Key Stage 2 SATs scores are embedded garbage. As an input, they are not fit for this repurposing.
Purposes of the data standardisation
Secondly, an enquiry must look at whether the purposes of the standardisation models are in conflict. Ofqual’s key objective is to prevent grade inflation; in other words, to protect the system. The National Reference Test is also used, among other data, for this. But what if the system you are trying to protect is not working for the people it is designed to serve? Does it deliver the best outcomes for children?
The exam system in England fails 1 in 3 young people *by design* at GCSE: grades are set not on ability alone, but on a child’s position in the peer-group range, compared with others’ historic data.
Ofqual assumes that shifts in cohort data are rare, and this year chose to discount what it saw as unreliable human inputs (the teacher CAGs) while failing to see its historic data inputs as equally unreliable. The question here is how the whole system of standardisation, including the model’s design and its technical choices, causes issues for particular groups, and to what extent it detrimentally affects individual pupils, particularly those whose results wildly overrode teacher assessment (CAGs).
Thirdly, the weighting of human inputs in the model needs to be assessed; in normal years this means your exam marks, and this year the CAGs and rankings, compared with what teachers were told to expect. School staff are rightly livid about the A-level results: the ‘Centre Assessment Grade’ (CAG) they suggested for each candidate has been largely ignored, with the head of ASCL, Geoff Barton, saying “they might as well not have bothered.” Separately and in addition, staff were asked to provide the rank order of candidates. At GCSE there is more nuance, such as which tier of paper you were entered for [Foundation or Higher], but this is the gist we think matters most.
The student rank order, drawn up by school staff, has been fundamental to A-level grades, but secondary to a school’s historic distribution of grades. It is this ranking, and the predicted range of the Centre based on past data, rather than the merit of the individual or the CAGs, that plays a significant role in ruling out outliers and in flattening grades better than those achieved in the past back into the usual grade distribution range.
Ofqual, in its FAQs for all exams, exposes the problem that may result from basing expectations on a previous cohort’s attainment. Where a school has genuinely seen significant change and improvement across the board, the wrong data could be used; in the example Ofqual gives, an all-boys school becomes co-educational. Such a school can appeal, but not one that simply disagrees with the model.
This example, however, is also true of normal years. Say a new permanent specialist teacher joins a school in a post previously covered by stand-in supply teachers: the whole class might be expected to do better than in previous years. Yet the results may not be allowed to show it, because of the system’s constraints.
This absurdity is exposed if one takes the Toby Young thinking on improving intelligence through a smart pill, or selection at birth, to its conclusion. It does not matter if every child in your year group is cleverer than the cohort of the year before. Your group in Year C will be slotted into the expected spread of grades from Years A and B; only some of you can get a top grade, and a third will fail. It does not matter if every single child in the bottom third of Year C was cleverer than the top third of Years A and B.
Your result is not a measure of your ability. It is simply a reflection of your attainment on a given day, and in ranked relation to others.
Pupils cannot do much better unless an equal and opposite number of pupils do less well. The system appears to fail to account for human change, viewing it as an unacceptable anomaly. Social mobility in exams is a myth.
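The zero-sum point above can be demonstrated in a few lines. This is a toy model with invented marks, not Ofqual’s method: it simply shows that under any purely rank-based grading against a fixed distribution, a uniformly cleverer cohort receives exactly the same grades as a weaker one.

```python
# Toy demonstration: with rank-based grading against a fixed historic
# distribution, only relative position matters. All marks are invented.

def grade_by_rank(marks, distribution):
    """Assign grades from a fixed distribution (grade -> share) to
    candidates ranked by mark, best first."""
    ranked = sorted(marks, key=marks.get, reverse=True)
    grades, i = {}, 0
    for grade, share in distribution.items():
        count = round(share * len(ranked))
        for cand in ranked[i:i + count]:
            grades[cand] = grade
        i += count
    for cand in ranked[i:]:  # rounding leftovers take the lowest grade
        grades[cand] = list(distribution)[-1]
    return grades

fixed = {"A": 0.25, "B": 0.25, "C": 0.25, "U": 0.25}  # fixed grade shares
year_b = {"p1": 40, "p2": 50, "p3": 60, "p4": 70}
# Every pupil in Year C outscores Year B's best pupil:
year_c = {"p1": 80, "p2": 90, "p3": 95, "p4": 99}
# Yet both cohorts receive exactly the same spread of grades,
# and the lowest-ranked pupil fails in both.
```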
Some teachers this year are concerned with a particular aspect of discrimination: that pupils in deprived areas have been disproportionately moved downwards, as happened in Scotland. This was entirely predictable, and it is a puzzle that the standardisation process did not seek to address it before now, not least because the small cohorts for which teacher CAGs were accepted are more common in fee-paying schools. The same problems of bias exist in the system in England every year, and in the accountability system.
Performance rankings of schools are based on results at primary level, on how the children do in a week of tests after seven years at school; and in secondary schools on how much progress the children make between those tests and GCSEs compared with the average amount of progress made by all children. As Geoff Barton of ASCL explains:
“Schools with lots of disadvantaged pupils often do less well in these tables because – not surprisingly – disadvantaged children have challenges that affluent children don’t have. So really, such measures end up being a reflection of the demographic make-up of the area that the school is in.”
Both the exams system and accountability system are unethical and flawed by design.
Regulators must accept every opportunity this year to expose systemic flaws and discrimination embedded in the exam and school accountability systems’ design. Long-established data-driven models in England’s education system must be opened up to solve their problems and demonstrate algorithmic fairness.
Otherwise the public and professional perception will remain that the system has become more important to protect than the young people it is supposed to serve.
The failures this year do not just need tweaking for acceptability in 2020, for those who take exams this autumn whether for the first time or as ‘resits’, or even to understand the implications for 2021. The system needs to be taken apart and redesigned.
Who benefits from an unethical system broken by design? Those who lose out need this systemic change.
And is it made understandable to students? It should be, and not in theoretical diagrams or one-size-fits-all flyers.
In June we proposed to Ofqual that it construct a printable tool that explains how your exam result was reached. We think this would be useful every year, not only in this year without exam marks: an A4 report template for candidates, which schools could download and feed with their pupil-level data at each stage: ‘here are your exam marks, here is the mean GCSE score, here are the SATs scores, and here are the decision-making points and data flows that moderated your grade’. It seems not only a practical help for schools trying to explain results to children, but a necessity at a time of growing awareness of the roles of bias, including ableism, racism and deprivation, in algorithmic discrimination, and of the need to restore data justice and public trust.
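A tool of the kind proposed above could be as simple as a template filled with pupil-level data. The sketch below is entirely hypothetical: the field names, wording and data flows are our illustrative assumptions, since the real inputs would be for Ofqual and schools to define.

```python
# Hypothetical sketch of a per-candidate explanation report: format
# pupil-level inputs into a plain, printable A4-style summary.

REPORT_TEMPLATE = """\
Grade explanation for {name}
---------------------------------
Centre assessment grade (CAG):   {cag}
Rank within subject cohort:      {rank} of {cohort_size}
Centre's historic grade range:   {historic_range}
Prior attainment data used:      {prior_attainment}
Final standardised grade:        {final_grade}

Your final grade was set by your rank position within your centre's
expected grade distribution, not by your CAG alone."""

def render_report(pupil):
    """pupil: a dict supplying the fields named in the template above."""
    return REPORT_TEMPLATE.format(**pupil)
```

The point of such a report is not the formatting but the disclosure: each line names a data flow that moderated the grade, so a student can see which inputs were theirs and which belonged to previous cohorts.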
Of course, Ofqual should have seen all this coming, even between a rock and a hard place this year. After all, as their Chair himself co-authored in 2016,
- Scotland (SQA) methodology, impact assessments and data. [link]
- Ofqual’s guide to the 2020 AS and A level results in England. [archived link]
- Ofqual: Proposed changes to the assessment of GCSEs, AS and A levels in 2021. [link]
- HEPI blog, August 10, 2020. [link]
Last updated: August 25, 2020