Comparability
The regulators of external qualifications produce comparability reports for a range of qualifications based on an evaluation of assessment practices in centres in England, Wales and Northern Ireland. These reports are available to download from this page.
Techniques for monitoring the comparability of examination standards
We have published a state-of-the-art review of techniques for monitoring comparability over the past half century. The book, edited by Paul Newton, Jo-Anne Baird, Harvey Goldstein, Helen Patrick and Peter Tymms, represents the outcome of the review.
A summary of the book
Results from public examinations matter. They form the basis for judging students, teachers, schools, local authorities and the nation as a whole. For these judgements to be fair, standards must be comparable. Although different students may sit different examinations, the same standard should be applied from one examining board to the next, from one year to the next, across tiers of entry, and between examinations in different subject areas.
Numerous techniques have been developed to investigate the examining boards' success, or otherwise, in ensuring comparability. Some of these have been primarily statistical and some have been based on the judgement of experienced examiners. Yet despite more than 50 years of research in the area, there is still no consensus on which technique is the best, or whether one family of techniques is better than any other. In fact, there is not even a clear consensus on what it might actually mean for two examinations to share the same standard when they assess different competencies in different ways, which is often the case when monitoring comparability.
In light of this, we commissioned a state-of-the-art review of techniques for monitoring comparability. The intention was to describe the different methods, to highlight their various strengths and weaknesses, and to consider the progress made in monitoring comparability over the past half century.
Chapter 1: Contextualising the comparability of examination standards - Paul Newton
Chapter 2: A brief history of policies, practices and issues relating to comparability - Kathleen Tattersall
Chapter 3: Awarding examination grades: current processes and their evolution - Colin Robinson
Chapter 4: Alternative conceptions of comparability - Jo-Anne Baird
Chapter 5: The demands of examination syllabuses and question papers - Alastair Pollitt, Ayesha Ahmed and Victoria Crisp
Chapter 6: Cross moderation methods - Robert Adams
Chapter 7: Paired Comparison Methods - Tom Bramley
Chapter 8: Common test methods - Roger Murphy
Chapter 9: Common examinee methods - Robert Coe
Chapter 10: Multilevel modelling methods - Ian Schagen and Dougal Hutchison
Chapter 11: Comparability monitoring: progress report - Paul Newton
Chapter 1: Contextualising the comparability of examination standards - Paul Newton
The purpose of this introductory chapter is to set the scene for the chapters that follow. This will involve providing:
- a sense of the social and political climate in which debates over comparability are conducted in England (Part 1)
- a description of the unique organisational, structural and procedural components of examining in England (Part 2)
- an introduction to what comparability might actually mean in this context (Part 3), and
- an indication of steps taken, through regulatory mechanisms, to facilitate the comparability of examination standards (Part 4).
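Chapter 2: A brief history of policies, practices and issues relating to comparability - Kathleen Tattersall
This chapter, which describes developments in education and assessment in England from the mid-19th century to the present day, is in three parts: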
- the emergence of a ‘national system’ (the 1850s to the end of the Second World War);
- the development and expansion of the system (1945 to the 1990s);
- the emergence of a regulated system from the 1990s onwards.
The role of comparability at each stage of these developments is highlighted.
Chapter 3: Awarding examination grades: current processes and their evolution - Colin Robinson
The aim of this chapter is to describe the process by which each awarding body tries to ensure that the grading of candidates is comparable, no matter when, by whom, or on what aspects of the subject the candidate is assessed. These processes have developed over the years, mainly as a result of the experiences of the awarding bodies themselves, but also because of the impact of regulation, which has grown over time. The chapter describes this key activity of the awarding bodies, looking at the personnel involved, the information that is available to them and the decisions they have to make in order to ensure that the results maintain the multifaceted comparability requirements laid upon the awarding bodies. Changes that have been made over the years and the reasons for them are also outlined. The April 2007 edition of the code of practice lays out the comparability requirements quite clearly:
The awarding body’s governing council is responsible for setting in place appropriate procedures to ensure that standards are maintained in each subject examined from year to year (including ensuring that standards between GCE and GCE in applied subjects, as well as between GCSE and GCSE in vocational subjects, are aligned), across different [syllabuses] within a qualification and with other awarding bodies. QCA (2007)
Chapter 4: Alternative conceptions of comparability - Jo-Anne Baird
Comparable examinations have to be at the same standard. But what do people mean by ‘examination standard’ and what kinds of comparability are expected? How is evidence to be gathered about these types of comparability and are all of these approaches valid? This chapter outlines different definitions of examination comparability used in England by academics and the expectations of the media and general public. The purposes to which assessment results are put are discussed, as the alternative conceptions of examination comparability are linked to the uses of the assessment results. Given that there are different approaches, some commentators have proposed that we should select a single definition of examination standards and stick to it, so that the system is clearer and false expectations are not raised about what the examination system can realistically deliver. Whether a particular definition of examination standards can be prioritised above others is considered, as well as the implications of so doing.
Chapter 5: The demands of examination syllabuses and question papers - Alastair Pollitt, Ayesha Ahmed and Victoria Crisp
Aim
Examiners and many varieties of commentator have long talked about how ‘demanding’ a particular examination is, or seems to be, but there is not a clear understanding of what ‘demands’ means nor of how it differs from ‘difficulty’. In this chapter we describe the main efforts that have tried to elucidate the concept of demands, and aim to establish a common interpretation, so that it may be more useful in future for the description and evaluation of examination standards.
Definition of comparability
No definition of comparability is necessarily assumed. Sometimes it is apparent that researchers operate with a default assumption that two examinations are expected to show the same level in every aspect of demand, but it would be quite reasonable for one of them to, for example, require a deeper treatment of a smaller range of content than the other; comparability then requires these differences in the demands somehow to balance each other out. It is asking a lot of examiners to guarantee this balance, and a less ambitious approach requires only that the differences are made clear to everyone involved.
Comparability methods
Several methods have been used to look at demands, including: asking informally for impressions of the overall level of demand; asking for ratings of specific demands, or aspects of demand; systematic questionnaires addressing a set of standard demands applicable to many examinations; rating on abstract concepts of demands identified from empirical research. Throughout this work there has been a constant research aspect, as no fully satisfactory system has been developed so far. Theoretical input has come from research in the area and, in particular, from taxonomies of cognitive processes and personal construct psychology.
Strengths and weaknesses
Paying attention to the demands contained within examinations broadens the context of comparability studies, adding a third dimension to comparability. Rather than being just a matter of the ability of the students and the difficulty of the questions, a focus on demands addresses questions about the nature of the construct being assessed: statistical analysis may tell us that two examination grades are equally difficult to achieve, but it cannot tell us if they mean the same thing in terms of what the students can do with them. We are still, however, trying to develop a system to make this kind of comparison secure, and to establish a common set of meanings for the various terms in use to describe demands.
Conclusion
Demands play an important role in examining in that they are the principal means by which examiners try to control the nature of the construct. When they are constructing the papers and the mark schemes in advance of the test, they have an idea of what the students’ minds should be expected to do to achieve a particular grade; by manipulating the demands they try to design tasks that are appropriate for this purpose. To the extent that they succeed, appropriate standards are built into the examination in advance. In this chapter we describe three aims for the study of examination demands. We argue that a description of the nature of the demands is worthwhile in itself, that this can provide a basis for comparing different examinations, and that both of these are valuable even if it is not possible to go further and declare that they differ, or do not differ, in overall demand.
Chapter 6: Cross moderation methods - Robert Adams
Aim
The aim of this chapter is to give an account of cross-moderation methods of studying comparability of examination standards – that is, methods based on the scrutiny and judgement of candidates’ examination work – in order to describe how methods have evolved over time and to summarise the current understanding of how the methods work.
Definition of comparability
The methods described in this chapter pursue the weak-criterion-referencing definition of comparability. The criteria exist in the minds of experienced teachers and examiners, and the methods described here rely on their applying those ‘criteria’ to specially selected samples of candidates’ work, as expressed in examination scripts.
Comparability methods
Cross-moderation methods are simply systematic ways of looking at samples of candidates’ work that ought to be of the same standard.
History of use
Comparability approaches based on looking at candidates’ work go right back to the beginnings of collaborative work by the then GCE boards – forerunners of today’s awarding bodies – in the 1950s, though the first studies to be published date from the 1960s. Much of the chapter is concerned with tracing the history and evolution of the method since then.
Strengths and weaknesses
The undoubted strength of cross-moderation methods is that they appear ‘sensible’, that is to say that a lay person would understand how they address the problem, for example, of comparability between boards. From the practitioner’s point of view, the methods also closely mimic parts of the standards-setting and grade-awarding procedures. The weakness of the method is that the findings can never be unequivocal. The nature of examination standards and of human judgements is so nebulous that little definitive can be said of them.
Conclusion
The evolution of cross-moderation methods set out in this chapter represents a great deal of work by the awarding bodies over the 50-plus years that have seen comparability work undertaken. In that time, the nature of examinations has changed, as has the collective understanding of what examination standards actually might be.
Chapter 7: Paired Comparison Methods - Tom Bramley
The aims of this chapter are:
- to explain the theoretical basis of the paired comparison method;
- to describe how it has been used in the cross-moderation strand of inter-board comparability studies;
- to discuss its strengths and weaknesses as a judgemental method of assessing comparability.
Definition of comparability
The chapter follows the approach of Hambleton (2001) in distinguishing between content standards and performance standards. Cross-moderation exercises are shown to be comparisons of performance standards between the examination boards. The judges are expected to make judgements about relative performance standards in the context of possible differences in content standards. In this chapter the performance standard is conceived as a point on a psychometric latent trait.
Comparability methods
In a paired comparison task a judge is presented with a pair of objects and asked to state which possesses more of a specified attribute. In the context of this chapter, the objects are examinees’ scripts on a specified grade boundary from different examination boards, and the attribute is ‘quality of performance’. Repeated comparisons between different pairs of scripts across judges allow the construction of a psychological scale (trait) of ‘perceived quality’. Each script’s location on this scale depends both on the proportion of times it won and lost its comparisons, and on the scale location of the scripts it was compared with. Differences in the mean location of scripts from the different boards have been taken to imply a lack of comparability – that is, differences in performance standards. The chapter also describes a modification of the method to use rank-ordering rather than paired comparisons to collect the judgements (Bramley, 2005a). The underlying theory and method of analysis are the same.
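The scale construction described above can be illustrated with a minimal sketch. It assumes the logistic (Bradley-Terry) form of the paired comparison model, in which the probability that script i beats script j depends only on the difference between their scale locations, and fits it with a simple iterative scheme. This is an illustration under stated assumptions, not the analysis software used in the studies; the function name, the four scripts, the board labels and the win counts are all invented.

```python
import math

def fit_paired_comparisons(wins, n_iter=500):
    """Estimate a scale location (in logits) for each script.

    wins[i][j] is the number of times script i was judged better than
    script j. Assumes every script wins and loses at least once.
    """
    n = len(wins)
    strength = [1.0] * n  # exp(location); all scripts start equal
    for _ in range(n_iter):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            # Each comparison is weighted by the current strength of the
            # opponent, so the location depends on who a script was
            # compared with, not only on how often it won.
            denom = sum(
                (wins[i][j] + wins[j][i]) / (strength[i] + strength[j])
                for j in range(n) if j != i
            )
            updated.append(total_wins / denom)
        # Fix the origin of the scale so that the locations sum to zero.
        log_mean = sum(math.log(s) for s in updated) / n
        strength = [s / math.exp(log_mean) for s in updated]
    return [math.log(s) for s in strength]

# Invented data: four boundary scripts, each pair judged 10 times.
boards = ["A", "A", "B", "B"]
wins = [
    [0, 6, 7, 8],
    [4, 0, 6, 7],
    [3, 4, 0, 6],
    [2, 3, 4, 0],
]
locations = fit_paired_comparisons(wins)

def mean_location(board):
    values = [loc for loc, b in zip(locations, boards) if b == board]
    return sum(values) / len(values)

# A positive gap means board A's boundary scripts were perceived as better.
board_gap = mean_location("A") - mean_location("B")
print([round(loc, 2) for loc in locations], round(board_gap, 2))
```

The gap is expressed on the latent ‘perceived quality’ scale, not in raw marks, so on its own it does not say how far a grade boundary would need to move.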
History of use
The psychometric theory underlying the paired comparison method was developed by the American psychologist Louis Thurstone, who used it to investigate a wide range of psychological attributes (e.g. ‘seriousness of crime’). It was first used in comparability studies to compare some of the (then) new modular A level syllabuses against their linear equivalents (D’Arcy, 1997), and since then has been the favoured method for the cross-moderation strand of inter-board comparability studies, which are the focus of this chapter. It has also been used to investigate comparability of standards over time, and, in its rank-ordering guise, as a technique for standard maintaining – enabling a known cut-score on one test to be mapped to an equivalent cut-score on a new test.
Strengths and weaknesses
The method has several advantages in the cross-moderation context. First, the individual severities of the judges are experimentally removed – that is, it does not matter how good (in absolute terms) they think the scripts they are judging are: all that matters is their relative merit. Second, the analysis model naturally handles missing data because the estimate of the scale separation between any two scripts does not depend on which other scripts they are compared with. This means that data can be missing in a non-random way without affecting the results. Third, fitting an explicit model (the Rasch model) to the data allows investigation of residuals to detect misfitting scripts and judges, and judge bias. Finally, the approach is simple and flexible, allowing the design of the study to be tailored to the needs of the particular situation.
One drawback to using the method in this context is its psychological validity when the objects to be judged are as complex as scripts. In Thurstone’s own work the judgements could be made immediately, but here a certain amount of reading time is required. Also, the method assumes that each comparison is independent of the others, but this seems implausible given that judges are likely to remember particular scripts when they encounter them in subsequent comparisons. Finally, the paired comparison method is tedious and time-consuming for the judges, a drawback that can be remedied to some extent by using the rank-ordering method.
Conclusion
The paired comparison method of constructing psychological scales based on human judgements is well established in psychometric theory and has many attractive features which have led to its adoption as the preferred method in inter-board comparability studies. Many of the issues arising with the method are due to this particular context for its application. In practice, the most serious problem has not been with the method but with the design of the studies, which have not allowed differences between boards in terms of mean scale location to be related to the raw mark scales of the different examinations. This has made it impossible to assess the importance of any differences discovered (for example in terms of implied changes to grade boundaries). Both the paired comparison method and especially the rank-ordering method could easily address this shortcoming in future studies.
Chapter 8: Common test methods - Roger Murphy
Aim
To review the use of common test approaches in comparability research, assessing both the advantages of such an approach and the criticisms that have been levelled against it.
Definition of comparability
The debate about the usefulness of this method partly hinges on the definition of comparability assumed by those considering its use. The approach fits best with a statistical approach to comparability and least well with a standards referenced approach.
Comparability methods
The common test approach relies upon the examinations being compared having been taken by a sample of students, who have also all taken some other common test. The common test results are used as the basis for comparing the standards of different examinations. This is usually undertaken by plotting regression lines, which attempt to estimate the relationship between common test scores and examination scores.
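The regression comparison described above might be sketched as follows. This is a minimal illustration, assuming a simple least-squares line of examination mark on common test score for each examination; the function names, the reference score and all of the data are invented for the example.

```python
def fit_line(x, y):
    """Ordinary least-squares line y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(line, x):
    slope, intercept = line
    return intercept + slope * x

# Invented data: two groups of students, each taking a different
# examination plus the same common test.
test_scores_a = [40, 50, 60, 70]
exam_marks_a = [44, 56, 64, 76]
test_scores_b = [45, 55, 65, 75]
exam_marks_b = [54, 66, 74, 86]

line_a = fit_line(test_scores_a, exam_marks_a)
line_b = fit_line(test_scores_b, exam_marks_b)

# Compare the expected examination mark for students with the same
# common test score (the reference score is an arbitrary choice).
reference_score = 55
gap = predict(line_b, reference_score) - predict(line_a, reference_score)
print(round(gap, 2))
```

A non-zero gap is only evidence about standards to the extent that the common test relates equally well to both examinations.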
Strengths and weaknesses
The common test approach is a fairly simple method. It can therefore be easily explained to non-technical individuals who are interested in comparability issues. It is also relatively easy to collect the data needed and draw conclusions. However, it does depend upon the common test having a strong and consistent educational and statistical relationship with the examinations being compared. Critics of the approach point to the fact that common tests rarely have anything like the required relationship with examinations of the type for which comparability studies are required.
Conclusion
The method should only be used in circumstances where the relationship between the common test and the examinations to be compared can be studied closely and critically. Any comparability conclusions drawn from such a study need to be interpreted with caution, taking account of the known levels of uncertainty.
Chapter 9: Common examinee methods - Robert Coe
Aim
The chapter aims to provide a full description of the methods that have been used to monitor comparability between different examinations taken concurrently by the same candidate: common examinee methods. Criticisms that have been made of these methods are presented and discussed in relation to different interpretations and conceptions of comparability.
Definition of comparability
Three conceptions of comparability are considered: performance, in which comparability is defined in terms of observed phenomena in relation to specific criteria; statistical, in which comparability depends on an estimate of the chances of a particular grade being achieved; and construct, in which examinations are seen as indicating different levels of a common linking construct.
Comparability methods
A number of methods are described, including simple comparisons of the grades achieved in a pair of examinations (subject pairs analysis); the aggregation of such paired comparisons to compare a larger set of examinations; Nuttall et al.’s (1974) ‘unbiased mean total’ method; Kelly’s (1976a) method; analysis of variance; average marks scaling; Rasch modelling.
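The simplest of these, subject pairs analysis, can be sketched in a few lines. The example assumes grades are converted to points at equal intervals, which is itself an assumption; the points scale, subject names and candidate results are invented for illustration.

```python
# Hypothetical points scale, assuming equal intervals between grades.
GRADE_POINTS = {"A": 5, "B": 4, "C": 3, "D": 2, "E": 1, "U": 0}

def subject_pair_difference(results, subject_x, subject_y):
    """Mean grade difference (x minus y) among candidates who took both."""
    diffs = [
        GRADE_POINTS[r[subject_x]] - GRADE_POINTS[r[subject_y]]
        for r in results
        if subject_x in r and subject_y in r
    ]
    if not diffs:
        raise ValueError("no candidates took both subjects")
    return sum(diffs) / len(diffs), len(diffs)

# Invented results: one dictionary of subject -> grade per candidate.
candidates = [
    {"maths": "B", "physics": "C"},
    {"maths": "A", "physics": "B"},
    {"maths": "C", "physics": "C"},
    {"maths": "B", "history": "A"},  # excluded: did not take physics
    {"maths": "D", "physics": "E"},
]

mean_diff, n_common = subject_pair_difference(candidates, "maths", "physics")
print(mean_diff, n_common)
```

A positive mean difference says only that the common candidates achieved higher grades in the first subject; whether that reflects a difference in standards depends on the conception of comparability adopted.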
History of use
Most of these methods have been used since the 1970s, though their prominence appears to be less now than in those early days, at least in England.
Strengths and weaknesses
Some of these methods require specific assumptions, such as equal intervals between grades. Other assumptions (for example, that examinations must be uni-dimensional, or that differences in factors such as teaching quality or motivation must be ignored) may depend on how the methods are applied and interpreted. Hence, evaluating the strengths and weaknesses of the methods is complex. Most of the methods considered, however, are capable of valid interpretation under the right conditions.
Conclusion
Common examinee methods should be part of the systematic monitoring of standards across syllabuses, boards and subjects, though, like all comparability methods, they should be used with caution and judgement. However, their main value is probably in informing the interpretation and use of examination results for purposes such as selection.
Chapter 10: Multilevel Modelling Methods - Ian Schagen and Dougal Hutchison
Aim
The aim of the chapter is to introduce multilevel modelling as a key methodology for the analysis of data in comparability studies and show how it can be applied in different situations and to different data sets.
Definition of comparability
The main definition addressed is that of Cresswell (1996):
Two examinations have comparable standards if two groups of candidates with the same distributions of ability and prior achievement who attend similar schools with identical entry policies, are taught by equally competent teachers and are equally motivated, receive grades which are identically distributed after studying their respective syllabuses and taking their examinations. Cresswell (1996)
Following the consideration of interaction effects, the chapter identifies limitations of this definition and suggests that a new and more robust definition be considered.
Comparability methods
Although not a comparability method in the same sense as those described in other chapters, multilevel modelling underpins the quantitative approaches discussed elsewhere. It is a statistical modelling tool, derived from multiple regression, with the ability to include within-group clustering at a variety of levels in a unified and consistent fashion.
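As an illustration of how clustering enters the regression, a two-level model with candidates nested in centres might be written as below. This is a sketch under stated assumptions rather than the specification used in the chapter, and all symbols are introduced here for the example: y_ij is the examination outcome of candidate i in centre j, x_ij is prior attainment, b_ij is an indicator for the examining board, u_j is a centre-level random effect and e_ij is the candidate-level residual.

```latex
y_{ij} = \beta_0 + \beta_1 x_{ij} + \beta_2 b_{ij} + \beta_3 x_{ij} b_{ij} + u_j + e_{ij},
\qquad u_j \sim N(0, \sigma_u^2), \quad e_{ij} \sim N(0, \sigma_e^2)
```

Here β2 captures an overall difference between boards after allowing for prior attainment, while a non-zero β3 is an interaction: the difference between boards varies with prior attainment.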
History of use
Since its development in the 1980s, multilevel modelling has been applied in a wide variety of fields, including education, although examination comparability studies have been in some ways a minority application. Over recent years it has tended to replace other less sophisticated analysis methods as the preferred statistical approach. A brief review of studies using multilevel methods is included in the chapter.
Strengths and weaknesses
The main strength of multilevel modelling is its power and flexibility, and ability to model a wide range of scenarios and situations. As with all modelling, the weaknesses lie in the quality of the available data and problems with setting up models correctly to represent the important underlying relationships.
Conclusion
The main conclusions of the chapter are as follows:
- The advantages of multilevel modelling far outweigh any perceived disadvantages for this kind of work.
- Modelling should explore all possible aspects of comparability, including interactions between boards and key measures such as prior attainment.
- Where such interactions are detected, it is not clear that comparability is maintained – a new definition may be needed to encompass this.
Chapter 11: Comparability monitoring: progress report - Paul Newton
This conclusion presents a synthesis of the major themes that have emerged throughout the book. Some tentative answers are offered to the questions that motivated the review and some of the underlying tensions that make research in this field so challenging are revisited. The conclusion ends by considering prospects for the future of comparability monitoring in England.
