By Drew Weiner, MS, and Joanne Kane, PhD
Imagine a map quiz where students must identify the 50 US states. One student studies the states in order from west to east but runs out of time, learning only the westernmost 25 states. Because the quiz covers all 50 states, it will be an excellent indicator of the student’s knowledge; they will most likely score close to 50% (they could score higher with lucky guesses, or lower with unlucky ones). If, however, the teacher wrote a quiz covering only 5 states, it would be a worse indicator of this student’s true score.
Why? Not only is it impossible to score exactly 50% on a five-item assessment, but, with only five states, the test could easily overrepresent western states, leading to a score overestimate for such a student. The quiz could conversely underrepresent the western states, leading to a score underestimate. Furthermore, a single lucky guess on such a short quiz would have a large influence on the student’s overall score. To minimize the impact of luck and ensure that the map quiz accurately represents students’ knowledge, the quiz should (1) cover a selection of states that is representative of the entire country and (2) cover as many states as possible.
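To make the sampling effect concrete, the short simulation below is a rough sketch of the quiz scenario, not part of the original example: it assumes a student who knows exactly 25 of the 50 states and, for simplicity, ignores guessing. Shorter quizzes produce noisier estimates of the same underlying 50% knowledge.

```python
import random

# A rough sketch of the sampling effect in the map quiz example (hypothetical
# setup): the student knows exactly 25 of the 50 states; guessing is ignored.
random.seed(0)

STATES = list(range(50))                 # the 50 states, labeled 0-49
KNOWN = set(range(25))                   # the 25 "western" states the student learned
TRUE_SCORE = len(KNOWN) / len(STATES)    # 0.50

def quiz_score(n_items: int) -> float:
    """Proportion correct on a quiz built from n_items randomly chosen states."""
    items = random.sample(STATES, n_items)
    return sum(1 for s in items if s in KNOWN) / n_items

print(f"true score: {TRUE_SCORE:.2f}")
for n_items in (5, 25, 50):
    scores = [quiz_score(n_items) for _ in range(10_000)]
    avg = sum(scores) / len(scores)
    sd = (sum((x - avg) ** 2 for x in scores) / len(scores)) ** 0.5
    print(f"{n_items:>2}-item quiz: mean score {avg:.2f}, spread (SD) {sd:.2f}")

# Every quiz length is centered near the true score of 0.50, but the spread
# shrinks as the quiz grows; the 50-item quiz covers every state, so its
# score is exactly 0.50 every time (SD = 0).
```

In other words, short quizzes are not biased on average, but any single short quiz is an unreliable estimate of what the student actually knows.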
This thought exercise helps illustrate various factors that might—and might not—affect both an individual’s performance on an exam and that of different groups as a whole (i.e., aggregated across individuals). In every spring issue of The Bar Examiner—including this one—NCBE publishes bar exam pass rates broken down by several factors, including by jurisdiction and source of legal education. This annual statistics issue also includes pass rates based on whether candidates were first-time exam takers versus repeaters and whether candidates were taking the examination as a condition of reinstatement after disbarment or suspension.1 Each year’s bar exam results often show familiar patterns: pass rates tend to be higher in July than in February. First-time takers tend to have higher pass rates than repeaters (often much higher).2 Pass rates tend to be higher among candidates who attended ABA-accredited law schools in the United States than those who attended law schools outside the United States. And examinees seeking reinstatement tend to perform worse than other examinees.
Despite the familiar patterns across time, each year’s statistics can include differences that may seem unexpected. For example, a recent year-over-year comparison in one jurisdiction showed a change from an 86% pass rate one February administration to a 64% pass rate the following February administration, coinciding with a 25-point decrease in the average Uniform Bar Examination (UBE) score. How did such a sizeable change happen from one year to the next? Did the examinees fail to study the necessary material, or could the exam itself be to blame, as with the hypothetical map quiz above? This article will outline some possibilities and explain which are and are not likely to contribute to large differences in exam performance.3
Possibility #1: The Exam Content Changed
One factor that could influence performance is whether exam content has changed. Content can change in two senses: the collection of questions asked can differ, or the content domain being sampled can differ.

Returning to the map quiz example, if we gave two 50-question quizzes asking students to identify all 50 US states, the content and the content domain would be identical on each; the “sample” of states would match the “population” of states. Each quiz would ask about all states, and the content domain would consistently be the US states. If, however, we gave two 5-question quizzes asking students to identify a subset of states, the sample (5) would not fully overlap with the population (50). We could ask about the same states each time, or about two different sets of states; the content domain would be the same, but the questions themselves would differ and, as a result, so could the overall difficulty. That is the first sense in which content can change. In the second sense, the content domain itself could change: if, for example, the first quiz asked only for the states but a second quiz asked for both the states and the state capitals, the content domain would have changed (i.e., expanded). Note that future Bar Examiner articles will explore comparability and concordance of scores from different exams with somewhat different content domains.
To maintain exam security and fairness, NCBE does not administer the same bar exam form multiple times. The set of questions asked is not identical from administration to administration, though some individual items are repeated across forms for purposes of equating. However, the content domain is held constant. For each exam administration, a group of legal and psychometric subject matter experts constructs a unique exam consisting of items that sample from the same content domain as previous exams, as described in the Multistate Bar Examination (MBE) Subject Matter Outline.4 Items are selected and carefully placed so that exam difficulty is as similar as possible no matter when a candidate takes the exam.5 After the exam is composed, administrators work hard to ensure the exam is consistent in its delivery methods, its time limits, its minimally distracting environment, its lack of access to outside resources, and its format.6 Thus, although changes in exam content can have a large impact on exam performance in general, by design they are unlikely to have much impact on bar exam performance specifically.
Possibility #2: The Scoring Model Changed
With classroom assessments such as a map quiz, it’s easy to imagine that a teacher might change their scoring methods over time, or that different teachers would approach grading with varying levels of leniency. Partial credit could be awarded, or not. Close-but-not-correct answers could receive credit, or not. Spelling could matter, or not. A teacher could penalize wrong answers to discourage guessing, or not. A teacher could ask students to wager points per question based on their confidence. Questions very few students got correct could be dropped from scoring and instead considered “extra credit.” In low-stakes assessments, grading tends to be flexible and can change over time, and differences in strictness across teachers or time might be tolerated.
In contrast, for high-stakes assessments such as the bar exam, scoring is consistent and very tightly controlled. Any remaining differences in exam difficulty between administrations (even after the careful test form construction process described above) are minimized using equating and score-scaling procedures. As noted, the MBE includes a number of previously used equator items to gauge changes in difficulty over time and adjust scores so difficulty remains constant.7 The Multistate Essay Examination (MEE) and Multistate Performance Test (MPT) are scaled to the MBE; this means the different grading ranges jurisdictions use, the different subject matter experts who grade the exams, and the different cohort compositions against which an individual essay might be compared do not advantage or disadvantage an examinee for testing during a particular MEE and MPT administration.8 This work minimizes changes in bar exam composition and difficulty over time.
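To illustrate the idea behind score scaling (a simplified sketch only, not NCBE’s operational procedure, which is described in the references), the code below linearly transforms made-up raw written scores so that, within a cohort, they share the mean and standard deviation of that cohort’s MBE scores while preserving each examinee’s relative standing.

```python
from statistics import mean, stdev

# A simplified illustration of linear score scaling (not NCBE's operational
# procedure). All numbers below are made up for illustration.
mbe_scores = [128.4, 141.2, 135.0, 152.7, 119.8, 146.3]   # hypothetical scaled MBE scores
raw_written = [38, 47, 42, 55, 31, 50]                     # hypothetical raw written points

mbe_mean, mbe_sd = mean(mbe_scores), stdev(mbe_scores)
raw_mean, raw_sd = mean(raw_written), stdev(raw_written)

# An examinee one SD above the raw written mean ends up one SD above the MBE
# mean after scaling; rank order and relative differences are preserved.
scaled_written = [mbe_mean + mbe_sd * (r - raw_mean) / raw_sd for r in raw_written]

for raw, scaled in zip(raw_written, scaled_written):
    print(f"raw written {raw:>2} -> scaled written {scaled:6.1f}")
```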
Possibility #3: The Examinees Changed
A unique group of candidates takes each administration’s bar exam. Neither the total number of candidates nor the cohort’s composition is ever identical to that of the previous administration. Although cohorts are similar in many ways and shifts typically happen at a slow pace, the number of people entering law school changes over time,9 and with it, so does law school selectivity; candidates possess different levels of preparedness, and variation in bar exam performance over time reflects this. Although many stakeholders feel at ease when average bar exam performance is steady in their jurisdiction, this is only desirable insofar as it reflects stable candidate readiness to practice law, and this aggregate readiness can fluctuate between administrations (e.g., due to the opening or closing of a law school or economic incentives driving lawyers to and from different jurisdictions).
The quality of instruction could also change over time. Improvements in the clarity, efficiency, or effectiveness of instruction could raise pass rates for the cohorts that receive it. Factors outside of school could matter too. Revisiting the map quiz example, if maps and images of a particular state dominated the news cycle in the days leading up to a quiz, students might have an easier time identifying that state because it would be top of mind. Similar effects could be seen if popular media such as video games or TV shows included some focus on geography; a cohort more likely to engage with that media might outperform a cohort less likely to do so.
Definitions of Terms
Assessment: A method of gaining information about a person’s knowledge, skills, or other characteristics.
Content Domain: A set of knowledge, skills, or other characteristics measured by an assessment.
Equating: A statistical process through which the results of an assessment are transformed so that they can be compared.10
Equator items: Items on an assessment that are unchanged over time (e.g., in July 2023 and July 2024), so that differences in performance can be attributed to differences in the examinees rather than differences in the assessment. For example, if the July 2023 MBE cohort performed worse than the July 2024 MBE cohort, but the two cohorts performed identically on the equator items, this suggests that the cohorts had the same proficiency, but the July 2023 MBE assessment was more difficult.
Score scaling: A process through which the results of an assessment are transformed so that they are expressed on the same score scale as another assessment. In the context of the bar exam, written exam scores are scaled such that they have the same mean and standard deviation as the MBE for the same cohort. Scaling maintains the rank order and relative differences between examinees. For example, if an examinee scored one standard deviation higher than the mean written score in their jurisdiction, after scaling, their scaled written score would be one standard deviation above the scaled written score mean in their jurisdiction.11
Scaled exam scores: A score produced for each candidate by translating raw scores to a standardized (scaled) score scale once equating is complete.
Standard deviation (SD): A statistic that describes the extent to which a set of observations are spread out relative to their mean. When observations are normally distributed, around 68% of observations are located within +/− 1 standard deviation from the mean, around 95% of observations are located within +/− 2 standard deviations from the mean, and around 99% of observations are located within +/− 3 standard deviations from the mean.
Test item: A question, prompt, task, problem, or other element generating a scoreable response from an examinee.
Possibility #4: There Are Too Few Observations for Stability
Just as longer assessments tend to be more reliable, larger exam cohorts tend to be more stable. Large pass rate differences are more likely to be observed in smaller jurisdictions. At an extreme, consider a jurisdiction with just five examinees: a difference of just one examinee passing versus failing means a swing of 20 percentage points in the pass rate. In a larger jurisdiction of 500 examinees, the swing from a single examinee passing versus failing is much smaller (0.2 percentage points).
In general, when more measurements are taken, lucky overestimates and unlucky underestimates tend to occur at similar rates and balance each other out; the average tends to be closer to the true score. This is sometimes called the law of large numbers. We see it in many everyday situations: it is not surprising to flip a coin and get two heads in a row, but it is virtually impossible to flip a coin and get two hundred heads in a row. The more times you flip the coin, the closer the results will tend to converge on the true, underlying heads rate of 50%. Similarly, if you roll a die many times, you will tend to see each face at a similar rate. The bar exam is quite long so that it can provide as accurate and stable an estimate of examinees’ true knowledge as possible, minimizing the impact of random luck.
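The same principle can be seen in a toy pass-rate simulation. The sketch below uses hypothetical numbers only and assumes every examinee has an identical 75% chance of passing; even so, small cohorts produce far more volatile observed pass rates than large ones.

```python
import random

# A toy simulation, not real bar exam data: every examinee in every cohort is
# assumed to have the same 75% chance of passing; only cohort size differs.
random.seed(0)
TRUE_PASS_RATE = 0.75

def observed_pass_rate(n_examinees: int) -> float:
    """Share of a simulated cohort that passes."""
    return sum(random.random() < TRUE_PASS_RATE for _ in range(n_examinees)) / n_examinees

for n in (5, 40, 500):
    rates = [observed_pass_rate(n) for _ in range(10_000)]
    print(f"cohort of {n:>3}: observed pass rates range from "
          f"{min(rates):.0%} to {max(rates):.0%}")

# Small cohorts can swing from well below to well above the underlying 75%
# rate purely by chance; cohorts of 500 stay within a few points of it.
```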
Larger groups (such as all the examinees for the July 2024 administration) will tend to be more stable over time than smaller groups (such as the group testing in any single jurisdiction, particularly a comparatively smaller jurisdiction). Just as an examinee’s score might fall because they encounter a question they did not prepare for, a jurisdiction’s average might fall because an examinee performs below their true level of knowledge or ability due to illness, exhaustion, or any other reason. Seeing a question for which you did not prepare is unlucky but has little effect on the final score because it is only one question amid multiple days of testing. Similarly, a single underperforming examinee will have a minimal impact on the jurisdiction average if the jurisdiction normally seats many examinees.
To demonstrate this, for each jurisdiction and administration from February 2013 to July 2024, we subtracted the prior corresponding administration’s average UBE score from that administration’s average UBE score.12 For example, if a jurisdiction’s July 2022 average UBE score was 280 and the same jurisdiction’s July 2023 average UBE score was 275, the difference would be −5 scaled score points. This would represent one data point in Figure 1, which shows changes within jurisdictions from year to year.
These year-to-year differences in average UBE score are organized by the average number of examinees tested in a jurisdiction. The smallest jurisdictions (top panel), which seat an average of fewer than 100 examinees per administration, have the greatest amount of variation in their year-to-year changes (SD = 7.54). The middle-sized jurisdictions (100–400 examinees; middle panel) have slightly less variation (SD = 6.24), and the largest jurisdictions (over 400; bottom panel) have the least variation (SD = 4.17).
All three groups show variation: there are some jurisdictions with zero change, but most show at least some change. In the largest group, year-to-year UBE score differences tend to be smaller than 10 scaled score points (usually closer to 5). The smaller the jurisdiction, the likelier it is that larger performance differences will be observed. Figure 1 shows this: larger jurisdictions are less influenced by the occasional unusual score and show less year-to-year variation overall than smaller jurisdictions.
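For readers who want to see the shape of the Figure 1 computation, the sketch below runs it on made-up numbers; the actual analysis uses NCBE’s jurisdiction-level UBE averages, and the jurisdiction labels, scores, cohort sizes, and column names here are illustrative only.

```python
import pandas as pd

# A sketch of the computation behind Figure 1, using made-up numbers. The real
# analysis uses jurisdiction-level UBE averages from February 2013 to July 2024
# and compares each July with the prior July and each February with the prior
# February.
df = pd.DataFrame({
    "jurisdiction": ["A", "A", "A", "B", "B", "B"],
    "year":         [2021, 2022, 2023, 2021, 2022, 2023],   # July administrations only, for brevity
    "avg_ube":      [281.0, 280.0, 275.0, 270.5, 272.0, 271.0],
    "n_examinees":  [45, 38, 41, 620, 655, 640],
})

# Year-to-year change within each jurisdiction (e.g., 275 - 280 = -5 points).
df = df.sort_values(["jurisdiction", "year"])
df["yoy_change"] = df.groupby("jurisdiction")["avg_ube"].diff()

# Bucket jurisdictions by average cohort size, mirroring Figure 1's three panels.
avg_size = df.groupby("jurisdiction")["n_examinees"].transform("mean")
df["size_group"] = pd.cut(avg_size, bins=[0, 100, 400, float("inf")],
                          labels=["<100", "100-400", ">400"])

# Spread (SD) of year-to-year changes within each size group.
print(df.groupby("size_group", observed=True)["yoy_change"].std())
```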
The example jurisdiction mentioned earlier in the article, where the pass rate changed from 86% to 64%, was a particularly small one, with around 40 examinees per administration on average. As Figure 1 illustrates, group performance does not normally fluctuate this much. However, even with careful test assembly and equating, diligent and consistent administration, and an exam that is long and representative of the content domain, we observe some differences in performance across time.
Figure 1: Year-to-year UBE scaled score changes for all administrations from February 2013 to July 2024 grouped by average jurisdiction size
What does all this mean for stakeholders? Both candidates seeking to become licensed lawyers and individuals in need of legal services can take comfort that many psychometricians and legal subject matter experts, both within and outside NCBE, have striven to ensure that the bar exam is a representative sample of the knowledge needed to practice law, and that score differences primarily represent real differences in examinee knowledge—not luck or other factors irrelevant to the practice of law.

Stakeholders interested in average performance within their jurisdiction (whether considering pass rates or average exam scores) should understand that changes in performance may be driven by real differences between cohorts taking the bar exam. Average performance in smaller jurisdictions is more noticeably affected by fluctuations in groups of test takers over time. Smaller jurisdictions especially should treat administration averages with some skepticism—changes could be driven by real, population-level changes affecting all examinees, but a change in pass rate might simply reflect the instability that comes with too few observations (e.g., taking a map quiz covering only five states). We reiterate that this natural instability might affect jurisdictions where a small number of candidates’ scores are aggregated to form an estimate, but this does not apply to individual examinees whose scores result from hundreds of questions over multiple days of testing. Some score variation is normal in all jurisdictions, and jurisdictions with fewer examinees should expect larger amounts of variation.
Notes
1. The Bar Examiner website has an archive of recent statistics at https://thebarexaminer.ncbex.org/statistics/.
2. See “First-Time Exam Takers and Repeaters in 2024.”
3. Performance can be conceptualized as actual scaled exam scores, or as aggregate pass rates; the principles described in the rest of the article can be applied to either metric.
4. See the MBE Subject Matter Outline.
5. Joanne Kane, PhD, and April Southwick, “The Testing Column: Writing, Selecting, and Placing MBE Items: A Coordinated and Collaborative Effort,” 88(1) The Bar Examiner 46–49 (Spring 2019).
6. Mark A. Albanese, PhD, “The Testing Column: The Bar Admission Administrator and NCBE: The Dynamic Duo,” 87(1) The Bar Examiner 55–58 (Spring 2018).
7. Mark A. Albanese, PhD, “The Testing Column: Equating the MBE,” 84(3) The Bar Examiner 29–36 (September 2015).
8. Judith A. Gundersen, “It’s All Relative—MEE and MPT Grading, That Is,” 85(2) The Bar Examiner 37–45 (June 2016); Mark A. Albanese, PhD, “The Testing Column: Let the Games Begin: Jurisdiction-shopping for the Shopaholics (Good Luck with That),” 85(3) The Bar Examiner 51–56 (September 2016).
9. Mark A. Albanese, PhD, “The Testing Column: July 2019 MBE: Here Comes the Sun; August 2019 MPRE: Here Comes the Computer,” 88(3) The Bar Examiner 33–35 (Fall 2019).
10. For a deeper explanation of NCBE’s equating process, see Mark A. Albanese, PhD, “The Testing Column: Equating the MBE,” 84(3) The Bar Examiner 29–36 (September 2015).
11. For a deeper explanation of NCBE’s score-scaling process, see “The Testing Column: Scaling, Revisited,” 89(1) The Bar Examiner 68–75 (Fall 2020).
12. July administrations were subtracted from July administrations, and February administrations from February administrations, because July and February administrations are consistently different in meaningful ways. For example, July administrations involve mostly first-time examinees, whereas February administrations contain a high proportion of examinees who have previously attempted the exam, so July and February administrations tend to have large differences in pass rates. See supra note 2.
Drew Weiner, MS, is a Senior Research / Psychometric Analyst for the National Conference of Bar Examiners.
Joanne Kane, PhD, is Associate Director of Psychometrics for the National Conference of Bar Examiners.
This article originally appeared in The Bar Examiner print edition, Spring 2025 (Vol. 94, No. 1), pp. 50–55.