This article originally appeared in The Bar Examiner print edition, Winter 2019-2020 (Vol. 88, No. 4), pp. 8–14. By Sonja Olson
Fairness, consistency, and focus are the cornerstones of good grading of essays and performance tests on the bar examination. In this article, NCBE’s MEE/MPT Program Director shares best practices for grading these written components to ensure that they serve as reliable and valid indicators of competence to practice law.
Opinions may vary about what should be tested on the bar exam, but if there is one point of agreement, it is that lawyers need to be skilled at communicating in writing. And communicating in writing means much more than using proper syntax, grammar, and vocabulary. Lawyers must be able to adjust their writing to a variety of audiences, such as clients, courts, opposing counsel, and legislators. Essay questions and performance tests are therefore integral to evaluating whether an individual should receive a license to practice law.
The Multistate Essay Examination (MEE) and the Multistate Performance Test (MPT) would not function as reliable testing components without the dedication and care exercised by the graders in every jurisdiction that uses them. In this article I share some insights and best practices that I’ve learned over the years from our Grading Workshop facilitators, from our MEE and MPT Drafting Committees, and without a doubt, from the graders (veterans and newcomers) who participate in NCBE’s Grading Workshop after every February and July bar exam administration.1 Following these grading best practices ensures that the MEE and MPT serve as valid and reliable measures of basic competence to practice law.
1. Know the question (and the answer).
Every MEE question comes to the graders with the Drafting Committee’s analysis of the issues raised by the question and a discussion of the applicable law. In addition, we provide grading guidelines at the Grading Workshop. These guidelines, generally one to two pages, distill the issues discussed in the MEE analyses but also offer suggestions for distinguishing answers and may identify common areas where examinees struggle. This information is based on the workshop facilitator’s review of at least 30 actual MEE answers, which are sent to NCBE by jurisdictions after the bar exam. For the MPT, the drafters’ point sheet identifies the issues raised in the MPT and the intended analysis.
Familiarity with the grading materials not only allows a grader to give credit where it is due but also ensures that a grader can readily identify answers containing extraneous discussion that may be accurate (such as memorized portions of bar review outlines) but is not pertinent to a discussion of the issues raised by the problem.
Particularly with performance tests, which provide the relevant law, examinees may reproduce entire sentences, or longer passages, from the statutes, regulations, or cases in the test booklet. MPTs also present a more expansive collection of facts for examinees to master, and thus there is the temptation to recite extended portions of the facts in an answer. Familiarity with the text of these questions, from the beginning of grading, will make it much easier to identify examinees who are merely regurgitating material as opposed to synthesizing the relevant facts and law and producing a cogent analysis.
2. Know the applicable law.
Graders of the MPT have the luxury of having all examinees working from the same legal authorities, as the MPT is a closed-universe exam—that is, all the relevant law is provided as part of the MPT. So an MPT grader should have no worries that an examinee is referencing an alternative, but valid, legal doctrine—if it’s not in the Library portion of the test booklet, it’s probably not analysis that should receive credit.
New MEE graders, however, often ask whether it is expected that answers to MEE questions will contain the same level of analysis and legal citations as provided in the Drafting Committee’s analyses. The short answer is no—the MEE analyses are very detailed because the MEE Drafting Committee recognizes that graders may be assigned to grade questions in subject areas that are not frequently encountered in their law practices. The MEE analyses contain the legal authorities relevant to the problem and often some background material to help orient the grader. Graders of MEE questions may want to review the authorities cited in the analyses or other treatises and casebooks if grading a subject outside of their regular practice area.
3. Know the grading scale that your jurisdiction uses.
At the Grading Workshop, NCBE uses a six-point scale when discussing the grading of essays and MPTs. Some jurisdictions use another scale, such as 1 through 7 or 1 through 10. What matters is that the score scale is manageable enough that graders can make consistent and meaningful distinctions among answers without getting frustrated by trying to determine where an answer fits on an overly granular scale.2 All graders in a jurisdiction should be using the same scale and be in agreement on when an answer is so deficient as to warrant a zero (see best practice #13).
Graders should also know whether their jurisdiction requires that the grades conform to a particular distribution, such as a curve or equal percentages in each grading category. Note that one method isn’t preferred over another—the point is that all graders should be on the same page.3
4. Focus on rank-ordering.
No grader should bear the weight and added stress of believing that the grade they assign to an essay or performance test is what will tip the scale for that examinee and determine whether he or she passes the bar exam. The emphasis should be on rank-ordering the papers, not on whether an individual paper receives a passing or a failing grade. The score given to an essay by a grader is essentially a “raw” score because those essay grades will be scaled to the jurisdiction’s Multistate Bar Examination (MBE) scores.4 Only then will the “real” grade for that specific essay be determined, which will then be added to that examinee’s MBE score, other essay grades, and grades on any other bar exam components to produce the final score.
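For those curious about the mechanics, the sketch below illustrates the kind of linear rescaling described in note 4: raw essay grades are shifted and stretched so that their mean and standard deviation match the jurisdiction’s scaled MBE scores, which changes the numbers but not the rank order. The function name and the sample figures are purely illustrative assumptions; this is not NCBE’s actual procedure.

```python
from statistics import mean, pstdev

def scale_to_mbe(raw_essay_scores, mbe_mean, mbe_sd):
    """Rescale raw essay grades so their mean and standard deviation
    match the jurisdiction's scaled MBE scores; rank order is preserved."""
    raw_mean = mean(raw_essay_scores)
    raw_sd = pstdev(raw_essay_scores)
    return [mbe_mean + (raw - raw_mean) / raw_sd * mbe_sd
            for raw in raw_essay_scores]

# Illustrative only: six raw grades on a 1-6 scale, scaled to a
# hypothetical jurisdiction MBE mean of 140 and standard deviation of 15.
print(scale_to_mbe([3, 4, 2, 5, 6, 1], mbe_mean=140, mbe_sd=15))
```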
That being said, a grader should be able to articulate why a paper ends up at a different point on the grading scale vis-à-vis those papers receiving a higher or lower grade. At times, most papers may drop easily into particular “piles” on the grading scale based on simple criteria—for instance, that they cover the first two issues well but then do not reach a correct conclusion on the third issue.
Graders should turn off their inner editor and focus on how well the paper answers the call and demonstrates the examinee’s ability to reason and analyze compared to the other papers in the pile.
5. Achieve calibration to ensure consistency in rank-ordering.
Fairness to all examinees means that it shouldn’t matter when their papers are graded or by whom. Calibration is the means by which graders develop coherent grading judgments so that rank-ordering is consistent by a single grader as well as across multiple graders. The recommended practice is that a grader review at least 30 papers before grading “for real” to see what the range of answers is. Note that for both multiple graders and single graders, answers for each point on the grading scale should be identified before the “real” grading begins. This could require reviewing more than 30 papers.
For multiple graders: Reviewing at least 30 papers works well when there are two or more graders for a question. As graders read the same papers in the calibration packet, they should pause after every five or so answers and discuss what grades they have assigned. If they have graded the same papers differently, they should discuss those papers and come to an agreement for each paper. This process of grading, discussing, and resolving differences should continue through the whole calibration packet or until graders are confident that they are using the same criteria to differentiate papers.
For a single grader: It is just as important to review a calibration packet of 30 or more papers. The papers can be sorted into piles for each point on the grading scale. After reviewing the first 10 or 15 papers, the grader should revisit the grades given to the first papers to see if the initial grade still holds or if the paper in fact belongs in a different pile. Each pile should then be reviewed to verify that the papers in it are of a consistent quality. One approach that some graders have found helpful is to first separate answers into three piles (poor, medium/average, and good) and then review the papers in each pile, separating them into the 1s and 2s, the 3s and 4s, and the 5s and 6s.
6. Combat “grader drift.”
Graders can “drift,” or begin grading papers inconsistently, for a variety of reasons. Fatigue is a common reason, as is hitting a string of very poor (or very good) papers so that the next one seems very good (or very bad) when it is merely average. To guard against “grader drift,” all graders should have some answers from the calibration packet embedded into the papers they grade, with the score from the calibration session hidden. After grading the embedded paper (which may be on colored paper or otherwise marked as a part of the calibration set), the grader can compare the just-assigned grade with that from the calibration session and determine whether drift is occurring. For multiple graders, embedding the same answer from the calibration packet at the same point in each grader’s pile provides an opportunity to check that the graders are internally consistent and still applying the same standards.
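As a concrete illustration of that embedded-paper check, the short sketch below compares the grade just assigned to a calibration paper against the grade it received during the calibration session and flags any sizable gap. The paper names, grades, and one-point tolerance are hypothetical assumptions, not an NCBE tool or procedure.

```python
# Grades assigned during the calibration session (hypothetical values).
calibration_grades = {"cal-07": 4, "cal-12": 2, "cal-21": 5}

def check_drift(paper_id, grade_just_assigned, tolerance=1):
    """Flag possible grader drift when an embedded calibration paper is
    re-graded more than `tolerance` points away from its calibration grade."""
    original = calibration_grades.get(paper_id)
    if original is None:
        return None  # not a calibration paper, nothing to compare
    gap = grade_just_assigned - original
    if abs(gap) > tolerance:
        print(f"Possible drift on {paper_id}: calibration grade {original}, "
              f"grade just assigned {grade_just_assigned}")
    return gap

check_drift("cal-12", 4)  # a 2-point gap would be flagged for review
```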
7. Spread out grades over the entire score scale.
Rank-order grading only works as an effective assessment tool if graders take care to use the entire score scale. This does not mean that the final grades fit a particular curve. Rather, even a grader in a smaller jurisdiction who has fewer than 100 papers to grade should have no problem finding papers that slot into each point on the scale. There will be fewer 1s and 6s (in the case of a 6-point scale) and likely more 3s and 4s, with the number of 2s and 5s probably falling somewhere in between—but there will be papers that a grader, with confidence and justification, may reasonably place at each given point on the scale. While there are times when a question may be easier and most examinees appear to do well, a grader will still be able to find valid points of distinction among answers that will allow the grader to spread out the scores.5 Some graders may find it helpful to initially use pluses and minuses when grading and then to review those 4- and 4+ answers, for example, to see if they really belong in the 3 or 5 piles.
The table below illustrates the importance of spreading out grades. If there are two graders, each grading answers for different questions, and Grader A decides to use the whole score scale of 1 through 6 but Grader B thinks that all examinees performed about the same and gives out only 3s and 4s, the resulting combination of the scores from Grader A and Grader B demonstrates that it is really Grader A who, by taking care to use the entire range of possible scores, is determining how well, or how poorly, each examinee does overall.
| Examinee | Grader A | Grader B | Average Score |
|----------|----------|----------|---------------|
| A | 3 | 3 | 3 |
| B | 4 | 3 | 3.5 |
| C | 2 | 4 | 3 |
| D | 5 | 4 | 4.5 |
| E | 6 | 3 | 4.5 |
| F | 1 | 4 | 2.5 |
Lack of calibration between graders of the same question is unfair to examinees because their scores will be affected not by the quality of their answers but by whether they got the “easy” or the “hard” grader. On a similar note, if graders of different questions fail to spread out their grades, the questions whose grades are “bunched up” will ultimately have less impact on examinees’ overall scores.
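For readers who like to see the arithmetic, the following sketch reproduces the averages in the table above and shows that, because Grader B’s grades barely vary, the overall rank order of examinees tracks Grader A’s rank-ordering almost exactly. The data are simply the illustrative grades from the table.

```python
# Grades from the table above (illustrative data only).
grader_a = {"A": 3, "B": 4, "C": 2, "D": 5, "E": 6, "F": 1}
grader_b = {"A": 3, "B": 3, "C": 4, "D": 4, "E": 3, "F": 4}

# Average of the two graders' scores for each examinee.
averages = {ex: (grader_a[ex] + grader_b[ex]) / 2 for ex in grader_a}
print(averages)  # {'A': 3.0, 'B': 3.5, 'C': 3.0, 'D': 4.5, 'E': 4.5, 'F': 2.5}

# Grader B's bunched grades add almost no information, so the overall
# ranking mirrors Grader A's rank-ordering (ties broken arbitrarily).
print(sorted(averages, key=averages.get, reverse=True))  # ['D', 'E', 'B', 'A', 'C', 'F']
print(sorted(grader_a, key=grader_a.get, reverse=True))  # ['E', 'D', 'B', 'A', 'C', 'F']
```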
8. Approach each paper as an “empty bucket”—that is, look for reasons to give credit.
Just as we encourage graders at the Grading Workshop to avoid thinking that the pass/fail line is whether a paper receives a 3 as opposed to a 4 on a 6-point scale, we encourage graders to approach each paper as an “empty bucket” and to view their task as searching for points to add to the bucket. It is much more likely that a grader can be consistent across papers in what he or she will give credit for, instead of attempting to be fair and consistent in all the ways a paper could be penalized.
9. Grade in a compressed time period.
Some jurisdictions, where the number of examinees means that grading cannot be completed over the course of a long weekend, may set targets for the number of papers that a grader can reasonably grade in a day. Certainly, grading is not something that should be rushed. But it is much easier to maintain calibration if the grader doesn’t have to get reacquainted with the details of the legal analysis and the quality of the answers because of a start-and-stop grading process spread out over several weeks. To the extent possible, grading should be done over a shorter time period.
10. Know the additional factors to consider when assigning grades.
Know what factors are legitimate grounds for assigning different grades to papers. Obviously, the content and substance of the answer are the first indication—what parts of the question did the examinee answer correctly? But other qualities are valid reasons for distinguishing papers.
Response to the call of the question
For both MEEs and MPTs, the answer should respond to the call of the question asked—and not the question that the examinee may have preferred to answer. For example, the examinee may launch into a discussion of whether a contract was validly entered into when the call of the question specifically asks for an assessment of the amount of damages the plaintiff is likely to recover. If that examinee then goes on to provide discussion that does respond to the specific call, the examinee will receive credit for the good content and at that point, the grader can generally ignore the extraneous material. Examinees who inflate their answers with a lot of extraneous material effectively penalize themselves: including the irrelevant material leaves the examinee less time to devote to the legal issues that are raised by the call. If two papers have approximately equal good content, but one is cluttered by unnecessary material, the one that adheres to the relevant issues is the better answer, although depending on the overall group of answers, those two papers could end up in the same pile.
Additional factors to consider, especially with MPTs, are the answer’s format, structure, and tone, and whether the examinee followed directions (e.g., if the task is to draft a persuasive letter to opposing counsel and the examinee instead writes an objective memorandum, that should be taken into account in determining the examinee’s grade). Finally, the analysis should state the applicable legal standard, marshal the relevant facts, and apply the law to the facts of the problem.
Accuracy in stating facts
In a similar vein, examinees may get a fact or two wrong when writing their answers. Even with MEE questions, which are generally just one page long, in the rush to produce an answer in 30 minutes, it is not unusual for an examinee to misread or misstate facts. The mistake may be very minor (e.g., getting a character’s name wrong). If it is clear from the context whom the examinee is discussing, such an error can probably be ignored. But if an examinee misstates a fact and then hinges part of the analysis on that incorrect statement, that should likely be considered when grading. After all, an important lawyering skill is paying attention to the facts that matter and getting them right when presenting a legal argument or analysis.
Written communication skills
For jurisdictions that have adopted the Uniform Bar Examination (UBE), the UBE Conditions of Use mandate that graders take into account written communication skills when grading, although no discrete weight is provided for that component. Each jurisdiction may have specific guidelines for how its graders should handle papers that are riddled with typos, exhibit poor grammar, or contain irrelevant information (legal or factual), among other things. Obviously, there will be a point where a paper’s typos and poor grammar will make it impossible to discern whether the examinee does comprehend the relevant legal principles, and in such cases, a lower grade is warranted. But typos and occasional poor grammar, in themselves, should generally not factor into the grading decisions for most papers. NCBE suggests ignoring typos for the most part because it is unreasonable to expect perfection in typing skills given the time pressure of the exam.
When assessing the quality of the writing, the focus should generally be on characteristics such as logical and effective organization, appropriate word choice and level of detail, and the presence or absence of a clear conclusion. The quality of the writing does matter, and while it remains important in MEE answers (and its absence is more obvious, if only because MEE answers are fairly brief), it comes to the forefront when grading MPTs. For one thing, while essay prompts ask the examinee to provide solid, reasoned legal analysis, the MPT instructs the examinee to consider both the audience of the work product and what tone is called for, objective or persuasive, to properly complete the task.
11. Know when to assign partial credit.
Essay exams are more forgiving than multiple-choice questions. If an examinee taking the MBE knows the relevant legal rule and is able to narrow the answer down to two options, one of which is the correct answer, but still selects the wrong answer, the examinee receives no credit for that question. The Scantron machine doesn’t care how close the examinee came to the right answer. But essay questions give examinees a chance to earn partial credit—they have an opportunity to demonstrate their ability to identify relevant facts and employ legal reasoning to reach a conclusion. Even if the ultimate conclusion is incorrect, an examinee who has stated the correct legal rule and then produced a cogent analysis of how the law would apply should get substantial credit. Graders should spend enough time on each paper to see where the examinee has shown some knowledge of the law and how it would apply to the given situation, even if the examinee does not reach the “correct” conclusion.
Similarly, the fact that an examinee hasn’t remembered the correct name of a legal doctrine doesn’t exclude that paper from receiving at least some credit. Depending on the range of quality of answers, an examinee should receive some amount of credit, even substantial credit, for describing the applicable rule or doctrine. The grader should ask whether the examinee’s discussion indicates that he or she is applying the same criteria covered by the relevant doctrine.
12. Acknowledge when a paper is incomplete.
With incomplete papers, those where the examinee clearly ran out of time (sometimes as obvious as a final sentence that cuts off, or a missing final issue, or analysis that starts strong but gets more superficial and conclusory toward the end), the grader can’t provide the answer that the examinee didn’t get to, no matter how promising the first paragraphs are. Fairness to all examinees requires that a grader award credit only for what is on the page, as other examinees were able to complete the essay or performance test in the time allowed by appropriately managing their time.
13. Know when to assign a zero.
All graders in a jurisdiction should be in agreement about when a paper should receive no credit, that is, a zero. A score of zero should be reserved for a blank page or an answer that is completely nonresponsive to the call of the question. This is important because essay answers that receive a zero are excluded from the reference group that is used to determine the formula for scaling essay scores to the MBE. Earning a 1 instead of a 0 should require that the examinee has made an honest attempt to answer the question.
Conclusion
Fairness, consistency, and focus are the cornerstones of good grading. Following these practices in grading bar exam essays and performance tests will not lessen the workload, but it will help ensure that bar exam essays and performance tests serve as reliable and valid indicators of an examinee’s competence to practice law, that scores are fair to examinees and are the result of meaningful differences in the quality of the answers, and that the quality of the writing—an important skill for all lawyers, regardless of practice area—is considered as a grading criterion.
Notes
1. NCBE’s MEE/MPT Grading Workshop is held in Madison, Wisconsin, the Saturday after each administration of the bar exam. The purpose of the workshop is to identify trends that graders will likely see when grading the MEE and the MPT in their jurisdictions as well as to discuss any questions graders have about the applicable law or the grading materials. While the workshop gives graders an orientation for grading, it is not intended to be a calibration session; that is best accomplished using a calibration packet comprising papers solely from the grader’s jurisdiction. (Calibration is the means by which graders develop coherent grading judgments so that rank-ordering is consistent by a single grader as well as across multiple graders.)
2. See Mark A. Albanese, PhD, “The Testing Column: Essay and MPT Grading: Does Spread Really Matter?” 85(4) The Bar Examiner (December 2016) 29–35, at 30: “For the purposes of illustrating how spread in grades affects the [standard deviation—that is, the average deviation of scores from the mean—] a six-point scale works fairly well. There are enough different grade points that spread can be easily seen, yet not so many that one gets lost in the details of computation.” See also Susan Case, PhD, “The Testing Column: Bar Examining and Reliability,” 72(1) The Bar Examiner (February 2003) 23–26, at 24: “All else being equal, more score gradations work better than fewer score gradations. The key is to make sure that the scale reflects the level of judgments the grader can make…. A six-point grading scale tends to work better than a four-point grading scale. Something much broader, like a 20-point grading scale, would work better than a six-point scale, but only if the grader could make reasonable, consistent, meaningful decisions along that scale.”
3. See Mark A. Albanese, PhD, supra note 2, at 32: “From a practical standpoint, we want to spread scores out as much as possible, but it is not necessary for the number of essays to be evenly distributed in each grade category; there are a range of distributions that achieve reasonably spread-out grades, but they tend to involve having some percentage of examinees in each grade category and not ‘bunching up’ examinees too much into a small number of grade categories. In other words, uniform and bell-shaped distributions of grades are reasonable ways of ‘bucketing’ examinees to ensure good spread in grades.”
4. Scaling is a procedure that statistically adjusts raw scores for the written components of the bar exam (the MEE and the MPT) so that collectively they have the same mean and standard deviation (average distance of scores from the mean) as the jurisdiction’s scaled MBE scores. See Susan Case, PhD, “The Testing Column: Frequently Asked Questions about Scaling Written Test Scores to the MBE,” 75(4) The Bar Examiner (November 2006) 42–44, at 42: “In the bar examination setting, scaling is a statistical procedure that puts essay or performance test scores on the same score scale as the Multistate Bar Examination. Despite the change in scale, the rank ordering of individuals remains the same as it was on the original scale.”
5. Scaling (see supra note 4) takes advantage of the equated MBE scores and therefore accounts for variance in difficulty of the essay questions from one administration to the next. See Mark A. Albanese, PhD, “The Testing Column: Scaling: It’s Not Just for Fish or Mountains,” 83(4) The Bar Examiner (December 2014) 50–56, at 55: “Scaling essay scores to the MBE will … stabilize passing rates even though the intrinsic difficulty of essay questions may vary.”
Sonja Olson is the MEE/MPT Program Director for the National Conference of Bar Examiners.