Creating and Scoring Essay Tests


Essay tests are useful when teachers want students to select, organize, analyze, synthesize, and/or evaluate information. In other words, they call on the upper levels of Bloom's Taxonomy. There are two types of essay questions: restricted response and extended response.

  • Restricted Response - These essay questions limit what the student will discuss in the essay based on the wording of the question. For example, "State the main differences between John Adams' and Thomas Jefferson's beliefs about federalism" is a restricted response: the question itself tells the student what to write about.
  • Extended Response - These allow students to select what they wish to include in order to answer the question. For example, "In Of Mice and Men, was George's killing of Lennie justified? Explain your answer." The student is given the overall topic but is free to use their own judgment and integrate outside information to help support their opinion.

Student Skills Required for Essay Tests

Before expecting students to perform well on either type of essay question, we must make sure that they have the required skills to excel. Following are four skills that students should have learned and practiced before taking essay exams:

  • The ability to select appropriate material from the information learned in order to best answer the question.
  • The ability to organize that material in an effective manner.
  • The ability to show how ideas relate and interact in a specific context.
  • The ability to write effectively in both sentences and paragraphs.

Constructing an Effective Essay Question

Following are a few tips to help in the construction of effective essay questions:

  • Begin with the lesson objectives in mind. Make sure to know what you wish the student to show by answering the essay question.
  • Decide if your goal requires a restricted or extended response. In general, if you wish to see if the student can synthesize and organize the information that they learned, then restricted response is the way to go. However, if you wish them to judge or evaluate something using the information taught during class, then you will want to use the extended response.
  • If you are including more than one essay, be cognizant of time constraints. You do not want to punish students because they ran out of time on the test.
  • Write the question in a novel or interesting manner to help motivate the student.
  • State the number of points that the essay is worth. You can also provide them with a time guideline to help them as they work through the exam.
  • If your essay item is part of a larger objective test, make sure that it is the last item on the exam.

Scoring the Essay Item

One of the downfalls of essay tests is that they lack reliability. Even when teachers grade essays with a well-constructed rubric, subjective decisions are made. Therefore, it is important to be as consistent as possible when scoring your essay items. Here are a few tips to help improve reliability in grading:

  • Determine whether you will use a holistic or analytic scoring system before you write your rubric. With the holistic grading system, you evaluate the answer as a whole, rating papers against each other. With the analytic system, you list specific pieces of information and award points for their inclusion.
  • Prepare the essay rubric in advance. Determine what you are looking for and how many points you will be assigning for each aspect of the question.
  • Avoid looking at names. Some teachers have students put numbers on their essays to try and help with this.
  • Score one item at a time. This helps ensure that you use the same thinking and standards for all students.
  • Avoid interruptions when scoring a specific question. Again, consistency will be increased if you grade the same item on all the papers in one sitting.
  • If an important decision like an award or scholarship is based on the score for the essay, obtain two or more independent readers.
  • Beware of negative influences that can affect essay scoring. These include handwriting and writing style bias, the length of the response, and the inclusion of irrelevant material.
  • Review papers that are on the borderline a second time before assigning a final grade.

Rubric Best Practices, Examples, and Templates

A rubric is a scoring tool that identifies the different criteria relevant to an assignment, assessment, or learning outcome and states the possible levels of achievement in a specific, clear, and objective way. Use rubrics to assess project-based student work including essays, group projects, creative endeavors, and oral presentations.

Rubrics can help instructors communicate expectations to students and assess student work fairly, consistently and efficiently. Rubrics can provide students with informative feedback on their strengths and weaknesses so that they can reflect on their performance and work on areas that need improvement.

How to Get Started


Step 1: Analyze the assignment

The first step in the rubric creation process is to analyze the assignment or assessment for which you are creating a rubric. To do this, consider the following questions:

  • What is the purpose of the assignment and your feedback? What do you want students to demonstrate through the completion of this assignment (i.e. what are the learning objectives measured by it)? Is it a summative assessment, or will students use the feedback to create an improved product?
  • Does the assignment break down into different or smaller tasks? Are these tasks as important as the main assignment?
  • What would an “excellent” assignment look like? An “acceptable” assignment? One that still needs major work?
  • How detailed do you want the feedback you give students to be? Do you want/need to give them a grade?

Step 2: Decide what kind of rubric you will use

Types of rubrics: holistic, analytic/descriptive, single-point

Holistic Rubric. A holistic rubric includes all the criteria (such as clarity, organization, mechanics, etc.) to be considered together and included in a single evaluation. With a holistic rubric, the rater or grader assigns a single score based on an overall judgment of the student’s work, using descriptions of each performance level to assign the score.

Advantages of holistic rubrics:

  • Can place an emphasis on what learners can demonstrate rather than what they cannot
  • Save grader time by minimizing the number of evaluations to be made for each student
  • Can be used consistently across raters, provided they have all been trained

Disadvantages of holistic rubrics:

  • Provide less specific feedback than analytic/descriptive rubrics
  • Can be difficult to choose a score when a student’s work is at varying levels across the criteria
  • Weighting of criteria cannot be indicated in the rubric

Analytic/Descriptive Rubric. An analytic or descriptive rubric often takes the form of a table with the criteria listed in the left column and with levels of performance listed across the top row. Each cell contains a description of what the specified criterion looks like at a given level of performance. Each of the criteria is scored individually.

Advantages of analytic rubrics:

  • Provide detailed feedback on areas of strength or weakness
  • Each criterion can be weighted to reflect its relative importance

Disadvantages of analytic rubrics:

  • More time-consuming to create and use than a holistic rubric
  • May not be used consistently across raters unless the cells are well defined
  • May result in giving less personalized feedback

Single-Point Rubric. A single-point rubric breaks down the components of an assignment into different criteria, but instead of describing different levels of performance, only the “proficient” level is described. Feedback space is provided for instructors to give individualized comments to help students improve and/or show where they excelled beyond the proficiency descriptors.

Advantages of single-point rubrics:

  • Easier to create than an analytic/descriptive rubric
  • Perhaps more likely that students will read the descriptors
  • Areas of concern and excellence are open-ended
  • May remove a focus on the grade/points
  • May increase student creativity in project-based assignments

Disadvantage of single-point rubrics: Requires more work for instructors writing feedback
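
To make the three formats concrete, here is a minimal Python sketch of how an analytic rubric produces weighted per-criterion scores while a holistic rubric produces a single overall judgment. The criteria names, weights, and level labels are illustrative assumptions, not taken from any rubric in this article.

```python
# Minimal sketch comparing analytic and holistic scoring.
# Criteria names, weights, and level labels are illustrative assumptions.

analytic_rubric = {
    "thesis":       {"weight": 0.4},
    "organization": {"weight": 0.3},
    "mechanics":    {"weight": 0.3},
}

def analytic_score(ratings):
    """Weighted sum of per-criterion ratings on a 1-4 scale."""
    return sum(analytic_rubric[c]["weight"] * level for c, level in ratings.items())

# Analytic: each criterion is rated separately, then combined with its weight.
print(analytic_score({"thesis": 3, "organization": 4, "mechanics": 2}))  # 3.0

# Holistic: the grader assigns one overall level after reading the whole response.
holistic_levels = {
    4: "insightful, well-organized, nearly error-free",
    3: "clear and competent with minor lapses",
    2: "developing; uneven focus or frequent errors",
    1: "beginning; unclear purpose and serious errors",
}
overall = 3  # a single judgment against the level descriptions above
print(overall, "-", holistic_levels[overall])
```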

Step 3 (Optional): Look for templates and examples.

You might Google “Rubric for persuasive essay at the college level” and see if there are any publicly available examples to start from. Ask your colleagues if they have used a rubric for a similar assignment. Some examples are also available at the end of this article. These rubrics can be a great starting point for you, but work through steps 4, 5, and 6 below to ensure that the rubric matches your assignment description, learning objectives, and expectations.

Step 4: Define the assignment criteria

Make a list of the knowledge and skills you are measuring with the assignment/assessment. Refer to your stated learning objectives, the assignment instructions, past examples of student work, etc. for help.

  Helpful strategies for defining grading criteria:

  • Collaborate with co-instructors, teaching assistants, and other colleagues
  • Brainstorm and discuss with students
  • Evaluate each criterion: Can it be observed and measured? Is it important and essential? Is it distinct from other criteria? Is it phrased in precise, unambiguous language?
  • Revise the criteria as needed
  • Consider whether some are more important than others, and how you will weight them.

Step 5: Design the rating scale

Most rating scales include between 3 and 5 levels. Consider the following questions when designing your rating scale:

  • Given what students are able to demonstrate in this assignment/assessment, what are the possible levels of achievement?
  • How many levels would you like to include? (More levels mean more detailed descriptions.)
  • Will you use numbers and/or descriptive labels for each level of performance? (for example 5, 4, 3, 2, 1 and/or Exceeds expectations, Accomplished, Proficient, Developing, Beginning, etc.)
  • Don’t use too many columns, and recognize that some criteria can have more columns than others. The rubric needs to be comprehensible and organized. Pick the right number of columns so that the criteria flow logically and naturally across levels.

Step 6: Write descriptions for each level of the rating scale

Artificial intelligence tools like ChatGPT can be useful for creating a rubric. You will want to engineer the prompt you provide to the AI assistant to ensure you get what you want. For example, you might provide the assignment description, the criteria you feel are important, and the number of levels of performance you want in your prompt. Use the results as a starting point, and adjust the descriptions as needed.
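
As a sketch of the prompt-engineering idea above, the snippet below assembles a rubric-generation prompt from an assignment description, a list of criteria, and a desired number of performance levels. All of the example values (the assignment text, criteria, and level count) are hypothetical.

```python
# Assemble a rubric-generation prompt for an AI assistant.
# The assignment text, criteria, and level count below are hypothetical examples.

assignment = ("A 1,500-word persuasive essay on a local policy issue, "
              "written for a first-year composition course.")
criteria = ["thesis and argument", "use of evidence", "organization", "grammar and mechanics"]
num_levels = 4

prompt = (
    f"Create an analytic rubric for the following assignment: {assignment}\n"
    f"Use these criteria: {', '.join(criteria)}.\n"
    f"Describe {num_levels} levels of performance for each criterion, "
    "using observable, measurable language and parallel wording across levels."
)

print(prompt)  # paste into the AI tool, then treat the result as a draft to revise
```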

Building a rubric from scratch

For a single-point rubric, describe what would be considered “proficient,” i.e., B-level work. You might also include suggestions for students outside of the actual rubric about how they might surpass proficient-level work.

For analytic and holistic rubrics, create statements of expected performance at each level of the rubric.

  • Consider what descriptor is appropriate for each criterion, e.g., presence vs absence, complete vs incomplete, many vs none, major vs minor, consistent vs inconsistent, always vs never. If you have an indicator described in one level, it will need to be described in each level.
  • You might start with the top/exemplary level. What does it look like when a student has achieved excellence for each/every criterion? Then, look at the “bottom” level. What does it look like when a student has not achieved the learning goals in any way? Then, complete the in-between levels.
  • For an analytic rubric , do this for each particular criterion of the rubric so that every cell in the table is filled. These descriptions help students understand your expectations and their performance in regard to those expectations.

Well-written descriptions:

  • Describe observable and measurable behavior
  • Use parallel language across the scale
  • Indicate the degree to which the standards are met

Step 7: Create your rubric

Create your rubric in a table or spreadsheet in Word, Google Docs, Sheets, etc., and then transfer it by typing it into Moodle. You can also use online tools to create the rubric, but you will still have to type the criteria, indicators, levels, etc., into Moodle. Rubric creators: Rubistar , iRubric

Step 8: Pilot-test your rubric

Prior to implementing your rubric on a live course, obtain feedback from:

  • Teaching assistants

Try out your new rubric on a sample of student work. After you pilot-test your rubric, analyze the results to consider its effectiveness and revise accordingly. Then keep the following general best practices in mind:

  • Limit the rubric to a single page for reading and grading ease
  • Use parallel language. Use similar language and syntax/wording from column to column. Make sure that the rubric can be easily read from left to right or vice versa.
  • Use student-friendly language. Make sure the language is learning-level appropriate. If you use academic language or concepts, you will need to teach those concepts.
  • Share and discuss the rubric with your students. Students should understand that the rubric is there to help them learn, reflect, and self-assess. If students use a rubric, they will understand the expectations and their relevance to learning.
  • Consider scalability and reusability of rubrics. Create rubric templates that you can alter as needed for multiple assignments.
  • Maximize the descriptiveness of your language. Avoid words like “good” and “excellent.” For example, instead of saying, “uses excellent sources,” you might describe what makes a resource excellent so that students will know. You might also consider reducing the reliance on quantity, such as a number of allowable misspelled words. Focus instead, for example, on how distracting any spelling errors are.

Example of an analytic rubric for a final paper

Criterion: Thesis supported by relevant information and ideas
  • Above Average (4): The central purpose of the student work is clear and supporting ideas are always well-focused. Details are relevant and enrich the work.
  • Sufficient (3): The central purpose of the student work is clear and ideas are almost always focused in a way that supports the thesis. Relevant details illustrate the author’s ideas.
  • Developing (2): The central purpose of the student work is identified. Ideas are mostly focused in a way that supports the thesis.
  • Needs Improvement (1): The purpose of the student work is not well-defined. A number of central ideas do not support the thesis. Thoughts appear disconnected.

Criterion: Sequencing of elements/ideas
  • Above Average (4): Information and ideas are presented in a logical sequence which flows naturally and is engaging to the audience.
  • Sufficient (3): Information and ideas are presented in a logical sequence which is followed by the reader with little or no difficulty.
  • Developing (2): Information and ideas are presented in an order that the audience can mostly follow.
  • Needs Improvement (1): Information and ideas are poorly sequenced. The audience has difficulty following the thread of thought.

Criterion: Correctness of grammar and spelling
  • Above Average (4): Minimal to no distracting errors in grammar and spelling.
  • Sufficient (3): The readability of the work is only slightly interrupted by spelling and/or grammatical errors.
  • Developing (2): Grammatical and/or spelling errors distract from the work.
  • Needs Improvement (1): The readability of the work is seriously hampered by spelling and/or grammatical errors.

Example of a holistic rubric for a final paper

  • Above Average (4): The audience is able to easily identify the central message of the work and is engaged by the paper’s clear focus and relevant details. Information is presented logically and naturally. There are minimal to no distracting errors in grammar and spelling.
  • Sufficient (3): The audience is easily able to identify the focus of the student work, which is supported by relevant ideas and supporting details. Information is presented in a logical manner that is easily followed. The readability of the work is only slightly interrupted by errors.
  • Developing (2): The audience can identify the central purpose of the student work with little difficulty, and supporting ideas are present and clear. The information is presented in an orderly fashion that can be followed with little difficulty. Grammatical and spelling errors distract from the work.
  • Needs Improvement (1): The audience cannot clearly or easily identify the central ideas or purpose of the student work. Information is presented in a disorganized fashion, causing the audience to have difficulty following the author’s ideas. The readability of the work is seriously hampered by errors.

Single-Point Rubric

Advanced (evidence of exceeding standards) | Criteria (described at the proficient level) | Concerns (things that need work)
  • Criteria #1: Description reflecting achievement of proficient level of performance
  • Criteria #2: Description reflecting achievement of proficient level of performance
  • Criteria #3: Description reflecting achievement of proficient level of performance
  • Criteria #4: Description reflecting achievement of proficient level of performance
90-100 points | 80-90 points | <80 points

More examples:

  • Single Point Rubric Template (variation)
  • Analytic Rubric Template
  • A Rubric for Rubrics
  • Bank of Online Discussion Rubrics in different formats
  • Mathematical Presentations Descriptive Rubric
  • Math Proof Assessment Rubric
  • Kansas State Sample Rubrics
  • Design Single Point Rubric

Technology Tools: Rubrics in Moodle

  • Moodle Docs: Rubrics
  • Moodle Docs: Grading Guide (use for single-point rubrics)

Tools with rubrics (other than Moodle)

  • Google Assignments
  • Turnitin Assignments: Rubric or Grading Form

Other resources

  • DePaul University (n.d.). Rubrics .
  • Gonzalez, J. (2014). Know your terms: Holistic, Analytic, and Single-Point Rubrics . Cult of Pedagogy.
  • Goodrich, H. (1996). Understanding rubrics. Teaching for Authentic Student Performance, 54(4), 14-17.
  • Miller, A. (2012). Tame the beast: tips for designing and using rubrics.
  • Ragupathi, K., Lee, A. (2020). Beyond Fairness and Consistency in Grading: The Role of Rubrics in Higher Education. In: Sanger, C., Gleason, N. (eds) Diversity and Inclusion in Global Higher Education. Palgrave Macmillan, Singapore.

Structure and Scoring of the Assessment


You'll begin by reading a prose passage of 700-1,000 words. This passage will be about as difficult as the readings in first-year courses at UC Berkeley. You'll have up to two hours to read the passage carefully and write an essay in response to a single topic and related questions based on the passage's content. These questions will generally ask you to read thoughtfully and to provide reasoned, concrete, and developed presentations of a specific point of view. Your essay will be evaluated on the basis of your ability to develop your central idea, to express yourself clearly, and to use the conventions of written English. 

Five Qualities of a Well-Written Essay

There is no "correct" response for the topic, but there are some things readers will look for in a strong, well-written essay.

  • The writer demonstrates that they understood the passage.
  • The writer maintains focus on the task assigned.
  • The writer leads readers to understand a point of view, if not to accept it.
  • The writer develops a central idea and provides specific examples.
  • The writer evaluates the reading passage in light of personal experience and observations, or by testing the author's assumptions against their own.

Scoring is typically completed within three weeks after the assessment date. The readers are UC Berkeley faculty members, primarily from College Writing Programs, though faculty from other related departments, such as English or Comparative Literature, might participate as well.

Your essay will be scored independently by two readers, who will not know your identity. They will measure your essay against a scoring guide. If the two readers have different opinions, then a third reader will assess your essay as well  to help reach a final decision. Each reader will give your essay a score on a scale of 1 (lowest) to 6 (highest). When your two scores are added together, if they are 8 or higher, you satisfy the Entry Level Writing Requirement and may take any 4-unit "R_A" course (first half of the requirement, usually numbered R1A, though sometimes with a different number). If you receive a score less than 8, you should sign up for College Writing R1A, which satisfies both the Entry Level Writing Requirement and the first-semester ("A" part) of the Reading and Composition Requirement.
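
A minimal sketch of the scoring arithmetic described above: each of two independent readers assigns a score from 1 to 6, and a combined score of 8 or higher satisfies the requirement. The passage does not specify exactly when a third reader is brought in or how that score is combined, so the split-score threshold below is an assumption.

```python
# Sketch of the two-reader scoring rule described above.
# The threshold for "different opinions" (sending the essay to a third reader)
# is an assumption; the passage does not define it precisely.

def awpe_result(reader1, reader2):
    """Each reader scores 1 (lowest) to 6 (highest); a combined 8+ satisfies the requirement."""
    if not all(1 <= s <= 6 for s in (reader1, reader2)):
        raise ValueError("scores must be between 1 and 6")
    if abs(reader1 - reader2) > 1:  # assumed definition of a disagreement
        return "send to a third reader to help reach a final decision"
    if reader1 + reader2 >= 8:
        return "satisfies the Entry Level Writing Requirement"
    return "sign up for College Writing R1A"

print(awpe_result(4, 4))  # satisfies the Entry Level Writing Requirement
print(awpe_result(3, 4))  # sign up for College Writing R1A
print(awpe_result(2, 5))  # send to a third reader to help reach a final decision
```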

The Scoring Guide

The Scoring Guide outlines the characteristics typical of essays at six different levels of competence. Readers assign each essay a score according to its main qualities. Readers take into account that the responses were written within a two-hour period of reading and writing, without a longer period of time for drafting and revision.

An essay with a score of 6 may

  • command attention because of its insightful development and mature style.
  • present a cogent response to the text, elaborating that response with well-chosen  examples and persuasive reasoning. 
  • present an organization that reinforces the development of its ideas, which are aptly detailed.
  • show that its writer can usually choose words well, use sophisticated sentences effectively, and observe the conventions of written English. 

An essay with a score of 5 may

  • clearly demonstrate competent writing skill.
  • present a thoughtful response to the text, elaborating that response with appropriate examples and sensible reasoning.
  • present an organization that supports the writer’s ideas, which are developed with greater detail than is typical of an essay scored '4.'
  • have a less fluent and complex style than an essay scored '6,' but show that the writer can usually choose words accurately, vary sentences effectively, and observe the conventions of written English.

An essay with a score of 4 may

  • be just 'satisfactory.'
  • present an adequate response to the text, elaborating that response with sufficient examples and acceptable reasoning.
  • demonstrate an organization that generally supports the writer’s ideas, which are developed with sufficient detail.
  • use examples and reasoning that are less developed than those in '5' essays.
  • show that its writer can usually choose words of sufficient precision, control sentences of reasonable variety, and observe the conventions of written English.

An essay with a score of 3 may

  • be unsatisfactory in one or more of the following ways: it may respond to the text illogically; it may reflect an incomplete understanding of the text or the topic; it may provide insufficient reasoning or lack elaboration with examples, or the examples provided may not be sufficiently detailed to support claims; or it may be inadequately organized.
  • have prose characterized by at least one of the following: frequently imprecise word choice; little sentence variety; occasional major errors in grammar and usage, or frequent minor errors.

An essay with a score of 2 may

  • show weaknesses, ordinarily of several kinds.
  • present a simplistic or inappropriate response to the text, one that may suggest some significant misunderstanding of the text or the topic.
  • use organizational strategies that detract from coherence or provide inappropriate or irrelevant detail.
  • have prose characterized by at least one of the following: simplistic or inaccurate word choice; monotonous or fragmented sentence structure; many repeated errors in grammar and usage.

An essay with a score of 1 may

  • show serious weaknesses.
  • disregard the topic's demands, or it may lack structure or development.
  • have an organization that fails to support the essay’s ideas.
  • be inappropriately brief.
  • have a pattern of errors in word choice, sentence structure, grammar, and usage.


Best Practices for Designing and Grading Exams

Adapted from CRLT Occasional Paper #24: M. E. Piontek (2008), Center for Research on Learning and Teaching.

The most obvious function of assessment methods (such as exams, quizzes, papers, and presentations) is to enable instructors to make judgments about the quality of student learning (i.e., assign grades). However, the method of assessment also can have a direct impact on the quality of student learning. Students assume that the focus of exams and assignments reflects the educational goals most valued by an instructor, and they direct their learning and studying accordingly (McKeachie & Svinicki, 2006). General grading systems can have an impact as well. For example, a strict bell curve (i.e., norm-referenced grading) has the potential to dampen motivation and cooperation in a classroom, while a system that strictly rewards proficiency (i.e., criterion-referenced grading) could be perceived as contributing to grade inflation. Given the importance of assessment for both faculty and student interactions about learning, how can instructors develop exams that provide useful and relevant data about their students' learning and also direct students to spend their time on the important aspects of a course or course unit? How do grading practices further influence this process?

Guidelines for Designing Valid and Reliable Exams

Ideally, effective exams have four characteristics. They are:

  • Valid (providing useful information about the concepts they were designed to test),
  • Reliable (allowing consistent measurement and discriminating between different levels of performance),
  • Recognizable (instruction has prepared students for the assessment), and
  • Realistic (concerning time and effort required to complete the assignment) (Svinicki, 1999).

Most importantly, exams and assignments should focus on the most important content and behaviors emphasized during the course (or particular section of the course). What are the primary ideas, issues, and skills you hope students learn during a particular course/unit/module? These are the learning outcomes you wish to measure. For example, if your learning outcome involves memorization, then you should assess for memorization or classification; if you hope students will develop problem-solving capacities, your exams should focus on assessing students’ application and analysis skills. As a general rule, assessments that focus too heavily on details (e.g., isolated facts, figures, etc.) “will probably lead to better student retention of the footnotes at the cost of the main points” (Halpern & Hakel, 2003, p. 40). As noted in Table 1, each type of exam item may be better suited to measuring some learning outcomes than others, and each has its advantages and disadvantages in terms of ease of design, implementation, and scoring.

Table 1: Advantages and Disadvantages of Commonly Used Types of Achievement Test Items

True-False
  • Advantages: Many items can be administered in a relatively short time. Moderately easy to write; easily scored.
  • Disadvantages: Limited primarily to testing knowledge of information. Easy to guess correctly on many items, even if material has not been mastered.

Multiple-Choice
  • Advantages: Can be used to assess broad range of content in a brief period. Skillfully written items can measure higher order cognitive skills. Can be scored quickly.
  • Disadvantages: Difficult and time consuming to write good items. Possible to assess higher order cognitive skills, but most items assess only knowledge. Some correct answers can be guesses.

Matching
  • Advantages: Items can be written quickly. A broad range of content can be assessed. Scoring can be done efficiently.
  • Disadvantages: Higher order cognitive skills are difficult to assess.

Short-Answer/Completion
  • Advantages: Many can be administered in a brief amount of time. Relatively efficient to score. Moderately easy to write.
  • Disadvantages: Difficult to identify defensible criteria for correct answers. Limited to questions that can be answered or completed in very few words.

Essay
  • Advantages: Can be used to measure higher order cognitive skills. Relatively easy to write questions. Difficult for respondent to get correct answer by guessing.
  • Disadvantages: Time consuming to administer and score. Difficult to identify reliable criteria for scoring. Only a limited range of content can be sampled during any one testing period.

Adapted from Table 10.1 of Worthen, et al., 1993, p. 261.

General Guidelines for Developing Multiple-Choice and Essay Questions

The following sections highlight general guidelines for developing multiple-choice and essay questions, which are often used in college-level assessment because they readily lend themselves to measuring higher order thinking skills  (e.g., application, justification, inference, analysis and evaluation).  Yet instructors often struggle to create, implement, and score these types of questions (McMillan, 2001; Worthen, et al., 1993).

Multiple-choice questions have a number of advantages. First, they can measure various kinds of knowledge, including students' understanding of terminology, facts, principles, methods, and procedures, as well as their ability to apply, interpret, and justify. When carefully designed, multiple-choice items also can assess higher-order thinking skills.

Multiple-choice questions are less ambiguous than short-answer items, thereby providing a more focused assessment of student knowledge. Multiple-choice items are superior to true-false items in several ways: on true-false items, students can receive credit for knowing that a statement is incorrect, without knowing what is correct. Multiple-choice items offer greater reliability than true-false items as the opportunity for guessing is reduced with the larger number of options. Finally, an instructor can diagnose misunderstanding by analyzing the incorrect options chosen by students.
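
To put a quick number on the guessing point (an illustrative calculation, not from the source): a blind guess succeeds on about half of true-false items but only about a quarter of four-option multiple-choice items.

```python
# Expected number of correct answers from blind guessing on a 20-item test.
# The item counts are illustrative only.
items = 20
p_true_false = 1 / 2       # two options per item
p_four_option_mc = 1 / 4   # one correct answer among four options

print(f"True-false, pure guessing:     about {items * p_true_false:.0f} of {items} correct")
print(f"Four-option MC, pure guessing: about {items * p_four_option_mc:.0f} of {items} correct")
```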

A disadvantage of multiple-choice items is that they require developing incorrect, yet plausible, options that can be difficult to create. In addition, multiple-choice questions do not allow instructors to measure students’ ability to organize and present ideas. Finally, because it is much easier to create multiple-choice items that test recall and recognition rather than higher order thinking, multiple-choice exams run the risk of not assessing the deep learning that many instructors consider important (Gronlund & Linn, 1990; McMillan, 2001).

Guidelines for writing multiple-choice items include advice about stems, correct answers, and distractors (McMillan, 2001, p. 150; Piontek, 2008):

  • Stems pose the problem or question.
  • Is the stem stated as clearly, directly, and simply as possible?
  • Is the problem described fully in the stem?
  • Is the stem stated positively, to avoid the possibility that students will overlook terms like “no,” “not,” or “least”?
  • Does the stem provide only information relevant to the problem?

Possible responses include the correct answer and distractors, or the incorrect choices. Multiple-choice questions usually have at least three distractors.

  • Are the distractors plausible to students who do not know the correct answer?
  • Is there only one correct answer?
  • Are all the possible answers parallel with respect to grammatical structure, length, and complexity?
  • Are the options short?
  • Are complex options avoided? Are options placed in logical order?
  • Are correct answers spread equally among all the choices? (For example, is answer “A” correct about the same number of times as options “B” or “C” or “D”)?
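
The final check in the list above, whether correct answers are spread roughly evenly across the options, is easy to automate. Below is a small sketch that assumes the answer key is simply a list of option letters; the key itself is hypothetical.

```python
from collections import Counter

# Hypothetical answer key for a 12-item multiple-choice exam.
answer_key = ["A", "C", "B", "D", "C", "A", "B", "D", "C", "A", "D", "B"]

counts = Counter(answer_key)
print("Correct-answer distribution:", dict(counts))

# Flag any option that is the correct answer noticeably more or less often than average.
expected = len(answer_key) / len(counts)
skewed = [option for option, n in counts.items() if abs(n - expected) > 1]
print("Options to rebalance:", skewed or "none; answers are spread roughly evenly")
```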

An example of good multiple-choice questions that assess higher-order thinking skills is the following test question from pharmacy (Park, 2008):

Patient WC was admitted for third-degree burns over 75% of his body. The attending physician asks you to start this patient on antibiotic therapy. Which one of the following is the best reason why WC would need antibiotic prophylaxis?
a. His burn injuries have broken down the innate immunity that prevents microbial invasion.
b. His injuries have inhibited his cellular immunity.
c. His injuries have impaired antibody production.
d. His injuries have induced the bone marrow, thus activated immune system.

A second question builds on the first by describing the patient’s labs two days later, asking the students to develop an explanation for the subsequent lab results. (See Piontek, 2008 for the full question.)

Essay questions can tap complex thinking by requiring students to organize and integrate information, interpret information, construct arguments, give explanations, evaluate the merit of ideas, and carry out other types of reasoning  (Cashin, 1987; Gronlund & Linn, 1990; McMillan, 2001; Thorndike, 1997; Worthen, et al., 1993). Restricted response essay questions are good for assessing basic knowledge and understanding and generally require a brief written response (e.g., “State two hypotheses about why birds migrate.  Summarize the evidence supporting each hypothesis” [Worthen, et al., 1993, p. 277].) Extended response essay items allow students to construct a variety of strategies, processes, interpretations and explanations for a question, such as the following:

The framers of the Constitution strove to create an effective national government that balanced the tension between majority rule and the rights of minorities. What aspects of American politics favor majority rule? What aspects protect the rights of those not in the majority? Drawing upon material from your readings and the lectures, did the framers successfully balance this tension? Why or why not? (Shipan, 2008).

In addition to measuring complex thinking and reasoning, advantages of essays include the potential for motivating better study habits and providing the students flexibility in their responses.  Instructors can evaluate how well students are able to communicate their reasoning with essay items, and they are usually less time consuming to construct than multiple-choice items that measure reasoning.

The major disadvantages of essays include the amount of time instructors must devote to reading and scoring student responses, and the importance of developing and using carefully constructed criteria/rubrics to ensure reliability of scoring. Essays can assess only a limited amount of content in one testing period/exam due to the length of time required for students to respond to each essay item. As a result, essays do not provide a good sampling of content knowledge across a curriculum (Gronlund & Linn, 1990; McMillan, 2001).

Guidelines for writing essay questions include the following (Gronlund & Linn, 1990; McMillan, 2001; Worthen, et al., 1993):

  • Restrict the use of essay questions to educational outcomes that are difficult to measure using other formats. For example, to test recall knowledge, true-false, fill-in-the-blank, or multiple-choice questions are better measures.
  • Phrase the question to elicit the intended reasoning skill, for example:
      Generalizations: State a set of principles that can explain the following events.
      Synthesis: Write a well-organized report that shows…
      Evaluation: Describe the strengths and weaknesses of…
  • Write the question clearly so that students do not feel that they are guessing at “what the instructor wants me to do.”
  • Indicate the amount of time and effort students should spend on each essay item.
  • Avoid giving students options for which essay questions they should answer. This choice decreases the validity and reliability of the test because each student is essentially taking a different exam.
  • Consider using several narrowly focused questions (rather than one broad question) that elicit different aspects of students’ skills and knowledge.
  • Make sure there is enough time to answer the questions.

Guidelines for scoring essay questions include the following (Gronlund & Linn, 1990; McMillan, 2001; Wiggins, 1998; Worthen, et al., 1993; Writing and grading essay questions , 1990):

  • Outline what constitutes an expected answer.
  • Select an appropriate scoring method based on the criteria. A rubric is a scoring key that indicates the criteria for scoring and the number of points to be assigned for each criterion. A sample rubric for a take-home history exam question might look like the following:

  • Number of references to class reading sources: 0-2 references | 3-5 references | 6+ references
  • Historical accuracy: Lots of inaccuracies | Few inaccuracies | No apparent inaccuracies
  • Historical argument: No argument made; little evidence for argument | Argument is vague and unevenly supported by evidence | Argument is clear and well-supported by evidence
  • Proofreading: Many grammar and spelling errors | Few (1-2) grammar or spelling errors | No grammar or spelling errors

For other examples of rubrics, see CRLT Occasional Paper #24  (Piontek, 2008).
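
As a sketch of how the sample history rubric above could be converted into a score, the snippet below assigns 1 to 3 points per criterion, one point per column from left to right. The point values are an assumption for illustration; the source presents the levels without attaching points.

```python
# Scoring sketch for the sample take-home history rubric above.
# Assumption: each criterion earns 1-3 points, one point per column from left to right.

criteria = [
    "Number of references to class reading sources",
    "Historical accuracy",
    "Historical argument",
    "Proofreading",
]

def score_essay(column_choices):
    """column_choices maps each criterion to a column index: 0 (leftmost) to 2 (rightmost)."""
    return sum(column + 1 for column in column_choices.values())

essay = {
    "Number of references to class reading sources": 2,  # 6+ references
    "Historical accuracy": 1,                             # few inaccuracies
    "Historical argument": 2,                             # clear and well-supported
    "Proofreading": 1,                                    # few (1-2) errors
}
print(score_essay(essay), "out of", 3 * len(criteria))  # 10 out of 12
```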

  • Clarify the role of writing mechanics and other factors independent of the educational outcomes being measured. For example, how does grammar or use of scientific notation figure into your scoring criteria?
  • Create anonymity for students’ responses while scoring and create a random order in which tests are graded (e.g., shuffle the pile) to increase accuracy of the scoring.
  • Use a systematic process for scoring each essay item.  Assessment guidelines suggest scoring all answers for an individual essay question in one continuous process, rather than scoring all answers to all questions for an individual student. This system makes it easier to remember the criteria for scoring each answer.

You can also use these guidelines for scoring essay items to create grading processes and rubrics for students’ papers, oral presentations, course projects, and websites.  For other grading strategies, see Responding to Student Writing – Principles & Practices and Commenting Effectively on Student Writing .

Cashin, W. E. (1987). Improving essay tests . Idea Paper, No. 17. Manhattan, KS: Center for Faculty Evaluation and Development, Kansas State University.

Gronlund, N. E., & Linn, R. L. (1990). Measurement and evaluation in teaching   (6th  ed.). New  York:  Macmillan Publishing Company.

Halpern, D. H., & Hakel, M. D. (2003). Applying the science of learning to the university and beyond. Change, 35 (4), 37-41.

McKeachie, W. J., & Svinicki, M. D. (2006). Assessing, testing, and evaluating: Grading is not the most important function.   In   McKeachie's   Teaching tips: Strategies, research, and theory for college and university teachers (12th ed., pp. 74-86). Boston: Houghton Mifflin Company.

McMillan, J. H. (2001).  Classroom assessment: Principles and practice for effective instruction.  Boston: Allyn and Bacon.

Park, J. (2008, February 4). Personal communication. University of Michigan College of Pharmacy.

Piontek, M. (2008). Best practices for designing and grading exams. CRLT Occasional Paper No. 24. Ann Arbor, MI: Center for Research on Learning and Teaching.

Shipan, C. (2008, February 4). Personal communication. University of Michigan Department of Political Science.

Svinicki, M.   D.   (1999a). Evaluating and grading students.  In Teachers and students: A sourcebook for UT- Austin faculty (pp. 1-14). Austin, TX: Center for Teaching Effectiveness, University of Texas at Austin.

Thorndike, R. M. (1997). Measurement and evaluation in psychology and education.   Upper Saddle River, NJ: Prentice-Hall, Inc.

Wiggins, G. P. (1998). Educative assessment: Designing assessments to inform and improve student performance . San Francisco: Jossey-Bass Publishers.

Worthen, B.  R., Borg, W.  R.,  & White, K.  R.  (1993). Measurement and evaluation in the schools .  New York: Longman.

Writing and grading essay questions. (1990, September). For Your Consideration , No. 7. Chapel Hill, NC: Center for Teaching and Learning, University of North Carolina at Chapel Hill.


HOWTO: 3 Easy Steps to Grading Student Essays

by Susan Verner

What criteria does the writing teacher use to evaluate the work of his or her students? After all, with essay writing you cannot simply mark some answers correct and others incorrect and figure out a percentage. The good news is that a rubric can make the process much more straightforward.

A rubric is a chart used in grading essays, special projects and other items which can be more subjective. It lists each of the grading criteria separately and defines the different performance levels within those criteria. Standardized tests like the SAT use rubrics to score writing samples, and designing one for your own use is easy if you take it step by step. Keep in mind that when you are using a rubric to grade essays, you can design one rubric for use throughout the semester or modify your rubric as the expectations you have for your students increase.

The essay should have good grammar and show the right level of vocabulary. It should be organized, and the content should be appropriate and effective. Teachers also look at the overall effectiveness of the piece. When evaluating specific writing samples, you may also want to include other criteria for the essay based on material you have covered in class. You may choose to grade on the type of essay they have written and whether your students have followed the specific directions you gave. You may want to evaluate their use of information and whether they correctly presented the content material you taught. When you write your own rubric, you can evaluate anything you think is important when it comes to your students’ writing abilities.

Using three criteria (grammar, organization, and overall effect), we will write a rubric to evaluate students’ essays. The most straightforward evaluation uses a four-point scale for each of the criteria. Taking the criteria one at a time, articulate what your expectations are for an 'A' paper, a 'B' paper and so on. Taking grammar as an example, an 'A' paper would be free of most grammatical errors appropriate for the student’s language learning level. A 'B' paper would have some mistakes but use generally good grammar. A 'C' paper would show frequent grammatical errors. A 'D' paper would show that the student did not have the grammatical knowledge appropriate for his language learning level. Taking these definitions, we now put them into the rubric.


The next step is to take each of the other criteria and define success for each of those, assigning a value to A, B, C and D papers. Those definitions then go into the rubric in the appropriate locations to complete the chart.

Each of the criteria will score points for the essay. The descriptions in the first column are each worth 4 points, the second column 3 points, the third 2 points and the fourth 1 point.

What is the grading process?

Now that your criteria are defined, grading the essay is easy. When grading a student essay with a rubric, it is best to read through the essay once before evaluating for grades. Then, reading through the piece a second time, determine where on the scale the writing sample falls for each of the criteria. If the student shows excellent grammar, good organization and a good overall effect, he would score a total of ten points. Divide that by the total number of criteria, three in this case, and he finishes with a 3.33, which on a four-point scale is a B+. If you use five criteria to evaluate your essays, divide the total points scored by five to determine the student’s grade.
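
A minimal sketch of the arithmetic in that example: three criterion scores on the four-point scale are averaged and mapped to a letter grade. The article states only that 3.33 corresponds to a B+ on a four-point scale, so the other cutoffs below are assumptions.

```python
# Three criterion scores on the rubric's four-point column values (4, 3, 2, or 1).
scores = {"grammar": 4, "organization": 3, "overall effect": 3}

average = sum(scores.values()) / len(scores)
print(f"Total: {sum(scores.values())}, average: {average:.2f}")  # Total: 10, average: 3.33

# The article equates 3.33 with a B+; the remaining cutoffs are assumptions for illustration.
def letter_grade(avg):
    if avg >= 3.7: return "A range"
    if avg >= 3.3: return "B+"
    if avg >= 3.0: return "B"
    if avg >= 2.0: return "C range"
    return "D or below"

print(letter_grade(average))  # B+
```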

Once you have written your grading rubric, you may decide to share your criteria with your students.

If you do, they will know exactly what your expectations are and what they need to accomplish to get the grade they desire. You may even choose to make a copy of the rubric for each paper and circle where the student lands for each criterion. That way, each person knows where he needs to focus his attention to improve his grade. The clearer your expectations are and the more feedback you give your students, the more successful your students will be. If you use a rubric in your essay grading, you can communicate those standards as well as make your grading more objective with more practical suggestions for your students. In addition, once you write your rubric you can use it for all future evaluations.

Essay Exams

Essay exams provide opportunities to evaluate students’ reasoning skills such as the ability to compare and contrast concepts, justify a position on a topic, interpret cases from the perspective of different theories or models, evaluate a claim or assertion with evidence, design an experiment, and other higher level cognitive skills. They can reveal if students understand the theory behind course material or how different concepts and theories relate to each other. 

Advantages and Challenges of Essay Exams

Advantages:

  • Can be used to measure higher order cognitive skills
  • Take relatively less time to write questions
  • Difficult for respondents to get correct answers by guessing

Challenges:

  • Can be time consuming to administer and to score
  • Can be challenging to identify measurable, reliable criteria for assessing student responses
  • Limited range of content can be sampled during any one testing period
  • Timed exams in general add stress unrelated to a student's mastery of the material

Creating an Essay Exam

  • Limit the use of essay questions to learning aims that require learners to share their thinking processes, connect and analyze information, and communicate their understanding for a specific purpose. 
  • Write each item so that students clearly understand the specific task and what deliverables are required for a complete answer (e.g. diagram, amount of evidence, number of examples).
  • Indicate the relative amount of time and effort students should spend on each essay item, for example “2 – 3 sentences should suffice for this question”.
  • Consider using several narrowly focused items rather than one broad item.
  • Consider offering students choice among essay questions, while ensuring that all learning aims are assessed.

When designing essay exams, consider the reasoning skills you want to assess in your students. The following table lists different skills to measure with example prompts to guide assessment questions. 

Table from Piontek, 2008. Skills to assess include comparing, relating cause and effect, justifying, summarizing, generalizing, inferring, classifying, creating, applying, analyzing, and synthesizing; see Piontek (2008) for example question stems that target each skill.

Preparing Students for an Essay Exam

Adapted from Piontek, 2008

Prior to the essay exam

  • Administer a formative assessment that asks students to do a brief write on a question similar to one you will use on an exam and provide them with feedback on their responses.
  • Provide students with examples of essay responses that do and do not meet your criteria and standards. 
  • Provide students with the learning aims they will be responsible for mastering to help them focus their preparation appropriately.
  • Have students apply the scoring rubric to sample essay responses and provide them with feedback on their work.

Resource video: a 2-minute video description of a formative assessment that helps prepare students for an essay exam.

Administering an Essay Exam

  • Provide adequate time for students to take the assessment. A strategy some instructors use is to time themselves answering the exam questions completely and then multiply that time by 3-4.
  • Endeavor to create a distraction-free environment.
  • Review the suggestions for informal accommodations for multilingual learners, which may be helpful in setting up an essay exam for all learners.

Grading an Essay Exam

To ensure essays are graded fairly and without bias:

  • Outline what constitutes an acceptable answer (criteria for knowledge and skills).
  • Select an appropriate scoring method based on the criteria.
  • Clarify the role of writing mechanics and other factors independent of the learning aims being measured.
  • Share the criteria with students ahead of time.
  • Use a systematic process for scoring each essay item. For instance, score all responses to a single question in one sitting.
  • Anonymize student work (if possible) to ensure fairer and more objective feedback. For example, students could use their student ID number in place of their name.

References & Resources

  • For more information on setting criteria, preparing students, and grading essay exams, read: Boye, A. (2019). Writing Better Essay Exams, IDEA Paper #76.
  • For more detailed descriptions of how to develop and score essay exams, read: Piontek, M.E. (2008). Best Practices for Designing and Grading Exams, CRLT Occasional Paper #24.

Web resources

  • Designing Effective Writing Assignments  (Teaching with Writing Program - UMNTC ) 
  • Writing Assignment Checklist (Teaching with Writing Program - UMNTC)
  • Designing and Using Rubrics (Center for Writing - UMTC)


Evaluation Criteria for Formal Essays

Katherine Milligan

Please note that these four categories are interdependent. For example, if your evidence is weak, this will almost certainly affect the quality of your argument and organization. Likewise, if you have difficulty with syntax, it is to be expected that your transitions will suffer. In revision, therefore, take a holistic approach to improving your essay, rather than focusing exclusively on one aspect.

An excellent paper:

Argument: The paper knows what it wants to say and why it wants to say it. It goes beyond pointing out comparisons to using them to change the reader's vision.
Organization: Every paragraph supports the main argument in a coherent way, and clear transitions point out why each new paragraph follows the previous one.
Evidence: Concrete examples from texts support general points about how those texts work. The paper provides the source and significance of each piece of evidence.
Mechanics: The paper uses correct spelling and punctuation. In short, it generally exhibits a good command of academic prose.

A mediocre paper:

  • Argument: The paper replaces an argument with a topic, giving a series of related observations without suggesting a logic for their presentation or a reason for presenting them.
  • Organization: The observations of the paper are listed rather than organized. Often, this is a symptom of a problem in argument, as the framing of the paper has not provided a path for evidence to follow.
  • Evidence: The paper offers very little concrete evidence, instead relying on plot summary or generalities to talk about a text. If concrete evidence is present, its origin or significance is not clear.
  • Mechanics: The paper contains frequent errors in syntax, agreement, pronoun reference, and/or punctuation.

An appallingly bad paper:

  • Argument: The paper lacks even a consistent topic, providing a series of largely unrelated observations.
  • Organization: The observations are listed rather than organized, and some of them do not appear to belong in the paper at all. Both paper and paragraphs lack coherence.
  • Evidence: The paper offers no concrete evidence from the texts or misuses a little evidence.
  • Mechanics: The paper contains constant and glaring errors in syntax, agreement, reference, spelling, and/or punctuation.


Improving Your Test Questions


  • Choosing between Objective and Subjective Test Items
  • Suggestions for Using and Writing Test Items: Multiple-Choice, True-False, Matching, Completion, Essay, Problem Solving, and Performance Test Items
  • Two Methods for Assessing Test Item Quality
  • Assistance Offered by The Center for Innovation in Teaching and Learning (CITL)
  • References for Further Reading

I. Choosing Between Objective and Subjective Test Items

There are two general categories of test items: (1) objective items, which require students to select the correct response from several alternatives or to supply a word or short phrase to answer a question or complete a statement; and (2) subjective or essay items, which permit the student to organize and present an original answer. Objective items include multiple-choice, true-false, matching and completion, while subjective items include short-answer essay, extended-response essay, problem solving and performance test items. For some instructional purposes one or the other item type may prove more efficient and appropriate. To begin our discussion of the relative merits of each type of test item, test your knowledge of these two item types by answering the following questions.

(circle the correct answer)
1. Essay exams are easier to construct than objective exams. (T / F)
2. Essay exams require more thorough student preparation and study time than objective exams. (T / F)
3. Essay exams require writing skills where objective exams do not. (T / F)
4. Essay exams teach a person how to write. (T / F)
5. Essay exams are more subjective in nature than are objective exams. (T / F)
6. Objective exams encourage guessing more so than essay exams. (T / F)
7. Essay exams limit the extent of content covered. (T / F)
8. Essay and objective exams can be used to measure the same content or ability. (T / F)
9. Essay and objective exams are both good ways to evaluate a student's level of knowledge. (T / F)

Quiz Answers

1. TRUE. Essay items are generally easier and less time consuming to construct than are most objective test items. Technically correct and content-appropriate multiple-choice and true-false test items require an extensive amount of time to write and revise. For example, a professional item writer produces only 9-10 good multiple-choice items in a day's time.
2. ? (Undetermined). According to research findings, it is still undetermined whether or not essay tests require or facilitate more thorough (or even different) student study preparation.
3. TRUE. Writing skills do affect a student's ability to communicate the correct "factual" information through an essay response. Consequently, students with good writing skills have an advantage over students who have difficulty expressing themselves through writing.
4. FALSE. Essays do not teach a student how to write, but they can emphasize the importance of being able to communicate through writing. Constant use of essay tests may encourage the knowledgeable but poor-writing student to improve his/her writing ability in order to improve performance.
5. TRUE. Essays are more subjective in nature due to their susceptibility to scoring influences. Different readers can rate identical responses differently, the same reader can rate the same paper differently over time, the handwriting, neatness or punctuation can unintentionally affect a paper's grade, and the lack of anonymity can affect the grading process. While impossible to eliminate, scoring influences or biases can be minimized through procedures discussed later in this guide.
6. ? (Undetermined). Both item types encourage some form of guessing. Multiple-choice, true-false and matching items can be correctly answered through blind guessing, yet essay items can be responded to satisfactorily through well-written bluffing.
7. TRUE. Due to the extent of time required by the student to respond to an essay question, only a few essay questions can be included on a classroom exam. Consequently, a larger number of objective items can be tested in the same amount of time, thus enabling the test to cover more content.
8. TRUE. Both item types can measure similar content or learning objectives. Research has shown that students respond almost identically to essay and objective test items covering the same content. Studies by Sax & Collet (1968) and Paterson (1926), conducted forty-two years apart, reached the same conclusion:
"...there seems to be no escape from the conclusions that the two types of exams are measuring identical things" (Paterson, 1926, p. 246).
This conclusion should not be surprising; after all, a well-written essay item requires that the student (1) have a store of knowledge, (2) be able to relate facts and principles, and (3) be able to organize such information into a coherent and logical written expression, whereas an objective test item requires that the student (1) have a store of knowledge, (2) be able to relate facts and principles, and (3) be able to organize such information into a coherent and logical choice among several alternatives.
9. TRUE. Both objective and essay test items are good devices for measuring student achievement. However, as seen in the previous quiz answers, there are particular measurement situations where one item type is more appropriate than the other. Following is a set of recommendations for using either objective or essay test items (adapted from Robert L. Ebel, Essentials of Educational Measurement, 1972, p. 144).

Sax, G., & Collet, L. S. (1968). An empirical comparison of the effects of recall and multiple-choice tests on student achievement. Journal of Educational Measurement, 5(2), 169–173. doi:10.1111/j.1745-3984.1968.tb00622.x

Paterson, D. G. (1926). Do new and old type examinations measure different mental functions? School and Society, 24, 246–248.

When to Use Essay or Objective Tests

Essay tests are especially appropriate when:

  • the group to be tested is small and the test is not to be reused.
  • you wish to encourage and reward the development of student skill in writing.
  • you are more interested in exploring the student's attitudes than in measuring his/her achievement.
  • you are more confident of your ability as a critical and fair reader than as an imaginative writer of good objective test items.

Objective tests are especially appropriate when:

  • the group to be tested is large and the test may be reused.
  • highly reliable test scores must be obtained as efficiently as possible.
  • impartiality of evaluation, absolute fairness, and freedom from possible test scoring influences (e.g., fatigue, lack of anonymity) are essential.
  • you are more confident of your ability to express objective test items clearly than of your ability to judge essay test answers correctly.
  • there is more pressure for speedy reporting of scores than for speedy test preparation.

Either essay or objective tests can be used to:

  • measure almost any important educational achievement a written test can measure.
  • test understanding and ability to apply principles.
  • test ability to think critically.
  • test ability to solve problems.
  • test ability to select relevant facts and principles and to integrate them toward the solution of complex problems. 

In addition to the preceding suggestions, it is important to realize that certain item types are better suited than others for measuring particular learning objectives. For example, learning objectives requiring the student to demonstrate or to show may be better measured by performance test items, whereas objectives requiring the student to explain or to describe may be better measured by essay test items. The matching of learning objective expectations with certain item types can help you select an appropriate kind of test item for your classroom exam as well as provide a higher degree of test validity (i.e., testing what is supposed to be tested). To further illustrate, several sample learning objectives and appropriate test items are provided below.

  • Learning objective: The student will be able to categorize and name the parts of the human skeletal system. Most suitable item: Objective Test Item (M-C, T-F, Matching)
  • Learning objective: The student will be able to critique and appraise another student's English composition on the basis of its organization. Most suitable item: Essay Test Item (Extended-Response)
  • Learning objective: The student will demonstrate safe laboratory skills. Most suitable item: Performance Test Item
  • Learning objective: The student will be able to cite four examples of satire that Twain uses in ____. Most suitable item: Essay Test Item (Short-Answer)

After you have decided to use an objective exam, an essay exam, or both, the next step is to select the kind(s) of objective or essay item that you wish to include on the exam. To help you make such a choice, the different kinds of objective and essay items are presented in the following section. The various kinds of items are briefly described and compared to one another in terms of their advantages and limitations for use. Also presented is a set of general suggestions for the construction of each item variation.

II. Suggestions for Using and Writing Test Items

The multiple-choice item consists of two parts: (a) the stem, which identifies the question or problem and (b) the response alternatives. Students are asked to select the one alternative that best completes the statement or answers the question. For example:

Sample Multiple-Choice Item

[Sample item omitted; in the source, the stem is labeled (a), the response alternatives are labeled (b), and the correct response is marked with an asterisk.]

Advantages in Using Multiple-Choice Items

Multiple-choice items can provide...

  • versatility in measuring all levels of cognitive ability.
  • highly reliable test scores.
  • scoring efficiency and accuracy.
  • objective measurement of student achievement or ability.
  • a wide sampling of content or objectives.
  • a reduced guessing factor when compared to true-false items.
  • different response alternatives which can provide diagnostic feedback.

Limitations in Using Multiple-Choice Items

Multiple-choice items...

  • are difficult and time consuming to construct.
  • lead an instructor to favor simple recall of facts.
  • place a high degree of dependence on the student's reading ability and instructor's writing ability.

Suggestions For Writing Multiple-Choice Test Items

1. When possible, state the stem as a direct question rather than as an incomplete statement.
2. Present a definite, explicit and singular question or problem in the stem.
3. Eliminate excessive verbiage or irrelevant information from the stem.
4. Include in the stem any word(s) that might otherwise be repeated in each alternative.
5. Use negatively stated stems sparingly. When used, underline and/or capitalize the negative word.

Item Alternatives

6. Make all alternatives plausible and attractive to the less knowledgeable or skillful student.
7. Make the alternatives grammatically parallel with each other, and consistent with the stem.
8. Make the alternatives mutually exclusive.
9. When possible, present alternatives in some logical order (e.g., chronological, most to least, alphabetical).
10. Be sure there is only one correct or best response to the item.
11. Make alternatives approximately equal in length.
12. Avoid irrelevant clues such as grammatical structure, well known verbal associations or connections between stem and answer.

13. Use at least four alternatives for each item to lower the probability of getting the item correct by guessing; a short worked illustration of the guessing odds follows these suggestions.

14. Randomly distribute the correct response among the alternative positions throughout the test having approximately the same proportion of alternatives a, b, c, d and e as the correct response.

15. Use the alternatives "none of the above" and "all of the above" sparingly. When used, such alternatives should occasionally be used as the correct response.
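Suggestion 13 turns on simple probability: with k alternatives, a blind guess has a 1/k chance of being correct, so the expected score from guessing alone on an n-item test is n/k. The short sketch below is not part of the original guide (the 50-item test length is arbitrary); it only illustrates how the guessing factor shrinks as alternatives are added, and why a two-choice (true-false) item carries the highest guessing factor.

```python
# Minimal sketch: expected score from blind guessing as the number of
# alternatives per item changes. Two alternatives is equivalent to a
# true-false item; four or five alternatives cut the guessing factor sharply.

def expected_guessing_score(num_items: int, num_alternatives: int) -> float:
    """Expected number of items answered correctly by pure guessing."""
    return num_items / num_alternatives

for k in (2, 3, 4, 5):
    expected = expected_guessing_score(num_items=50, num_alternatives=k)
    print(f"{k} alternatives: {100 / k:.0f}% per item, "
          f"about {expected:.1f} of 50 items by chance")
```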

A true-false item can be written in one of three forms: simple, complex, or compound. Answers can consist of only two choices (simple), more than two choices (complex), or two choices plus a conditional completion response (compound). An example of each type of true-false item follows:

Sample True-False Item: Simple

The acquisition of morality is a developmental process. (True / False)

Sample True-False Item: Complex

[Sample item omitted.]

Sample True-False Item: Compound

The acquisition of morality is a developmental process. (True / False)
[Conditional completion portion omitted.]

Advantages In Using True-False Items

True-False items can provide...

  • the widest sampling of content or objectives per unit of testing time.
  • an objective measurement of student achievement or ability.

Limitations In Using True-False Items

True-false items...

  • incorporate an extremely high guessing factor. For simple true-false items, each student has a 50/50 chance of correctly answering the item without any knowledge of the item's content.
  • can often lead an instructor to write ambiguous statements due to the difficulty of writing statements which are unequivocally true or false.
  • do not discriminate between students of varying ability as well as other item types.
  • can often include more irrelevant clues than do other item types.
  • can often lead an instructor to favor testing of trivial knowledge.

Suggestions For Writing True-False Test Items

1.  Base true-false items upon statements that are absolutely true or false, without qualifications or exceptions.
2.  Express the item statement as simply and as clearly as possible.
3.  Express a single idea in each test item.
4.  Include enough background information and qualifications so that the ability to respond correctly to the item does not depend on some special, uncommon knowledge.
5.  Avoid lifting statements from the text, lecture or other materials so that memory alone will not permit a correct answer.
6.  Avoid using negatively stated item statements.
7.  Avoid the use of unfamiliar vocabulary.
8.  Avoid the use of specific determiners which would permit a test-wise but unprepared examinee to respond correctly. Specific determiners refer to sweeping terms like "all," "always," "none," "never," "impossible," "inevitable," etc. Statements including such terms are likely to be false. On the other hand, statements using qualifying determiners such as "usually," "sometimes," "often," etc., are likely to be true. When statements do require the use of specific determiners, make sure they appear in both true and false items.
9.  False items tend to discriminate more highly than true items. Therefore, use more false items than true items (but no more than 15% additional false items).

In general, matching items consist of a column of stimuli presented on the left side of the exam page and a column of responses placed on the right side of the page. Students are required to match the response associated with a given stimulus. For example:

Sample Matching Test Item

Advantages In Using Matching Items

Matching items...

  • require short periods of reading and response time, allowing you to cover more content.
  • provide objective measurement of student achievement or ability.
  • provide highly reliable test scores.
  • provide scoring efficiency and accuracy.

Limitations in Using Matching Items

Matching items...

  • have difficulty measuring learning objectives requiring more than simple recall of information.
  • are difficult to construct due to the problem of selecting a common set of stimuli and responses.

Suggestions for Writing Matching Test Items

1.  Include directions which clearly state the basis for matching the stimuli with the responses. Explain whether or not a response can be used more than once and indicate where to write the answer.
2.  Use only homogeneous material in matching items.
3.  Arrange the list of responses in some systematic order if possible (e.g., chronological, alphabetical).
4.  Avoid grammatical or other clues to the correct response.

5.  Keep matching items brief, limiting the list of stimuli to under 10.

6.  Include more responses than stimuli to help prevent answering through the process of elimination.

7.  When possible, reduce the amount of reading time by including only short phrases or single words in the response list.

The completion item requires the student to answer a question or to finish an incomplete statement by filling in a blank with the correct word or phrase. For example,

Sample Completion Item

According to Freud, personality is made up of three major systems, the _________, the ________ and the ________.

Advantages in Using Completion Items

Completion items...

  • can provide a wide sampling of content.
  • can efficiently measure lower levels of cognitive ability.
  • can minimize guessing as compared to multiple-choice or true-false items.
  • can usually provide an objective measure of student achievement or ability.

Limitations of Using Completion Items

Completion items...

  • are difficult to construct so that the desired response is clearly indicated.
  • are more time consuming to score when compared to multiple-choice or true-false items.
  • are more difficult to score since more than one answer may have to be considered correct if the item was not properly prepared.

Suggestions for Writing Completion Test Items

1.  Omit only significant words from the statement.
2.  Do not omit so many words from the statement that the intended meaning is lost.
3.  Avoid grammatical or other clues to the correct response.
4.  Be sure there is only one correct response.
5.  Make the blanks of equal length.
6.  When possible, delete words at the end of the statement after the student has been presented a clearly defined problem.

7.  Avoid lifting statements directly from the text, lecture or other sources.

8.  Limit the required response to a single word or phrase.

The essay test is probably the most popular of all types of teacher-made tests. In general, a classroom essay test consists of a small number of questions to which the student is expected to demonstrate his/her ability to (a) recall factual knowledge, (b) organize this knowledge and (c) present the knowledge in a logical, integrated answer to the question. An essay test item can be classified as either an extended-response essay item or a short-answer essay item. The latter calls for a more restricted or limited answer in terms of form or scope. An example of each type of essay item follows.

Sample Extended-Response Essay Item

Explain the difference between the S-R (Stimulus-Response) and the S-O-R (Stimulus-Organism-Response) theories of personality. Include in your answer (a) brief descriptions of both theories, (b) supporters of both theories and (c) research methods used to study each of the two theories. (10 pts.  20 minutes)

Sample Short-Answer Essay Item

Identify research methods used to study the S-R (Stimulus-Response) and S-O-R (Stimulus-Organism-Response) theories of personality. (5 pts.  10 minutes)

Advantages In Using Essay Items

Essay items...

  • are easier and less time consuming to construct than are most other item types.
  • provide a means for testing student's ability to compose an answer and present it in a logical manner.
  • can efficiently measure higher order cognitive objectives (e.g., analysis, synthesis, evaluation).

Limitations In Using Essay Items

Essay items...

  • cannot measure a large amount of content or objectives.
  • generally provide low test and test scorer reliability.
  • require an extensive amount of instructor's time to read and grade.
  • generally do not provide an objective measure of student achievement or ability (subject to bias on the part of the grader).

Suggestions for Writing Essay Test Items

1.  Prepare essay items that elicit the type of behavior you want to measure.
Learning Objective: The student will be able to explain how the normal curve serves as a statistical model.
Undesirable: Describe a normal curve in terms of: symmetry, modality, kurtosis and skewness.
Desirable: Briefly explain how the normal curve serves as a statistical model for estimation and hypothesis testing.
2.  Phrase each item so that the student's task is clearly indicated.
Undesirable: Discuss the economic factors which led to the stock market crash of 1929.
Desirable: Identify the three major economic conditions which led to the stock market crash of 1929. Discuss briefly each condition in correct chronological sequence and in one paragraph indicate how the three factors were inter-related.
3.  Indicate for each item a point value or weight and an estimated time limit for answering.
Undesirable: Compare the writings of Bret Harte and Mark Twain in terms of settings, depth of characterization, and dialogue styles of their main characters.
Desirable: Compare the writings of Bret Harte and Mark Twain in terms of settings, depth of characterization, and dialogue styles of their main characters. (10 points 20 minutes)

4.  Ask questions that will elicit responses on which experts could agree that one answer is better than another.

5.  Avoid giving the student a choice among optional items as this greatly reduces the reliability of the test.

6.  It is generally recommended for classroom examinations to administer several short-answer items rather than only one or two extended-response items.

Suggestions for Scoring Essay Items

ANALYTICAL SCORING: Each answer is compared to an ideal answer and points are assigned for the inclusion of necessary elements. Grades are based on the number of accumulated points, either absolutely (e.g., A = 10 or more points, B = 6-9 points, etc.) or relatively (e.g., A = top 15% of scores, B = next 30% of scores, etc.).
GLOBAL QUALITY: Each answer is read and assigned a score (e.g., grade, total points) based either on the total quality of the response or on the total quality of the response relative to other student answers.
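To make the two grading models concrete, the following minimal sketch shows analytical scoring in code. The A and B cutoffs are taken from the description above (A = 10 or more points, B = 6-9 points; A = top 15% of scores, B = next 30%); the remaining grade boundaries, the function names, and the data layout are assumptions made only for illustration.

```python
# Sketch of the two analytical-scoring grade assignments described above.
# A/B cutoffs come from the text; the C/D/F boundaries are assumed.

def absolute_grade(points: int) -> str:
    """Absolute scale: A = 10 or more points, B = 6-9 points (lower cutoffs assumed)."""
    if points >= 10:
        return "A"
    if points >= 6:
        return "B"
    if points >= 4:
        return "C"  # assumed
    if points >= 2:
        return "D"  # assumed
    return "F"

def relative_grades(scores: list[int]) -> list[tuple[int, str]]:
    """Relative scale: A = top 15% of scores, B = next 30% (remaining split assumed)."""
    cutoffs = [("A", 0.15), ("B", 0.45), ("C", 0.80), ("D", 0.95), ("F", 1.00)]
    ranked = sorted(scores, reverse=True)
    graded = []
    for rank, score in enumerate(ranked):
        share_above = rank / len(ranked)  # fraction of the class scoring above this paper
        grade = next(g for g, limit in cutoffs if share_above < limit)
        graded.append((score, grade))
    return graded

print(absolute_grade(8))                      # B (6-9 points)
print(relative_grades([12, 10, 9, 7, 6, 3]))  # [(12, 'A'), (10, 'B'), (9, 'B'), (7, 'C'), (6, 'C'), (3, 'D')]
```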

Example Essay Item and Grading Models

"Americans are a mixed-up people with no sense of ethical values. Everyone knows that baseball is far less necessary than food and steel, yet they pay ball players a lot more than farmers and steelworkers."

WHY? Use 3-4 sentences to indicate how an economist would explain the above situation.

Analytical Scoring

Global Quality

Assign scores or grades on the overall quality of the written response as compared to an ideal answer. Or, compare the overall quality of a response to other student responses by sorting the papers into three stacks:

Then read and sort each stack again, dividing it into three more stacks.

In total, nine discriminations can be used to assign test grades in this manner. The number of stacks or discriminations can vary to meet your needs.
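The same two-pass sort can be sketched in code: papers are sorted into three stacks by overall quality, and each stack is then re-read and split into three more, yielding nine rank levels. In the sketch below, which is only an illustration and not part of the source, the quality_score function stands in for the grader's holistic judgment, and the paper IDs and scores are invented.

```python
# Sketch of the two-pass "three stacks" global-quality sort (3 x 3 = 9 levels).
# quality_score stands in for the grader's holistic judgment of each paper.

def sort_into_three(papers, quality_score):
    """Split papers into high / middle / low thirds by holistic quality."""
    ranked = sorted(papers, key=quality_score, reverse=True)
    third = max(1, len(ranked) // 3)
    return [ranked[:third], ranked[third:2 * third], ranked[2 * third:]]

def nine_level_sort(papers, quality_score):
    """First pass: three stacks. Second pass: each stack is split into three again."""
    levels = []
    for stack in sort_into_three(papers, quality_score):
        levels.extend(sort_into_three(stack, quality_score))
    return levels  # levels[0] holds the strongest papers, levels[8] the weakest

holistic = {"p1": 9.1, "p2": 7.4, "p3": 8.8, "p4": 5.0, "p5": 6.2,
            "p6": 3.9, "p7": 8.0, "p8": 4.4, "p9": 6.8}
print(nine_level_sort(list(holistic), holistic.get))
```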

  • Try not to allow factors which are irrelevant to the learning outcomes being measured to affect your grading (e.g., handwriting, spelling, neatness).
  • Read and grade all class answers to one item before going on to the next item.
  • Read and grade the answers without looking at the students' names to avoid possible preferential treatment.
  • Occasionally shuffle papers during the reading of answers to help avoid any systematic order effects (e.g., Sally's "B" work always following Jim's "A" work and therefore looking more like "C" work).
  • When possible, ask another instructor to read and grade your students' responses. (A sketch of a grading workflow built on these suggestions follows.)
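The workflow sketched below is a hypothetical illustration of those suggestions: grade every response to one item before moving to the next, reshuffle the papers between items, and never look at names while reading. The data layout (item number mapped to opaque paper IDs and answer text) and the grade_response callback are assumptions, not part of the source.

```python
import random

# Hypothetical grading workflow: item by item, over shuffled, anonymized papers.
# responses[item_number] maps an opaque paper ID (not a name) to the answer text;
# grade_response(item_number, answer_text) stands in for the instructor's judgment.

def grade_exam(responses, grade_response):
    scores = {}  # paper_id -> {item_number: points}
    for item_number in sorted(responses):       # finish one item before starting the next
        papers = list(responses[item_number].items())
        random.shuffle(papers)                  # avoid systematic order effects between items
        for paper_id, answer_text in papers:
            points = grade_response(item_number, answer_text)
            scores.setdefault(paper_id, {})[item_number] = points
    return scores

# Example with a trivial stand-in grader that awards no points:
responses = {1: {"A17": "...", "B42": "..."}, 2: {"A17": "...", "B42": "..."}}
print(grade_exam(responses, lambda item, text: 0))
```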

Another form of a subjective test item is the problem solving or computational exam question. Such items present the student with a problem situation or task and require a demonstration of work procedures and a correct solution, or just a correct solution. This kind of test item is classified as a subjective type of item due to the procedures used to score item responses. Instructors can assign full or partial credit to either correct or incorrect solutions depending on the quality and kind of work procedures presented. An example of a problem solving test item follows.

Example Problem Solving Test Item

It was calculated that 75 men could complete a strip on a new highway in 70 days. When work was scheduled to commence, it was found necessary to send 25 men on another road project. How many days longer will it take to complete the strip? Show your work for full or partial credit.
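Because the item asks students to show their work, a model answer is worth preparing in advance (see suggestion 7 below about working each problem before administering it). One worked solution, under the standard assumption that the total amount of work is a fixed number of man-days:

```python
# Worked solution to the sample item above: total work is fixed, so the number
# of days scales inversely with the size of the crew.

planned_men, planned_days = 75, 70
total_work = planned_men * planned_days   # 5,250 man-days of work to finish the strip

actual_men = planned_men - 25             # 50 men remain on the project
actual_days = total_work / actual_men     # 5,250 / 50 = 105 days

print(f"Extra days needed: {actual_days - planned_days:.0f}")  # 35 days longer
```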

Advantages In Using Problem Solving Items

Problem solving items...

  • minimize guessing by requiring the students to provide an original response rather than to select from several alternatives.
  • are easier to construct than are multiple-choice or matching items.
  • can most appropriately measure learning objectives which focus on the ability to apply skills or knowledge in the solution of problems.
  • can measure an extensive amount of content or objectives.

Limitations in Using Problem Solving Items

Problem solving items...

  • require an extensive amount of instructor time to read and grade.
  • generally do not provide an objective measure of student achievement or ability (subject to bias on the part of the grader when partial credit is given).

Suggestions For Writing Problem Solving Test Items

1.  Clearly identify and explain the problem.
2.  Provide directions which clearly inform the student of the type of response called for.
3.  State in the directions whether or not the student must show his/her work procedures for full or partial credit.
4.  Clearly separate item parts and indicate their point values. For example, an item might be built on the following scenario: "A man leaves his home and drives to a convention at an average rate of 50 miles per hour. Upon arrival, he finds a telegram advising him to return at once. He catches a plane that takes him back at an average rate of 300 miles per hour." The item would then pose separate questions about this scenario, each with its own point value.
5.  Use figures, conditions and situations which create a realistic problem.

6.  Ask questions that elicit responses on which experts could agree that one solution and one or more work procedures are better than others.

7.  Work through each problem before classroom administration to double-check accuracy.

A performance test item is designed to assess the ability of a student to perform correctly in a simulated situation (i.e., a situation in which the student will be ultimately expected to apply his/her learning). The concept of simulation is central in performance testing; a performance test will simulate to some degree a real life situation to accomplish the assessment. In theory, a performance test could be constructed for any skill and real life situation. In practice, most performance tests have been developed for the assessment of vocational, managerial, administrative, leadership, communication, interpersonal and physical education skills in various simulated situations. An illustrative example of a performance test item is provided below.

Sample Performance Test Item

Assume that some of the instructional objectives of an urban planning course include the development of the student's ability to effectively use the principles covered in the course in various "real life" situations common for an urban planning professional. A performance test item could measure this development by presenting the student with a specific situation which represents a "real life" situation. For example,

An urban planning board makes a last minute request for the professional to act as consultant and critique a written proposal which is to be considered in a board meeting that very evening. The professional arrives before the meeting and has one hour to analyze the written proposal and prepare his critique. The critique presentation is then made verbally during the board meeting; reactions of members of the board or the audience include requests for explanation of specific points or informed attacks on the positions taken by the professional.

The performance test designed to simulate this situation would require the student being tested to role-play the professional's part, while students or faculty act the other roles in the situation. Various aspects of the "professional's" performance would then be observed and rated by several judges with the necessary background. The ratings could then be used both to provide the student with a diagnosis of his/her strengths and weaknesses and to contribute to an overall summary evaluation of the student's abilities.

Advantages In Using Performance Test Items

Performance test items...

  • can most appropriately measure learning objectives which focus on the ability of the students to apply skills or knowledge in real life situations.
  • usually provide a degree of test validity not possible with standard paper and pencil test items.
  • are useful for measuring learning objectives in the psychomotor domain.

Limitations In Using Performance Test Items

Performance test items...

  • are difficult and time consuming to construct.
  • are primarily used for testing students individually and not for testing groups. Consequently, they are relatively costly, time consuming, and inconvenient forms of testing.
  • generally do not provide an objective measure of student achievement or ability (subject to bias on the part of the observer/grader).

Suggestions For Writing Performance Test Items

  • Prepare items that elicit the type of behavior you want to measure.
  • Clearly identify and explain the simulated situation to the student.
  • Make the simulated situation as "life-like" as possible.
  • Provide directions which clearly inform the students of the type of response called for.
  • When appropriate, clearly state time and activity limitations in the directions.
  • Adequately train the observer(s)/scorer(s) to ensure that they are fair in scoring the appropriate behaviors.

III. Two Methods for Assessing Test Item Quality

This section presents two methods for collecting feedback on the quality of your test items. The two methods include using self-review checklists and student evaluation of test item quality. You can use the information gathered from either method to identify strengths and weaknesses in your item writing. 

Checklist for Evaluating Test Items

EVALUATE YOUR TEST ITEMS BY CHECKING THE SUGGESTIONS WHICH YOU FEEL YOU HAVE FOLLOWED.  

____ When possible, stated the stem as a direct question rather than as an incomplete statement.
____ Presented a definite, explicit and singular question or problem in the stem.
____ Eliminated excessive verbiage or irrelevant information from the stem.
____ Included in the stem any word(s) that might have otherwise been repeated in each alternative.
____ Used negatively stated stems sparingly. When used, underlined and/or capitalized the negative word(s).
____ Made all alternatives plausible and attractive to the less knowledgeable or skillful student.
____ Made the alternatives grammatically parallel with each other, and consistent with the stem.
____ Made the alternatives mutually exclusive.
____ When possible, presented alternatives in some logical order (e.g., chronologically, most to least).
____ Made sure there was only one correct or best response per item.
____ Made alternatives approximately equal in length.
____ Avoided irrelevant clues such as grammatical structure, well known verbal associations or connections between stem and answer.
____ Used at least four alternatives for each item.
____ Randomly distributed the correct response among the alternative positions throughout the test having approximately the same proportion of alternatives a, b, c, d, and e as the correct response.
____ Used the alternatives "none of the above" and "all of the above" sparingly. When used, such alternatives were occasionally the correct response.
____ Based true-false items upon statements that are absolutely true or false, without qualifications or exceptions.
____ Expressed the item statement as simply and as clearly as possible.
____ Expressed a single idea in each test item.
____ Included enough background information and qualifications so that the ability to respond correctly did not depend on some special, uncommon knowledge.
____ Avoided lifting statements from the text, lecture, or other materials.
____ Avoided using negatively stated item statements.
____ Avoided the use of unfamiliar language.
____ Avoided the use of specific determiners such as "all," "always," "none," "never," etc., and qualifying determiners such as "usually," "sometimes," "often," etc.
____ Used more false items than true items (but not more than 15% additional false items).
____ Included directions which clearly stated the basis for matching the stimuli with the response.
____ Explained whether or not a response could be used more than once and indicated where to write the answer.
____ Used only homogeneous material.
____ When possible, arranged the list of responses in some systematic order (e.g., chronologically, alphabetically).
____ Avoided grammatical or other clues to the correct response.
____ Kept items brief (limited the list of stimuli to under 10).
____ Included more responses than stimuli.

____ When possible, reduced the amount of reading time by including only short phrases or single words in the response list.
____ Omitted only significant words from the statement.
____ Did not omit so many words from the statement that the intended meaning was lost.
____ Avoided grammatical or other clues to the correct response.
____ Included only one correct response per item.
____ Made the blanks of equal length.
____ When possible, deleted the words at the end of the statement after the student was presented with a clearly defined problem.
____ Avoided lifting statements directly from the text, lecture, or other sources.
____ Limited the required response to a single word or phrase.
____ Prepared items that elicited the type of behavior you wanted to measure.
____ Phrased each item so that the student's task was clearly indicated.
____ Indicated for each item a point value or weight and an estimated time limit for answering.
____ Asked questions that elicited responses on which experts could agree that one answer is better than others.
____ Avoided giving the student a choice among optional items.
____ Administered several short-answer items rather than 1 or 2 extended-response items.

Grading Essay Test Items

____ Selected an appropriate grading model.
____ Tried not to allow factors which were irrelevant to the learning outcomes being measured to affect your grading (e.g., handwriting, spelling, neatness).
____ Read and graded all class answers to one item before going on to the next item.
____ Read and graded the answers without looking at the student's name to avoid possible preferential treatment.
____ Occasionally shuffled papers during the reading of answers.
____ When possible, asked another instructor to read and grade your students' responses.
____ Clearly identified and explained the problem to the student.
____ Provided directions which clearly informed the student of the type of response called for.
____ Stated in the directions whether or not the student must show work procedures for full or partial credit.
____ Clearly separated item parts and indicated their point values.
____ Used figures, conditions and situations which created a realistic problem.
____ Asked questions that elicited responses on which experts could agree that one solution and one or more work procedures are better than others.

____ Worked through each problem before classroom administration.
____ Prepared items that elicited the type of behavior you wanted to measure.
____ Clearly identified and explained the simulated situation to the student.
____ Made the simulated situation as "life-like" as possible.
____ Provided directions which clearly informed the students of the type of response called for.
____ When appropriate, clearly stated time and activity limitations in the directions.
____ Adequately trained the observer(s)/scorer(s) to ensure that they were fair in scoring the appropriate behaviors.

Student Evaluation of Test Item Quality

Using ICES Questionnaire Items to Assess Your Test Item Quality

The following set of ICES (Instructor and Course Evaluation System) questionnaire items can be used to assess the quality of your test items. The items are presented with their original ICES catalogue number. You are encouraged to include one or more of the items on the ICES evaluation form in order to collect student opinion of your item writing quality.

  • 102 - How would you rate the instructor's examination questions? (Excellent / Poor)
  • 103 - How well did examination questions reflect content and emphasis of the course? (Well related / Poorly related)
  • 109 - Were exams, papers, reports returned with errors explained or personal comments? (Almost always / Almost never)
  • 114 - The exams reflected important points in the reading assignments. (Strongly agree / Strongly disagree)
  • 115 - Were the instructor's test questions thought provoking? (Definitely yes / Definitely no)
  • 116 - Did the exams challenge you to do original thinking? (Yes, very challenging / No, not challenging)
  • 118 - Were there "trick" or trite questions on tests? (Lots of them / Few if any)
  • 119 - Were exam questions worded clearly? (Yes, very clear / No, very unclear)
  • 121 - How was the length of exams for the time allotted? (Too long / Too short)
  • 122 - How difficult were the examinations? (Too difficult / Too easy)
  • 123 - I found I could score reasonably well on exams by just cramming. (Strongly agree / Strongly disagree)
  • 125 - Were exams adequately discussed upon return? (Yes, adequately / No, not enough)

IV. Assistance Offered by the Center for Innovation in Teaching and Learning (CITL)

The information on this page is intended for self-instruction. However, CITL staff members will consult with faculty who wish to analyze and improve their test item writing. The staff can also consult with faculty about other instructional problems. Instructors wishing to acquire CITL assistance can contact [email protected]

V. References for Further Reading

Ebel, R. L. (1965). Measuring educational achievement. Prentice-Hall.
Ebel, R. L. (1972). Essentials of educational measurement. Prentice-Hall.
Gronlund, N. E. (1976). Measurement and evaluation in teaching (3rd ed.). Macmillan.
Mehrens, W. A., & Lehmann, I. J. (1973). Measurement and evaluation in education and psychology. Holt, Rinehart & Winston.
Nelson, C. H. (1970). Measurement and evaluation in the classroom. Macmillan.
Payne, D. A. (1974). The assessment of learning: Cognitive and affective. D.C. Heath & Co.
Scannell, D. P., & Tracy, D. B. (1975). Testing and measurement in the classroom. Houghton Mifflin.
Thorndike, R. L. (1971). Educational measurement (2nd ed.). American Council on Education.

Center for Innovation in Teaching & Learning

249 Armory Building 505 East Armory Avenue Champaign, IL 61820

217 333-1462

Email: [email protected]

Scoring Essay Questions


Although essay questions are powerful assessment tools, they can be difficult to score. With essays, there isn't a single, correct answer and it is almost impossible to use an automatic scantron or computer-based system. In order to minimize the subjectivity and bias that may occur in the assessment, teachers should prepare a list of criteria prior to scoring the essays. Consider, for example, the following question and scoring criteria:

Consider the time period during the Vietnam War and the reasons there were riots in cities and at university campuses. Write an essay explaining three of those reasons. Include information on the impact (if any) of the riots. The essay should be approximately one page in length.  Your score will depend on the accuracy of your reasons, the organization of your essay, and brevity.  Although spelling, punctuation, and grammar will not be considered in grading, please do your best to consider them in your writing. (10 points possible)

By outlining the criteria for assessment, the students know precisely how they will be assessed and where they should concentrate their efforts. In addition, the instructor can provide feedback that is less biased and more consistent. Additional techniques for scoring constructed response items include:

  • Do not look at the student's name when you grade the essay.
  • Outline an exemplary response before reviewing student responses.
  • Scan through the responses and look for major discrepancies in the answers -- this might indicate that the question was not clear.
  • If there are multiple questions, score Question #1 for all students, then Question #2, etc.
  • Use a scoring rubric that provides specific areas of feedback for the students.

Detailed information about constructing checklists and rubrics for scoring is provided in the Performance Assessments lesson. 
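A scoring rubric for the sample question above can be as simple as a list of criteria and point values. The sketch below is purely illustrative: the three criteria (accuracy of the reasons, organization, brevity) come from the prompt, but the point split and the function are assumptions, not part of the source lesson.

```python
# Hypothetical rubric for the sample Vietnam-era essay question (10 points total).
# The criteria come from the prompt; the point allocation is an assumption.

RUBRIC = {
    "accuracy of the three reasons": 6,        # assumed: 2 points per well-supported reason
    "organization of the essay": 2,
    "brevity (approximately one page)": 2,
}

def score_essay(earned: dict) -> tuple[int, list[str]]:
    """Total a response against the rubric and collect per-criterion feedback."""
    total, feedback = 0, []
    for criterion, max_points in RUBRIC.items():
        points = min(earned.get(criterion, 0), max_points)
        total += points
        feedback.append(f"{criterion}: {points}/{max_points}")
    return total, feedback

total, feedback = score_essay({
    "accuracy of the three reasons": 5,
    "organization of the essay": 2,
    "brevity (approximately one page)": 1,
})
print(total)     # 8 out of 10
print(feedback)
```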

One way for teachers and students to become more proficient at writing and scoring essay questions is to practice scoring the responses of other students. Select one of the practice programs (the one closest to your grade level) and practice applying the rubric to score the constructed responses. Feedback is provided, as well as links to the corresponding rubric.

-- Select "Rubric Scoring" on the Main Menu -- Select "Practice Activities" on the Main Menu -- Select "Practice Activities" on the Main Menu

 

| | | Constructed Response


and the at .

 

Online Course

Module components.

  • Assignments
  • Test Scoring
  • Item Analysis

Evaluating Written Items

NURS 791 - Instructional Strategies and Assessment, Module 12: Test Interpretation

The content component of the evaluation is determined by the nature of the question/item. The test developer must decide what elements must be present for a response to be considered “correct.” This is critical, as individual respondents may vary widely in how they follow the instructions and present their response. This process presumes a well-written essay item with crystal clear instructions for the respondents.

Often the pre-determined content component is assessed by using a rubric. Given the nature of the written item, the teacher can develop the criteria in rubric format. What is a rubric? TeacherVision offers an excellent definition, speaks to the advantages of using rubrics, and offers a great example based on evaluating chocolate chip cookies!




English Composition 1

Evaluation and Grading Criteria for Essays

IVCC's online Style Book presents the Grading Criteria for Writing Assignments .

This page explains some of the major aspects of an essay that are given special attention when the essay is evaluated.

Thesis and Thesis Statement

Probably the most important sentence in an essay is the thesis statement, which is a sentence that conveys the thesis—the main point and purpose of the essay. The thesis is what gives an essay a purpose and a point, and, in a well-focused essay, every part of the essay helps the writer develop and support the thesis in some way.

The thesis should be stated in your introduction as one complete sentence that

  • identifies the topic of the essay,
  • states the main points developed in the essay,
  • clarifies how all of the main points are logically related, and
  • conveys the purpose of the essay.

In high school, students often are told to begin an introduction with a thesis statement and then to follow this statement with a series of sentences, each sentence presenting one of the main points or claims of the essay. While this approach probably helps students organize their essays, spreading a thesis statement over several sentences in the introduction usually is not effective. For one thing, it can lead to an essay that develops several points but does not make meaningful or clear connections among the different ideas.

If you can state all of your main points logically in just one sentence, then all of those points should come together logically in just one essay. When I evaluate an essay, I look specifically for a one-sentence statement of the thesis in the introduction that, again, identifies the topic of the essay, states all of the main points, clarifies how those points are logically related, and conveys the purpose of the essay.

If you are used to using the high school model to present the thesis of an essay, you might wonder what you should do with the rest of your introduction once you start presenting a one-sentence statement of your thesis. Well, an introduction should do two important things: (1) present the thesis statement, and (2) get readers interested in the subject of the essay.

Instead of outlining each stage of an essay with separate sentences in the introduction, you could draw readers into your essay by appealing to their interests at the very beginning of your essay. Why should what you discuss in your essay be important to readers? Why should they care? Answering these questions might help you discover a way to draw readers into your essay effectively. Once you appeal to the interests of your readers, you should then present a clear and focused thesis statement. (And thesis statements most often appear at the ends of introductions, not at the beginnings.)

Coming up with a thesis statement during the early stages of the writing process is difficult. You might instead begin by deciding on three or four related claims or ideas that you think you could prove in your essay. Think in terms of paragraphs: choose claims that you think could be supported and developed well in one body paragraph each. Once you have decided on the three or four main claims and how they are logically related, you can bring them together into a one-sentence thesis statement.

All of the topic sentences in a short paper, when "added" together, should give us the thesis statement for the entire paper. Do the addition for your own papers, and see if you come up with the following:

Topic Sentence 1 + Topic Sentence 2 + Topic Sentence 3 = Thesis Statement

Organization

Effective expository papers generally are well organized and unified, in part because of fairly rigid guidelines that writers follow and that you should try to follow in your papers.

Each body paragraph of your paper should begin with a topic sentence, a statement of the main point of the paragraph. Just as a thesis statement conveys the main point of an entire essay, a topic sentence conveys the main point of a single body paragraph. As illustrated above, a clear and logical relationship should exist between the topic sentences of a paper and the thesis statement.

If the purpose of a paragraph is to persuade readers, the topic sentence should present a claim, or something that you can prove with specific evidence. If you begin a body paragraph with a claim, a point to prove, then you know exactly what you will do in the rest of the paragraph: prove the claim. You also know when to end the paragraph: when you think you have convinced readers that your claim is valid and well supported.

If you begin a body paragraph with a fact, though, something that is true by definition, then you have nothing to prove from the beginning of the paragraph, possibly causing you to wander from point to point in the paragraph. The claim at the beginning of a body paragraph is very important: it gives you a point to prove, helping you unify the paragraph and helping you decide when to end one paragraph and begin another.

The length and number of body paragraphs in an essay is another thing to consider. In general, each body paragraph should be at least half of a page long (for a double-spaced essay), and most expository essays have at least three body paragraphs (for a total of at least five paragraphs, including the introduction and conclusion).

Support and Development of Ideas

The main difference between a convincing, insightful interpretation or argument and a weak one often is the amount of evidence that the writer uses. "Evidence" refers to specific facts.

Remember this fact: your interpretation or argument will be weak unless it is well supported with specific evidence. This means that, for every claim you present, you need to support it with at least several different pieces of specific evidence. Often, students will present potentially insightful comments, but the comments are not supported or developed with specific evidence. When you come up with an insightful idea, you are most likely basing that idea on some specific facts. To present your interpretation or argument well, you need to state your interpretation and then explain the facts that have led you to this conclusion.

Effective organization is also important here. If you begin each body paragraph with a claim, and if you then stay focused on supporting that claim with several pieces of evidence, you should have a well-supported and well-developed interpretation.

As stated above, each body paragraph generally should be at least half of a page long, so, if you find that your body paragraphs are shorter than this, then you might not be developing your ideas in much depth. Often, when a student has trouble reaching the required minimum length for an essay, the problem is the lack of sufficient supporting evidence.

In an interpretation or argument, you are trying to explain and prove something about your subject, so you need to use plenty of specific evidence as support. A good approach to supporting an interpretation or argument is dividing your interpretation or argument into a few significant and related claims and then supporting each claim thoroughly in one body paragraph.

Insight into Subject

Sometimes a student will write a well-organized essay, but the essay does not shed much light on the subject. At the same time, I am often amazed at the insightful interpretations and arguments that students come up with. Every semester, students interpret aspects of texts or present arguments that I had never considered.

If you are writing an interpretation, you should reread the text or study your subject thoroughly, doing your best to notice something new each time you examine it. As you come up with a possible interpretation to develop in an essay, you should re-examine your subject with that interpretation in mind, marking passages (if your subject is a literary text) and taking plenty of notes on your subject. Studying your subject in this way will make it easier for you to find supporting evidence for your interpretation as you write your essay.

The insightfulness of an essay often is directly related to the organization and the support and development of the ideas in the essay. If you have well-developed body paragraphs focused on one specific point each, then it is likely that you are going into depth with the ideas you present and are offering an insightful interpretation.

If you organize your essay well, and if you use plenty of specific evidence to support your thesis and the individual claims that comprise that thesis, then there is a good possibility that your essay will be insightful.

Clarity

Clarity is always important: if your writing is not clear, your meaning will not reach readers the way you would like it to. According to IVCC's Grading Criteria for Writing Assignments, "A," "B," and "C" essays are clear throughout, meaning that problems with clarity can have a substantial effect on the grade of an essay.

If any parts of your essay or any sentences seem just a little unclear to you, you can bet that they will be unclear to readers. Review your essay carefully and change any parts of the essay that could cause confusion for readers. Also, take special note of any passages that your peer critiquers feel are not very clear.

"Style" refers to the kinds of words and sentences that you use, but there are many aspects of style to consider. Aspects of style include conciseness, variety of sentence structure, consistent verb tense, avoidance of the passive voice, and attention to the connotative meanings of words.

Several of the course web pages provide information relevant to style, including the following pages:

  • "Words, Words, Words"
  • Using Specific and Concrete Diction
  • Integrating Quotations into Sentences
  • Formal Writing Voice

William Strunk Jr.'s The Elements of Style is a classic text on style that is now available online.

Given the subject, purpose, and audience for each essay in this course, you should use a formal writing voice . This means that you should avoid use of the first person ("I," "me," "we," etc.), the use of contractions ("can't," "won't," etc.), and the use of slang or other informal language. A formal writing voice will make you sound more convincing and more authoritative.

If you use quotations in a paper, integrating those quotations smoothly, logically, and grammatically into your own sentences is important, so make sure that you are familiar with the information on the Integrating Quotations into Sentences page.

"Mechanics" refers to the correctness of a paper: complete sentences, correct punctuation, accurate word choice, etc. All of your papers for the course should be free or almost free from errors. Proofread carefully, and consider any constructive comments you receive during peer critiques that relate to the "mechanics" of your writing.

You might use the grammar checker if your word-processing program has one, but grammar checkers are correct only about half of the time. A grammar checker, though, could help you identify parts of the essay that might include errors. You will then need to decide for yourself if the grammar checker is right or wrong.

The elimination of errors from your writing is important. In fact, according to IVCC's Grading Criteria for Writing Assignments , "A," "B," and "C" essays contain almost no errors. Significant or numerous errors are a characteristic of a "D" or "F" essay.

Again, the specific errors listed in the second table above are explained on the Identifying and Eliminating Common Errors in Writing web page.

You should have a good understanding of what errors to look out for based on the feedback you receive on graded papers, and I would be happy to answer any questions you might have about possible errors or about any other aspects of your essay. You just need to ask!

Copyright Randy Rambo , 2021.

Automated Essay Scoring System Based on Rubric


  • Megumi Yamamoto,
  • Nobuo Umemura &
  • Hiroyuki Kawano

Part of the book series: Studies in Computational Intelligence (SCI, volume 727)

Included in the following conference series:

  • International Conference on Applied Computing and Information Technology


In this paper, we propose an architecture for an automated essay scoring system based on a rubric, which combines automated scoring with human scoring. Rubrics are valid criteria for grading students' essays. Our proposed rubric has five evaluation viewpoints, "Contents, Structure, Evidence, Style, and Skill," and 25 evaluation items that subdivide these viewpoints. The system is a cloud-based application and consists of several tools such as Moodle, R, MeCab, and RedPen. First, the system automatically scores the 11 items included in Style and Skill, such as sentence style, syntax, usage, readability, and lexical richness. It then predicts the scores of Style and Skill from these item scores with a multiple regression model. It also predicts the Contents score from the cosine similarity between the topic and the description. Moreover, our system classifies essays into five grades, "A+, A, B, C, D," as useful information for teachers, using machine learning techniques such as support vector machines. We are working to improve the automated scoring algorithms and broaden the variety of input essays in order to raise the classification accuracy above 90%.
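As a rough illustration of two pieces of this pipeline, the sketch below computes a topic-essay cosine similarity (the basis of the Contents prediction) and fits a multiple regression from item-level scores to a viewpoint score. It is a minimal sketch under assumed data and scikit-learn tooling, not the authors' implementation.

```python
# Minimal sketch (not the authors' implementation): TF-IDF cosine similarity between
# the assigned topic and an essay, and a multiple regression from item-level scores
# to a viewpoint score. All data values below are hypothetical.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.linear_model import LinearRegression

def content_similarity(topic: str, essay: str) -> float:
    """Cosine similarity between the topic text and the essay text."""
    vectors = TfidfVectorizer().fit_transform([topic, essay])
    return float(cosine_similarity(vectors[0], vectors[1])[0, 0])

# Hypothetical item-level scores (rows = essays, columns = Style/Skill items, 0-9 each)
item_scores = np.array([[7, 8, 6], [4, 5, 5], [9, 8, 9], [3, 2, 4]])
human_viewpoint_scores = np.array([7, 5, 9, 3])       # human rubric scores for fitting

regression = LinearRegression().fit(item_scores, human_viewpoint_scores)
print(content_similarity("Discuss renewable energy policy.",
                         "Subsidies for solar and wind power shape energy policy ..."))
print(regression.predict(np.array([[6, 7, 5]])))      # predicted viewpoint score
```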



Author information

Authors and Affiliations

School of Contemporary International Studies, Nagoya University of Foreign Studies, Nisshin, 470-0197, Japan

Megumi Yamamoto

School of Media and Design, Nagoya University of Arts and Sciences, Nagoya, 470-0196, Japan

Nobuo Umemura

Faculty of Science and Engineering, Nanzan University, Nagoya, 466-8673, Japan

Hiroyuki Kawano


Corresponding author

Correspondence to Megumi Yamamoto.


Appendix 1: Proposed Rubric for Human Scoring

Achievement levels and scoring: D (0–1), C (2–3), B (4–5), A (6–7), A+ (8–9)

[Content] Understanding of the assigned tasks and validity of contents

  • D (0–1): Misunderstanding the assigned task, or the contents are not related to the topic at all
  • C (2–3): Understanding the assigned task, but includes some errors
  • B (4–5): Understanding the assigned task, but the contents are insufficient
  • A (6–7): Understanding the assigned task, but has some points to improve
  • A+ (8–9): Appropriate contents with relevant terms; no need for improvement

[Structure] Logical development

  • D (0–1): No structure or theoretical development
  • C (2–3): There is a contradiction in the development of the theory
  • B (4–5): Although developing the theory in order, there are some points to be improved
  • A (6–7): Although developing the theory in order, the theory is not compelling
  • A+ (8–9): The theory is compelling and conveys the writer's understanding

[Evidence] Validity of sources and evidence

  • D (0–1): Does not show evidence
  • C (2–3): Demonstrates an attempt to support ideas
  • B (4–5): The sources referenced are inappropriate or unreliable
  • A (6–7): Uses relevant and reliable sources, but the way of referencing is not suitable
  • A+ (8–9): Demonstrates skillful use of high-quality and relevant sources

[Style] Proper usage of grammar and elaboration of sentences

  • D (0–1): There are some grammatical errors; many corrections required
  • C (2–3): Not following the rules; some corrections required
  • B (4–5): Almost follows the rules; a few corrections required
  • A (6–7): Although error-free, some improvement would be better
  • A+ (8–9): Virtually error-free and well elaborated; no points to improve

[Skill] Readability and writing skill

  • D (0–1): The sentences are hard to read; writing skills are missing
  • C (2–3): There are several points to be improved, such as the length of sentences
  • B (4–5): Although sentences can be read generally, some improvement would be better
  • A (6–7): Easy to read; rich in vocabulary
  • A+ (8–9): Easy to read; skillfully communicates meaning to readers; rich in vocabulary

Appendix 2: Proposed Rubric for Automated Scoring

| Evaluation Viewpoint | No. | Evaluation Item | Automated Scoring (0–9) |
|---|---|---|---|
| [Content] | 1 | Similarity between topic and description | Applicable |
| [Content] | 2 | Presence of keywords | Applicable |
| [Content] | 3 | Understanding of the writing task | Not applicable |
| [Content] | 4 | Comprehensive evaluation of contents | Not applicable |
| [Content] | 5 | Understanding of learning contents | Not applicable |
| [Structure] | 6 | Logic level | Not applicable |
| [Structure] | 7 | Validity of opinions and arguments | Not applicable |
| [Structure] | 8 | Division of facts and opinions | Not applicable |
| [Structure] | 9 | Persuasiveness | Not applicable |
| [Evidence] | 10 | Quality level of reference material | Not applicable |
| [Evidence] | 11 | Relevance of reference material | Not applicable |
| [Evidence] | 12 | Validity of reference material | Not applicable |
| [Evidence] | 13 | Explanation about tables and figures | Not applicable |
| [Evidence] | 14 | Validity of the quantity of citations | Conditionally applicable |
| [Style] | 15 | Unification of stylistics | Applicable |
| [Style] | 16 | Elimination of misused words and misspellings | Applicable |
| [Style] | 17 | Validity of syntax | Applicable |
| [Style] | 18 | Dependency of subject and predicate | Applicable |
| [Style] | 19 | Proper punctuation | Applicable |
| [Style] | 20 | Elimination of redundancy and double negation | Applicable |
| [Style] | 21 | Elimination of notation variability and ambiguity | Applicable |
| [Skill] | 22 | Kanji usage rate | Applicable |
| [Skill] | 23 | Validity of sentence length | Applicable |
| [Skill] | 24 | Lexical richness | Applicable |
| [Skill] | 25 | Lexical level | Applicable |
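To make a few of the "Applicable" items above more concrete, the fragment below sketches rough proxies for kanji usage rate, sentence length, and lexical richness. These are simplified illustrations only, not the chapter's actual implementation, which relies on tools such as MeCab and RedPen.

```python
# Illustrative only: rough proxies for three automatically scorable rubric items
# (kanji usage rate, sentence length, lexical richness). Not the chapter's actual code.
import re

def kanji_usage_rate(text: str) -> float:
    """Item 22 (proxy): share of non-space characters in the CJK ideograph range."""
    chars = [c for c in text if not c.isspace()]
    kanji = [c for c in chars if "\u4e00" <= c <= "\u9fff"]
    return len(kanji) / len(chars) if chars else 0.0

def mean_sentence_length(text: str) -> float:
    """Item 23 (proxy): average sentence length in characters."""
    sentences = [s for s in re.split(r"[。.!?]", text) if s.strip()]
    return sum(len(s) for s in sentences) / len(sentences) if sentences else 0.0

def type_token_ratio(tokens: list) -> float:
    """Item 24 (proxy): lexical richness as unique tokens over total tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0
```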


Copyright information

© 2018 Springer International Publishing AG

About this chapter

Yamamoto, M., Umemura, N., Kawano, H. (2018). Automated Essay Scoring System Based on Rubric. In: Lee, R. (eds) Applied Computing & Information Technology. ACIT 2017. Studies in Computational Intelligence, vol 727. Springer, Cham. https://doi.org/10.1007/978-3-319-64051-8_11



Center for Engaged Teaching and Learning


Assessment of Learning

Best Practices for Designing and Grading Exams

The most obvious function of assessment methods (such as exams, quizzes, papers, and presentations) is to enable instructors to make judgments about the quality of student learning (i.e., assign grades). However, the method of assessment also can have a direct impact on the quality of student learning. Students assume that the focus of exams and assignments reflects the educational goals most valued by an instructor, and they direct their learning and studying accordingly (McKeachie & Svinicki, 2006). General grading systems can have an impact as well. For example, a strict bell curve (i.e., norm-referenced grading) has the potential to dampen motivation and cooperation in a classroom, while a system that strictly rewards proficiency (i.e., criterion-referenced grading) could be perceived as contributing to grade inflation. Given the importance of assessment for both faculty and student interactions about learning, how can instructors develop exams that provide useful and relevant data about their students' learning and also direct students to spend their time on the important aspects of a course or course unit? How do grading practices further influence this process?

Guidelines for Designing Valid and Reliable Exams

Ideally, effective exams have four characteristics:

  • Valid (providing useful information about the concepts they were designed to test),
  • Reliable (allowing consistent measurement and discriminating between different levels of performance),
  • Recognizable (instruction has prepared students for the assessment), and
  • Realistic (concerning the time and effort required to complete the assignment) (Svinicki, 1999).

Most importantly, exams and assignments should focus on the most important content and behaviors emphasized during the course (or particular section of the course). What are the primary ideas, issues, and skills you hope students learn during a particular course/unit/module? These are the learning outcomes you wish to measure. For example, if your learning outcome involves memorization, then you should assess for memorization or classification; if you hope students will develop problem-solving capacities, your exams should focus on assessing students' application and analysis skills. As a general rule, assessments that focus too heavily on details (e.g., isolated facts, figures, etc.) "will probably lead to better student retention of the footnotes at the cost of the main points" (Halpern & Hakel, 2003, p. 40). As noted in Table 1, each type of exam item may be better suited to measuring some learning outcomes than others, and each has its advantages and disadvantages in terms of ease of design, implementation, and scoring.

Table 1: Advantages and Disadvantages of Commonly Used Types of Achievement Test Items

| Item type | Advantages | Disadvantages |
|---|---|---|
| True-false | Many items can be administered in a relatively short time. Moderately easy to write; easily scored. | Limited primarily to testing knowledge of information. Easy to guess correctly on many items, even if material has not been mastered. |
| Multiple-choice | Can be used to assess a broad range of content in a brief period. Skillfully written items can measure higher order cognitive skills. Can be scored quickly. | Difficult and time consuming to write good items. Possible to assess higher order cognitive skills, but most items assess only knowledge. Some correct answers can be guesses. |
| Matching | Items can be written quickly. A broad range of content can be assessed. Scoring can be done efficiently. | Higher order cognitive skills are difficult to assess. |
| Short answer / completion | Many can be administered in a brief amount of time. Relatively efficient to score. Moderately easy to write. | Difficult to identify defensible criteria for correct answers. Limited to questions that can be answered or completed in very few words. |
| Essay | Can be used to measure higher order cognitive skills. Relatively easy to write questions. Difficult for respondent to get correct answer by guessing. | Time consuming to administer and score. Difficult to identify reliable criteria for scoring. Only a limited range of content can be sampled during any one testing period. |

Adapted from Table 10.1 of Worthen, et al., 1993, p. 261.

General Guidelines for Developing Multiple-Choice and Essay Questions

The following sections highlight general guidelines for developing multiple-choice and essay questions, which are often used in college-level assessment because they readily lend themselves to measuring higher order thinking skills  (e.g., application, justification, inference, analysis and evaluation).  Yet instructors often struggle to create, implement, and score these types of questions (McMillan, 2001; Worthen, et al., 1993).

Multiple-choice questions  have a number of advantages. First, they can measure various kinds of knowledge, including students' understanding of terminology, facts, principles, methods, and procedures, as well as their ability to apply, interpret, and justify. When carefully designed, multiple-choice items also can assess higher-order thinking skills.

Multiple-choice questions are less ambiguous than short-answer items, thereby providing a more focused assessment of student knowledge. Multiple-choice items are superior to true-false items in several ways: on true-false items, students can receive credit for knowing that a statement is incorrect, without knowing what is correct. Multiple-choice items offer greater reliability than true-false items as the opportunity for guessing is reduced with the larger number of options. Finally, an instructor can diagnose misunderstanding by analyzing the incorrect options chosen by students.

A disadvantage of multiple-choice items is that they require developing incorrect, yet plausible, options that can be difficult to create. In addition, multiple-choice questions do not allow instructors to measure students' ability to organize and present ideas. Finally, because it is much easier to create multiple-choice items that test recall and recognition rather than higher order thinking, multiple-choice exams run the risk of not assessing the deep learning that many instructors consider important (Gronlund & Linn, 1990; McMillan, 2001).

Guidelines for writing multiple-choice items include advice about stems, correct answers, and distractors (McMillan, 2001, p. 150; Piontek, 2008):

  • Stems pose the problem or question.
  • Is the stem stated as clearly, directly, and simply as possible?
  • Is the problem described fully in the stem?
  • Is the stem stated positively, to avoid the possibility that students will overlook terms like “no,” “not,” or “least”?
  • Does the stem provide only information relevant to the problem?

Possible responses include the correct answer and distractors, or the incorrect choices. Multiple-choice questions usually have at least three distractors.

  • Are the distractors plausible to students who do not know the correct answer?
  • Is there only one correct answer?
  • Are all the possible answers parallel with respect to grammatical structure, length, and complexity?
  • Are the options short?
  • Are complex options avoided? Are options placed in logical order?
  • Are correct answers spread equally among all the choices? (For example, is answer "A" correct about the same number of times as options "B" or "C" or "D"?) A quick tally of the answer key, as sketched below, can reveal imbalances.
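The check below is a small, illustrative sketch; the answer key shown is hypothetical.

```python
# Quick, illustrative check: is the correct answer roughly balanced across options?
from collections import Counter

answer_key = ["A", "C", "B", "D", "A", "B", "C", "D", "A", "C"]   # hypothetical key
counts = Counter(answer_key)
expected = len(answer_key) / len(counts)               # ideal count per option
for option, n in sorted(counts.items()):
    flag = " <- check" if abs(n - expected) > 0.5 * expected else ""
    print(f"{option}: {n}{flag}")
```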

An example of good multiple-choice questions that assess higher-order thinking skills is the following test question from pharmacy (Park, 2008):

Patient WC was admitted for third-degree burns over 75% of his body. The attending physician asks you to start this patient on antibiotic therapy. Which one of the following is the best reason why WC would need antibiotic prophylaxis?

a. His burn injuries have broken down the innate immunity that prevents microbial invasion.
b. His injuries have inhibited his cellular immunity.
c. His injuries have impaired antibody production.
d. His injuries have induced the bone marrow, thus activating the immune system.

A second question builds on the first by describing the patient’s labs two days later, asking the students to develop an explanation for the subsequent lab results. (See Piontek, 2008 for the full question.)

Essay questions  can tap complex thinking by requiring students to organize and integrate information, interpret information, construct arguments, give explanations, evaluate the merit of ideas, and carry out other types of reasoning  (Cashin, 1987; Gronlund & Linn, 1990; McMillan, 2001; Thorndike, 1997; Worthen, et al., 1993).  Restricted response  essay questions are good for assessing basic knowledge and understanding and generally require a brief written response (e.g., “State two hypotheses about why birds migrate.  Summarize the evidence supporting each hypothesis” [Worthen, et al., 1993, p. 277].)  Extended response  essay items allow students to construct a variety of strategies, processes, interpretations and explanations for a question, such as the following:

The framers of the Constitution strove to create an effective national government that balanced the tension between majority rule and the rights of minorities. What aspects of American politics favor majority rule? What aspects protect the rights of those not in the majority? Drawing upon material from your readings and the lectures, did the framers successfully balance this tension? Why or why not? (Shipan, 2008).

In addition to measuring complex thinking and reasoning, advantages of essays include the potential for motivating better study habits and providing the students flexibility in their responses.  Instructors can evaluate how well students are able to communicate their reasoning with essay items, and they are usually less time consuming to construct than multiple-choice items that measure reasoning.

The major disadvantages of essays include the amount of time instructors must devote to reading and scoring student responses, and the importance of developing and using carefully constructed criteria/rubrics to ensure reliability of scoring. Essays can assess only a limited amount of content in one testing period/exam due to the length of time required for students to respond to each essay item. As a result, essays do not provide a good sampling of content knowledge across a curriculum (Gronlund & Linn, 1990; McMillan, 2001).

Guidelines for writing essay questions include the following (Gronlund & Linn, 1990; McMillan, 2001; Worthen, et al., 1993):

  • Restrict the use of essay questions to educational outcomes that are difficult to measure using other formats. For example, to test recall knowledge, true-false, fill-in-the-blank, or multiple-choice questions are better measures.
  • For example, Generalizations: State a set of principles that can explain the following events.
  • Synthesis : Write a well-organized report that shows…
  • Evaluation : Describe the strengths and weaknesses of…
  • Write the question clearly so that students do not feel that they are guessing at “what the instructor wants me to do.”
  • Indicate the amount of time and effort students should spend on each essay item.
  • Avoid giving students options for which essay questions they should answer. This choice decreases the validity and reliability of the test because each student is essentially taking a different exam.
  • Consider using several narrowly focused questions (rather than one broad question) that elicit different aspects of students’ skills and knowledge.
  • Make sure there is enough time to answer the questions.

Guidelines for scoring essay questions include the following (Gronlund & Linn, 1990; McMillan, 2001; Wiggins, 1998; Worthen, et al., 1993;  Writing and grading essay questions , 1990):

  • Outline what constitutes an expected answer.
  • Select an appropriate scoring method based on the criteria. A rubric is a scoring key that indicates the criteria for scoring and the number of points to be assigned for each criterion. A sample rubric for a take-home history exam question might look like the following:

| Criterion | Lowest level | Middle level | Highest level |
|---|---|---|---|
| Number of references to class reading sources | 0-2 references | 3-5 references | 6+ references |
| Historical accuracy | Lots of inaccuracies | Few inaccuracies | No apparent inaccuracies |
| Historical argument | No argument made; little evidence for argument | Argument is vague and unevenly supported by evidence | Argument is clear and well-supported by evidence |
| Proofreading | Many grammar and spelling errors | Few (1-2) grammar or spelling errors | No grammar or spelling errors |
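A rubric like this can also be encoded directly in a grading script. The sketch below is illustrative only: the criterion labels are abbreviated and the one-to-three point values per level are assumptions, since the sample rubric does not state point values.

```python
# Minimal sketch: encode the sample rubric as a scoring key and tally a total.
# The 1-3 points per level are assumptions for illustration only.
rubric_levels = {
    "References": ["0-2 references", "3-5 references", "6+ references"],
    "Historical accuracy": ["Lots of inaccuracies", "Few inaccuracies",
                            "No apparent inaccuracies"],
    "Historical argument": ["No argument; little evidence",
                            "Vague, unevenly supported",
                            "Clear and well supported"],
    "Proofreading": ["Many errors", "Few (1-2) errors", "No errors"],
}

def total_score(ratings):
    """ratings maps each criterion to a level index (0 = lowest, 2 = highest)."""
    return sum(level + 1 for level in ratings.values())   # 1-3 points per criterion

example_ratings = {criterion: 2 for criterion in rubric_levels}  # top level everywhere
print(total_score(example_ratings), "out of", 3 * len(rubric_levels))
```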

For other examples of rubrics, see CRLT Occasional Paper #24 (Piontek, 2008).

  • Clarify the role of writing mechanics and other factors independent of the educational outcomes being measured. For example, how does grammar or use of scientific notation figure into your scoring criteria?
  • Create anonymity for students’ responses while scoring and create a random order in which tests are graded (e.g., shuffle the pile) to increase accuracy of the scoring.
  • Use a systematic process for scoring each essay item.  Assessment guidelines suggest scoring all answers for an individual essay question in one continuous process, rather than scoring all answers to all questions for an individual student. This system makes it easier to remember the criteria for scoring each answer.

You can also use these guidelines for scoring essay items to create grading processes and rubrics for students' papers, oral presentations, course projects, and websites. For other grading strategies, see Responding to Student Writing – Principles & Practices and Commenting Effectively on Student Writing.

Cashin, W. E. (1987). Improving essay tests. Idea Paper, No. 17. Manhattan, KS: Center for Faculty Evaluation and Development, Kansas State University.

Gronlund, N. E., & Linn, R. L. (1990). Measurement and evaluation in teaching (6th ed.). New York: Macmillan Publishing Company.

Halpern, D. H., & Hakel, M. D. (2003). Applying the science of learning to the university and beyond. Change, 35(4), 37-41.

McKeachie, W. J., & Svinicki, M. D. (2006). Assessing, testing, and evaluating: Grading is not the most important function. In McKeachie's Teaching tips: Strategies, research, and theory for college and university teachers (12th ed., pp. 74-86). Boston: Houghton Mifflin Company.

McMillan, J. H. (2001). Classroom assessment: Principles and practice for effective instruction. Boston: Allyn and Bacon.

Park, J. (2008, February 4). Personal communication. University of Michigan College of Pharmacy.

Piontek, M. (2008). Best practices for designing and grading exams. CRLT Occasional Paper No. 24. Ann Arbor, MI: Center for Research on Learning and Teaching.

Shipan, C. (2008, February 4). Personal communication. University of Michigan Department of Political Science.

Svinicki, M. D. (1999). Evaluating and grading students. In Teachers and students: A sourcebook for UT-Austin faculty (pp. 1-14). Austin, TX: Center for Teaching Effectiveness, University of Texas at Austin.

Thorndike, R. M. (1997). Measurement and evaluation in psychology and education. Upper Saddle River, NJ: Prentice-Hall, Inc.

Wiggins, G. P. (1998). Educative assessment: Designing assessments to inform and improve student performance. San Francisco: Jossey-Bass Publishers.

Worthen, B. R., Borg, W. R., & White, K. R. (1993). Measurement and evaluation in the schools. New York: Longman.

Writing and grading essay questions. (1990, September). For Your Consideration, No. 7. Chapel Hill, NC: Center for Teaching and Learning, University of North Carolina at Chapel Hill.


Automated language essay scoring systems: a literature review

Mohamed Abdellatif Hussein

1 Information and Operations, National Center for Examination and Educational Evaluation, Cairo, Egypt

Hesham Hassan

2 Faculty of Computers and Information, Computer Science Department, Cairo University, Cairo, Egypt

Mohammad Nassef

Associated data.

The following information was supplied regarding data availability:

As this is a literature review, there was no raw data.

Writing composition is a significant factor for measuring test-takers' ability in any language exam. However, the assessment (scoring) of these writing compositions or essays is a very challenging process in terms of reliability and time. The need for objective and quick scores has raised the need for a computer system that can automatically grade essay questions targeting specific prompts. Automated Essay Scoring (AES) systems are used to overcome the challenges of scoring writing tasks by using Natural Language Processing (NLP) and machine learning techniques. The purpose of this paper is to review the literature on AES systems used for grading essay questions.

Methodology

We have reviewed the existing literature using Google Scholar, EBSCO and ERIC to search for the terms "AES", "Automated Essay Scoring", "Automated Essay Grading", or "Automatic Essay" for essays written in the English language. Two categories have been identified: handcrafted features and automatically featured AES systems. The systems of the former category are closely tied to the quality of the designed features. On the other hand, the systems of the latter category are based on the automatic learning of the features and relations between an essay and its score without any handcrafted features. We reviewed the systems of the two categories in terms of system primary focus, technique(s) used in the system, the need for training data, instructional application (feedback system), and the correlation between e-scores and human scores. The paper includes three main sections. First, we present a structured literature review of the available Handcrafted Features AES systems. Second, we present a structured literature review of the available Automatic Featuring AES systems. Finally, we draw a set of discussions and conclusions.

AES models have been found to utilize a broad range of manually-tuned shallow and deep linguistic features. AES systems have many strengths in reducing labor-intensive marking activities, ensuring a consistent application of scoring criteria, and ensuring the objectivity of scoring. Although many techniques have been implemented to improve AES systems, three primary challenges have been identified: the lack of the sense of the rater as a person, the potential for the systems to be deceived into giving an essay a lower or higher score than it deserves, and the limited ability to assess the creativity of ideas and propositions and to evaluate their practicality. Existing techniques have only addressed the first two challenges.

Introduction

Test items (questions) are usually classified into two types: selected-response (SR), and constructed-response (CR). The SR items, such as true/false, matching or multiple-choice, are much easier than the CR items in terms of objective scoring ( Isaacs et al., 2013 ). SR questions are commonly used for gathering information about knowledge, facts, higher-order thinking, and problem-solving skills. However, considerable skill is required to develop test items that measure analysis, evaluation, and other higher cognitive skills ( Stecher et al., 1997 ).

CR items, sometimes called open-ended, include two sub-types: restricted-response and extended-response items ( Nitko & Brookhart, 2007 ). Extended-response items, such as essays, problem-based examinations, and scenarios, are like restricted-response items, except that they extend the demands made on test-takers to include more complex situations, more difficult reasoning, and higher levels of understanding which are based on real-life situations requiring test-takers to apply their knowledge and skills to new settings or situations ( Isaacs et al., 2013 ).

In language tests, test-takers are usually required to write an essay about a given topic. Human raters score these essays based on specific scoring rubrics or schemes. The scores that different human raters assign to the same essay often vary substantially because human scoring is subjective (Peng, Ke & Xu, 2012). As the process of human scoring takes much time and effort and is not always as objective as required, there is a need for an automated essay scoring system that reduces cost and time and determines an accurate and reliable score.

Automated Essay Scoring (AES) systems usually utilize Natural Language Processing and machine learning techniques to automatically rate essays written for a target prompt ( Dikli, 2006 ). Many AES systems have been developed over the past decades. They focus on automatically analyzing the quality of the composition and assigning a score to the text. Typically, AES models exploit a wide range of manually-tuned shallow and deep linguistic features ( Farag, Yannakoudakis & Briscoe, 2018 ). Recent advances in the deep learning approach have shown that applying neural network approaches to AES systems has accomplished state-of-the-art results ( Page, 2003 ; Valenti, Neri & Cucchiarelli, 2017 ) with the additional benefit of using features that are automatically learnt from the data.

Survey methodology

The purpose of this paper is to review the AES systems literature pertaining to scoring extended-response items in language writing exams. Using Google Scholar, EBSCO and ERIC, we searched the terms “AES”, “Automated Essay Scoring”, “Automated Essay Grading”, or “Automatic Essay” for essays written in English language. AES systems which score objective or restricted-response items are excluded from the current research.

The most common models found for AES systems are based on Natural Language Processing (NLP), Bayesian text classification, Latent Semantic Analysis (LSA), or Neural Networks. We have categorized the reviewed AES systems into two main categories. The former is based on handcrafted discrete features bound to specific domains. The latter is based on automatic feature extraction. For instance, Artificial Neural Network (ANN)-based approaches are capable of automatically inducing dense syntactic and semantic features from a text.

The literature of the two categories has been structurally reviewed and evaluated based on certain factors including: system primary focus, technique(s) used in the system, the need for training data, instructional application (feedback system), and the correlation between e-scores and human scores.

Handcrafted features AES systems

Project Essay Grader™ (PEG)

Ellis Page developed PEG in 1966. PEG is considered the earliest AES system built in this field. It utilizes correlation coefficients to predict the intrinsic quality of the text. It uses the terms "trins" and "proxes" to assign a score: whereas "trins" refers to intrinsic variables like diction, fluency, punctuation, and grammar, "proxes" refers to approximations (correlations) of those intrinsic variables, such as the average length of words in a text and/or the text length (Dikli, 2006; Valenti, Neri & Cucchiarelli, 2017).

PEG uses a simple scoring methodology that consists of two stages: a training stage and a scoring stage. PEG should be trained on a sample of 100 to 400 essays; the output of the training stage is a set of coefficients (β weights) for the proxy variables from the regression equation. In the scoring stage, proxes are identified for each essay and inserted into the prediction equation. Finally, a score is determined by applying the coefficients (β weights) estimated in the training stage (Dikli, 2006).
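The following sketch illustrates the general prox-and-regression idea in miniature; the proxy features, essays, and scores are hypothetical and are not PEG's actual feature set.

```python
# Sketch of the two-stage prox-and-regression idea (simplified; not PEG's real proxies).
import numpy as np
from sklearn.linear_model import LinearRegression

def proxes(essay: str) -> list:
    words = essay.split()
    return [
        len(words),                                        # text length
        sum(len(w) for w in words) / max(len(words), 1),   # average word length
        essay.count(",") + essay.count(";"),               # punctuation counts
    ]

training_essays = ["First pre-scored sample essay ...",
                   "Second, somewhat longer pre-scored sample essay ..."]
human_scores = [3.0, 4.0]                                  # hypothetical pre-assigned scores

X = np.array([proxes(e) for e in training_essays])
model = LinearRegression().fit(X, human_scores)            # estimates the beta weights
print(model.predict(np.array([proxes("A new, unscored essay ...")])))
```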

Some issues have been raised as criticisms of PEG, such as disregarding the semantic side of essays, focusing on surface structures, and not working effectively when receiving student responses directly (which might ignore writing errors). PEG has a modified version released in 1990, which focuses on grammar checking, with a correlation between human assessors and the system of r = 0.87 (Dikli, 2006; Page, 1994; Refaat, Ewees & Eisa, 2012).

Measurement Inc. acquired the rights of PEG in 2002 and continued to develop it. The modified PEG analyzes the training essays and calculates more than 500 features that reflect intrinsic characteristics of writing, such as fluency, diction, grammar, and construction. Once the features have been calculated, the PEG uses them to build statistical and linguistic models for the accurate prediction of essay scores ( Home—Measurement Incorporated, 2019 ).

Intelligent Essay Assessor™ (IEA)

IEA was developed by Landauer (2003) . IEA uses a statistical combination of several measures to produce an overall score. It relies on using Latent Semantic Analysis (LSA); a machine-learning model of human understanding of the text that depends on the training and calibration methods of the model and the ways it is used tutorially ( Dikli, 2006 ; Foltz, Gilliam & Kendall, 2003 ; Refaat, Ewees & Eisa, 2012 ).

IEA can handle students’ innovative answers by using a mix of scored essays and the domain content text in the training stage. It also spots plagiarism and provides feedback (Dikli, 2006; Landauer, 2003). It uses a procedure for assigning scores that begins with comparing essays to each other in a set. LSA examines the extremely similar essays: irrespective of paraphrasing, synonym replacement, or reorganization of sentences, the two essays will appear similar to LSA. Plagiarism detection is an essential feature for addressing academic dishonesty, which is difficult for human raters to detect, especially when grading a large number of essays (Dikli, 2006; Landauer, 2003). Fig. 1 represents the IEA architecture (Landauer, 2003). IEA requires a smaller number of pre-scored essays for training: only 100 pre-scored training essays per prompt vs. 300–500 for other systems (Dikli, 2006).

Figure 1: IEA architecture (Landauer, 2003).

Landauer (2003) used IEA to score more than 800 students’ answers in middle school. The results showed a 0.90 correlation value between IEA and the human raters. He attributed the high correlation to several reasons, including that human raters could not compare each of the 800 essays to every other one, while IEA can do so (Dikli, 2006; Landauer, 2003).
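A minimal LSA-style sketch of this idea, using TF-IDF plus truncated SVD and the most similar pre-scored essays, is shown below; it is an illustration under assumed data, not IEA's actual implementation.

```python
# Illustrative LSA-style scoring (not IEA's actual implementation): project essays
# into a reduced semantic space and score a new essay from its most similar
# pre-scored neighbours. Essays and scores are hypothetical.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

scored_essays = ["first pre-scored essay about the prompt ...",
                 "second pre-scored essay about the prompt ...",
                 "third pre-scored essay about the prompt ..."]
scores = np.array([2.0, 4.0, 5.0])

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(scored_essays)
lsa = TruncatedSVD(n_components=2, random_state=0).fit(X)
train_vectors = lsa.transform(X)

new_vector = lsa.transform(tfidf.transform(["a new essay to be scored ..."]))
similarities = cosine_similarity(new_vector, train_vectors)[0]
nearest = similarities.argsort()[-2:]                  # the two most similar essays
print(scores[nearest].mean())                          # average score of the neighbours
```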

E-rater®

Educational Testing Services (ETS) developed E-rater in 1998 to estimate the quality of essays in various assessments. It relies on using a combination of statistical and NLP techniques to extract linguistic features (such as grammar, usage, mechanics, development) from text to start processing, then compares scores with human graded essays ( Attali & Burstein, 2014 ; Dikli, 2006 ; Ramineni & Williamson, 2018 ).

The E-rater system is upgraded annually. The current version uses 11 features divided into two areas: writing quality (grammar, usage, mechanics, style, organization, development, word choice, average word length, proper prepositions, and collocation usage), and content or use of prompt-specific vocabulary ( Ramineni & Williamson, 2018 ).

The E-rater scoring model consists of two stages: the model of the training stage, and the model of the evaluation stage. Human scores are used for training and evaluating the E-rater scoring models. The quality of the E-rater models and its effective functioning in an operational environment depend on the nature and quality of the training and evaluation data ( Williamson, Xi & Breyer, 2012 ). The correlation between human assessors and the system ranged from 0.87 to 0.94 ( Refaat, Ewees & Eisa, 2012 ).

Criterion SM

Criterion is a web-based scoring and feedback system based on ETS text analysis tools: E-rater ® and Critique. As a text analysis tool, Critique integrates a collection of modules that detect faults in usage, grammar, and mechanics, and recognizes discourse and undesirable style elements in writing. It provides immediate holistic scores as well ( Crozier & Kennedy, 1994 ; Dikli, 2006 ).

Criterion similarly gives personalized diagnostic feedback reports based on the types of assessment instructors give when they comment on students’ writings. This component of Criterion is called the advisory component. It is added to the score, but it does not control it. The types of feedback the advisory component may provide include the following:

  • The text is too brief (a student may write more).
  • The essay text does not look like other essays on the topic (the essay is off-topic).
  • The essay text is overly repetitive (the student may use more synonyms) (Crozier & Kennedy, 1994).

IntelliMetric™

Vantage Learning developed the IntelliMetric systems in 1998. It is considered the first AES system which relies on Artificial Intelligence (AI) to simulate the manual scoring process carried out by human-raters under the traditions of cognitive processing, computational linguistics, and classification ( Dikli, 2006 ; Refaat, Ewees & Eisa, 2012 ).

IntelliMetric relies on using a combination of Artificial Intelligence (AI), Natural Language Processing (NLP) techniques, and statistical techniques. It uses CogniSearch and Quantum Reasoning technologies that were designed to enable IntelliMetric to understand the natural language to support essay scoring ( Dikli, 2006 ).

IntelliMetric uses three steps to score essays, as follows:

  • a) First, the training step provides the system with essays with known scores.
  • b) Second, the validation step examines the scoring model against a smaller set of essays with known scores.
  • c) Finally, the model is applied to new essays with unknown scores (Learning, 2000; Learning, 2003; Shermis & Barrera, 2002).

IntelliMetric identifies text related characteristics as larger categories called Latent Semantic Dimensions (LSD). ( Figure 2 ) represents the IntelliMetric features model.

Figure 2: The IntelliMetric features model.

IntelliMetric scores essays in several languages including English, French, German, Arabic, Hebrew, Portuguese, Spanish, Dutch, Italian, and Japanese ( Elliot, 2003 ). According to Rudner, Garcia, and Welch ( Rudner, Garcia & Welch, 2006 ), the average of the correlations between IntelliMetric and human-raters was 0.83 ( Refaat, Ewees & Eisa, 2012 ).

MY Access is a web-based writing assessment system based on the IntelliMetric AES system. The primary aim of this system is to provide immediate scoring and diagnostic feedback for the students’ writings in order to motivate them to improve their writing proficiency on the topic ( Dikli, 2006 ).

MY Access system contains more than 200 prompts that assist in an immediate analysis of the essay. It can provide personalized Spanish and Chinese feedback on several genres of writing such as narrative, persuasive, and informative essays. Moreover, it provides multilevel feedback—developing, proficient, and advanced—as well ( Dikli, 2006 ; Learning, 2003 ).

Bayesian Essay Test Scoring System™ (BETSY)

BETSY classifies text based on trained material. It was developed in 2002 by Lawrence Rudner at the College Park campus of the University of Maryland with funds from the US Department of Education (Valenti, Neri & Cucchiarelli, 2017). It was designed to automate essay scoring, but can be applied to any text classification task (Taylor, 2005).

BETSY needs to be trained on a huge number (1,000 texts) of human-classified essays to learn how to classify new essays. The goal of the system is to determine the most likely classification of an essay into a set of groups (Pass-Fail) and (Advanced - Proficient - Basic - Below Basic) (Dikli, 2006; Valenti, Neri & Cucchiarelli, 2017). It learns how to classify a new document through the following steps:

The first step, word training, is concerned with training on words: evaluating database statistics, eliminating infrequent words, and determining stop words.

The second step, word-pair training, is concerned with evaluating database statistics, eliminating infrequent word pairs, optionally scoring the training set, and trimming misclassified training sets.

Finally, BETSY can be applied to a set of experimental texts to identify the classification precision for several new texts or a single text (Dikli, 2006).

BETSY has achieved accuracy of over 80%, when trained with 462 essays, and tested with 80 essays ( Rudner & Liang, 2002 ).
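The sketch below shows Bayesian text classification in the spirit of BETSY, using a multinomial Naive Bayes model over words and word pairs; the training essays and labels are hypothetical, and this is not BETSY's actual code.

```python
# Sketch of Bayesian text classification in the spirit of BETSY (not its actual
# implementation): multinomial Naive Bayes over words and word pairs.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

training_essays = ["well organized essay with clear evidence and argument ...",
                   "short unsupported answer with little detail ...",
                   "thorough, detailed and well argued essay ...",
                   "off topic response with many errors ..."]
labels = ["Pass", "Fail", "Pass", "Fail"]              # hypothetical training labels

classifier = make_pipeline(CountVectorizer(ngram_range=(1, 2)),  # words and word pairs
                           MultinomialNB())
classifier.fit(training_essays, labels)
print(classifier.predict(["a detailed, well argued and well supported essay ..."]))
```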

Automatic featuring AES systems

Automatic text scoring using neural networks

Alikaniotis, Yannakoudakis, and Rei introduced in 2016 a deep neural network model capable of learning features automatically to score essays. This model has introduced a novel method to identify the more discriminative regions of the text using: (1) a Score-Specific Word Embedding (SSWE) to represent words and (2) a two-layer Bidirectional Long-Short-Term Memory (LSTM) network to learn essay representations. ( Alikaniotis, Yannakoudakis & Rei, 2016 ; Taghipour & Ng, 2016 ).

Alikaniotis and his colleagues extended the C&W Embeddings model into the Augmented C&W model to capture not only the local linguistic environment of each word, but also how each word contributes to the overall score of an essay. In order to capture SSWEs, a further linear unit was added in the output layer of the previous model, which performs linear regression, predicting the essay score (Alikaniotis, Yannakoudakis & Rei, 2016). Figure 3 shows the architectures of the two models, (A) the original C&W model and (B) the Augmented C&W model. Figure 4 contrasts (A) standard neural embeddings with (B) SSWE word embeddings.

Figure 3: (A) Original C&W model. (B) Augmented C&W model.

Figure 4: (A) Standard neural embeddings. (B) SSWE word embeddings.

The SSWEs obtained by their model were used to derive continuous representations for each essay. Each essay is identified as a sequence of tokens. The uni- and bi-directional LSTMs have been used efficiently for embedding long sequences (Alikaniotis, Yannakoudakis & Rei, 2016).

They used the Kaggle ASAP contest dataset ( https://www.kaggle.com/c/asap-aes/data ). It consists of 12,976 essays, with an average length of 150 to 550 words per essay, each double-marked (Cohen's κ = 0.86). The essays covered eight different prompts, each with distinct marking criteria and score range.

Results showed that the SSWE and LSTM approach, without any prior knowledge of the language grammar or the text domain, was able to mark the essays in a very human-like way, beating other state-of-the-art systems. Apart from tuning the models' hyperparameters on a separate validation set (Alikaniotis, Yannakoudakis & Rei, 2016), they did not perform any preprocessing of the text other than simple tokenization. The combination of SSWE and LSTM also outperformed the traditional SVM model; LSTM alone, by contrast, did not give significantly higher accuracy than SVM.

According to Alikaniotis, Yannakoudakis, and Rei ( Alikaniotis, Yannakoudakis & Rei, 2016 ), the combination of SSWE with the two-layer bi-directional LSTM had the highest correlation value on the test set averaged 0.91 (Spearman) and 0.96 (Pearson).
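For orientation, the following is a minimal Keras sketch of a two-layer bidirectional LSTM score regressor. It uses ordinary trainable embeddings rather than the score-specific word embeddings (SSWE) learned in the paper, and all sizes and hyperparameters are assumptions.

```python
# Minimal Keras sketch of a two-layer bidirectional LSTM essay-score regressor.
# Ordinary trainable embeddings are used here, not the SSWEs of the paper.
from tensorflow.keras import layers, models

VOCAB_SIZE, MAX_LEN, EMB_DIM = 20000, 500, 100         # assumed sizes

model = models.Sequential([
    layers.Input(shape=(MAX_LEN,)),
    layers.Embedding(VOCAB_SIZE, EMB_DIM),
    layers.Bidirectional(layers.LSTM(128, return_sequences=True)),
    layers.Bidirectional(layers.LSTM(128)),            # two-layer bidirectional LSTM
    layers.Dense(1, activation="sigmoid"),             # essay score scaled to [0, 1]
])
model.compile(optimizer="adam", loss="mse")
```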

A neural network approach to automated essay scoring

Taghipour and H. T. Ng developed in 2016 a Recurrent Neural Network (RNN) approach which automatically learns the relation between an essay and its grade. Since the system is based on RNNs, it can use non-linear neural layers to identify complex patterns in the data and learn them, and encode all the information required for essay evaluation and scoring (Taghipour & Ng, 2016).

The designed model architecture can be presented in five layers as follows:

  • a) The Lookup Table Layer, which builds a d_LT-dimensional space containing the projection of each word.
  • b) The Convolution Layer; which extracts feature vectors from n-grams. It can possibly capture local contextual dependencies in writing and therefore enhance the performance of the system.
  • c) The Recurrent Layer; which processes the input to generate a representation for the given essay.
  • d) The Mean over Time; which aggregates the variable number of inputs into a fixed length vector.
  • e) The Linear Layer with Sigmoid Activation; which maps the generated output vector from the mean-over-time layer to a scalar value ( Taghipour & Ng, 2016 ).

Taghipour and his colleagues used the Kaggle ASAP contest dataset. They split the data into a 60% training set, a 20% development set, and a 20% test set. They used Quadratic Weighted Kappa (QWK) as the evaluation metric. To evaluate the performance of the system, they compared it to an available open-source AES system called the ‘Enhanced AI Scoring Engine’ (EASE) ( https://github.com/edx/ease ). To identify the best model, they performed several experiments, such as Convolutional vs. Recurrent Neural Network, basic RNN vs. Gated Recurrent Units (GRU) vs. LSTM, unidirectional vs. bidirectional LSTM, and with vs. without the mean-over-time layer (Taghipour & Ng, 2016).

The results showed multiple observations, according to Taghipour & Ng (2016), summarized as follows:

  • a) RNN failed to achieve results as accurate as LSTM or GRU, and the other models outperformed it. This was possibly due to the relatively long sequences of words in writing.
  • b) The neural network's performance was significantly degraded in the absence of the mean-over-time layer; as a result, it did not learn the task properly.
  • c) The best model was the combination of ten instances of LSTM models with ten instances of CNN models. This new model outperformed the baseline EASE system by 5.6%, with an averaged QWK value of 0.76.
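QWK, the agreement metric reported above, is Cohen's kappa with quadratic weights and can be computed directly; the score vectors below are hypothetical.

```python
# Quadratic Weighted Kappa (QWK) computed as Cohen's kappa with quadratic weights.
from sklearn.metrics import cohen_kappa_score

human_scores  = [2, 3, 4, 4, 1, 5, 3, 2]               # hypothetical human scores
system_scores = [2, 3, 3, 4, 2, 5, 4, 2]               # hypothetical system scores

qwk = cohen_kappa_score(human_scores, system_scores, weights="quadratic")
print(round(qwk, 3))
```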

Automatic features for essay scoring—an empirical study

In 2016, Dong and Zhang presented an empirical study examining a neural network method that learns syntactic and semantic features for AES automatically, without the need for external pre-processing. They built a hierarchical Convolutional Neural Network (CNN) structure with two levels in order to model sentences separately ( Dasgupta et al., 2018 ; Dong & Zhang, 2016 ).

Dong and his colleague built a model with two parts, summarized as follows (a minimal sketch appears after the list):

  • a) Word Representations: word embeddings are used, without relying on POS-tagging or other pre-processing.
  • b) CNN Model: essay scoring is treated as a regression task using a two-layer CNN, in which one convolutional layer extracts sentence representations and a second, stacked on the sentence vectors, learns essay representations.
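The two-level idea could be sketched as follows: a word-level convolution pools each sentence into a vector, and a second convolution over those sentence vectors yields the essay representation used for score regression. This is not the authors' code; the vocabulary size, embedding dimension, filter sizes, and padding limits are assumed.

```python
from tensorflow.keras import layers, models

vocab_size, embed_dim = 4000, 50      # assumed
max_sents, max_words = 40, 50         # assumed padding limits per essay / per sentence

essay_in = layers.Input(shape=(max_sents, max_words), dtype="int32")
x = layers.Embedding(vocab_size, embed_dim)(essay_in)            # (sents, words, dim)
# Level 1: convolution + pooling over words -> one vector per sentence.
x = layers.TimeDistributed(layers.Conv1D(100, 5, activation="relu"))(x)
sent_vecs = layers.TimeDistributed(layers.GlobalMaxPooling1D())(x)
# Level 2: convolution + pooling over sentence vectors -> essay representation.
x = layers.Conv1D(100, 3, activation="relu")(sent_vecs)
essay_vec = layers.GlobalAveragePooling1D()(x)
score = layers.Dense(1, activation="sigmoid")(essay_vec)         # normalized score

model = models.Model(essay_in, score)
model.compile(optimizer="adam", loss="mse")
```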

The dataset employed in the experiments was again Kaggle's ASAP contest dataset, with data preparation following the settings of Phandi, Chai & Ng (2015). For the domain adaptation (cross-domain) experiments, they also followed Phandi, Chai & Ng (2015) and picked four pairs of essay prompts, namely 1→2, 3→4, 5→6, and 7→8, where 1→2 denotes prompt 1 as the source domain and prompt 2 as the target domain. They used Quadratic Weighted Kappa (QWK) as the evaluation metric.

To evaluate the performance of the system, they compared it to the EASE system (a publicly available open-source AES) with both of its models, Bayesian Linear Ridge Regression (BLRR) and Support Vector Regression (SVR).

The empirical results showed that the two-layer Convolutional Neural Network (CNN) outperformed the other baselines (e.g., Bayesian Linear Ridge Regression) on both in-domain and domain-adaptation experiments on Kaggle's ASAP contest dataset. Thus, the neural features learned by the CNN were very effective in essay marking, capturing more high-level and abstract information than manual feature templates. The in-domain average QWK was 0.73, versus 0.75 for the human rater ( Dong & Zhang, 2016 ).

Augmenting textual qualitative features in deep convolution recurrent neural network for automatic essay scoring

In 2018, Dasgupta et al. proposed a qualitatively enhanced Deep Convolution Recurrent Neural Network architecture to score essays automatically. The model considers both word- and sentence-level representations. Using a hierarchical CNN connected to a Bidirectional LSTM, they were able to incorporate linguistic, psychological, and cognitive feature embeddings within a text ( Dasgupta et al., 2018 ).

The designed model architecture for the linguistically informed Convolution RNN can be described in five layers, as follows (a minimal code sketch appears after the list):

  • a) Generating Embeddings Layer: constructs sentence vectors from previously trained embeddings; the vector for each sentence of the input essay is appended with a vector of the linguistic features determined for that sentence.
  • b) Convolution Layer: for a given sequence of vectors with K windows, this layer applies a linear transformation to all K windows. It is fed with the embeddings generated by the previous layer.
  • c) Long Short-Term Memory Layer: examines both past and future sequence context through a Bidirectional LSTM (Bi-LSTM) network.
  • d) Activation Layer: takes the intermediate hidden states h_1, h_2, …, h_T produced by the Bi-LSTM layer and, in order to calculate the weight of each sentence's contribution to the final essay score (essay quality), applies an attention pooling layer over the sentence representations.
  • e) The Sigmoid Activation Function Layer: performs a linear transformation of the input vector, converting it to a continuous scalar value ( Dasgupta et al., 2018 ).
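One way to realize this flow is sketched below, under assumed sizes (it is not the authors' implementation): pre-computed sentence vectors concatenated with per-sentence linguistic features pass through a convolution, a Bi-LSTM, and an attention-pooling step before the final sigmoid scoring unit.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

max_sents, sent_dim, ling_dim = 40, 100, 20   # assumed: sentences per essay, vector sizes

sent_vecs = layers.Input(shape=(max_sents, sent_dim))    # pre-trained sentence vectors
ling_feats = layers.Input(shape=(max_sents, ling_dim))   # per-sentence linguistic features

# (a) generating embeddings: append linguistic features to each sentence vector
x = layers.Concatenate(axis=-1)([sent_vecs, ling_feats])
# (b) convolution over windows of sentences
x = layers.Conv1D(100, 3, padding="same", activation="relu")(x)
# (c) bidirectional LSTM over the convolved sequence (past and future context)
h = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)
# (d) attention pooling: weight each sentence's contribution to the essay score
att = layers.Dense(1)(h)                       # unnormalized attention scores
att = layers.Softmax(axis=1)(att)              # normalize over the sentence axis
essay_vec = layers.Lambda(lambda t: tf.reduce_sum(t[0] * t[1], axis=1))([h, att])
# (e) sigmoid unit maps the pooled vector to a continuous scalar score
score = layers.Dense(1, activation="sigmoid")(essay_vec)

model = models.Model([sent_vecs, ling_feats], score)
model.compile(optimizer="adam", loss="mse")
```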

Figure 5 represents the proposed linguistically informed Convolution Recurrent Neural Network architecture.


Dasgupta and his colleagues also employed Kaggle's ASAP contest dataset in their experiments. They assessed their models using 7-fold cross-validation; in each fold, 80% of the data was used for training, 10% for development, and the remaining 10% for testing. They used Quadratic Weighted Kappa (QWK) as the evaluation metric.
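A fold layout of this kind could be produced, for instance, with repeated random splits; the snippet below is purely illustrative (placeholder data, not the authors' procedure).

```python
from sklearn.model_selection import train_test_split

essays = list(range(1000))           # placeholder essay IDs
scores = [i % 5 for i in essays]     # placeholder scores

folds = []
for seed in range(7):                # seven folds, each an 80/10/10 split
    train_x, rest_x, train_y, rest_y = train_test_split(
        essays, scores, test_size=0.2, random_state=seed)
    dev_x, test_x, dev_y, test_y = train_test_split(
        rest_x, rest_y, test_size=0.5, random_state=seed)
    folds.append((train_x, dev_x, test_x))
```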

The results showed that, in terms of all these measures, the Qualitatively Enhanced Deep Convolution LSTM (Qe-C-LSTM) system performed better than the existing LSTM, Bi-LSTM, and EASE models. It achieved Pearson's and Spearman's correlations of 0.94 and 0.97, respectively, compared with 0.91 and 0.96 in Alikaniotis, Yannakoudakis & Rei (2016) . They also obtained an RMSE of 2.09 and computed a pairwise Cohen's κ of 0.97 ( Dasgupta et al., 2018 ).

Summary and Discussion

Over the past four decades, several studies have examined approaches to applying computer technologies to the scoring of essay questions. Recently, computer technologies have become capable of assessing the quality of writing using AES technology, and many attempts have been made to develop AES systems over the years ( Dikli, 2006 ).

The AES systems do not assess the intrinsic qualities of an essay directly as human raters do; rather, they utilize correlations with those intrinsic qualities to predict the score to be assigned to an essay. The performance of these systems is evaluated by comparing the scores they assign to a set of essays with the scores given by expert human raters.

The AES systems have many strengths, mainly in reducing labor-intensive marking activities, saving time and cost, and improving the reliability of writing-task assessment. Besides, they ensure a consistent application of marking criteria, therefore facilitating equity in scoring. However, substantial manual effort is involved in reaching these results across different domains, genres, prompts, and so forth. Moreover, the linguistic features intended to capture the aspects of writing to be assessed are hand-selected and tuned for specific domains. In order to perform well on different data, separate models with distinct feature sets typically have to be tuned ( Burstein, 2003 ; Dikli, 2006 ; Hamp-Lyons, 2001 ; Rudner & Gagne, 2001 ; Rudner & Liang, 2002 ). Despite their weaknesses, AES systems continue to attract the attention of public schools, universities, testing agencies, researchers and educators ( Dikli, 2006 ).

The AES systems described in this paper under the first category are based on handcrafted features and, usually, rely on regression methods. They employ several methods to obtain the scores. While E-rater and IntelliMetric use NLP techniques, the IEA system utilizes LSA. Moreover, PEG utilizes proxy measures (proxes), and BETSY™ uses Bayesian procedures to evaluate the quality of a text.

While E-rater, IntelliMetric, and BETSY evaluate both the style and the semantic content of essays, PEG evaluates only style and ignores the semantic aspect, and IEA is exclusively concerned with semantic content. In contrast with BETSY, which needs a huge number of pre-scored training essays, PEG, E-rater, IntelliMetric, and IEA need smaller numbers of pre-scored essays for training.

The systems in the first category show high correlations with human raters. While PEG, E-rater, IEA, and BETSY evaluate only essay responses written in English, IntelliMetric evaluates essay responses in multiple languages.

Contrary to PEG, IEA, and BETSY, E-rater and IntelliMetric have instructional or immediate-feedback applications (i.e., Criterion and MY Access!, respectively). Instructional AES systems aim to provide formative assessment by allowing students to save their writing drafts on the system. Thus, students can revise their writing in light of the formative feedback received from either the system or the teacher. The recent version of MY Access! (6.0) provides online portfolios and peer review.

The drawbacks of this category may include the following: (a) feature engineering can be time-consuming, since features need to be carefully handcrafted and selected to fit the appropriate model, and (b) such systems are sparse and instantiated by discrete pattern-matching.

AES systems described in this paper under the second category are usually based on neural networks. Neural network approaches, especially Deep Learning techniques, have been shown to be capable of inducing dense syntactic and semantic features automatically and applying them to text analysis and classification problems, including AES systems ( Alikaniotis, Yannakoudakis & Rei, 2016 ; Dong & Zhang, 2016 ; Taghipour & Ng, 2016 ), giving better results than the statistical models based on handcrafted features ( Dong & Zhang, 2016 ).

Recent advances in Deep Learning have shown that neural approaches to AES achieve state-of-the-art results ( Alikaniotis, Yannakoudakis & Rei, 2016 ; Taghipour & Ng, 2016 ) with the additional advantage of utilizing features that are automatically learned from the data. In order to facilitate interpretability of neural models, a number of visualization techniques have been proposed to identify textual (superficial) features that contribute to model performance [7].

While Alikaniotis and his colleagues (2016) employed a two-layer Bidirectional LSTM combined with SSWE for the essay scoring task, Taghipour & Ng (2016) adopted an LSTM model and combined it with a CNN, and Dong & Zhang (2016) developed a two-layer CNN. Dasgupta and his colleagues (2018) proposed a Qualitatively Enhanced Deep Convolution LSTM. Unlike Alikaniotis and his colleagues (2016), Taghipour & Ng (2016), and Dong & Zhang (2016), Dasgupta and his colleagues (2018) were interested in word-level and sentence-level representations as well as linguistic, cognitive, and psychological feature embeddings. All linguistic and qualitative features were computed off-line and then fed into the Deep Learning architecture.

Although Deep Learning-based approaches have achieved better performance than the previous approaches, they may still fall short in handling the complex linguistic and cognitive characteristics that are very important in modeling such essays. See Table 1 for a comparison of the AES systems.

Table 1: Comparison of AES systems.

System / Authors | Year | Aspects evaluated | Technique | Requires training (no. of pre-scored essays) | Instructional application | Reported agreement with human raters
PEG™ (Ellis Page) | 1966 | Style | Statistical | Yes (100–400) | No | 0.87
IEA™ (Landauer, Foltz, & Laham) | 1997 | Content | LSA (KAT engine by PEARSON) | Yes (∼100) | Yes | 0.90
E-rater (ETS development team) | 1998 | Style & content | NLP | Yes (∼400) | Yes (Criterion) | ∼0.91
IntelliMetric™ (Vantage Learning) | 1998 | Style & content | NLP | Yes (∼300) | Yes (MY Access!) | ∼0.83
BETSY™ (Rudner) | 1998 | Style & content | Bayesian text classification | Yes (1,000) | No | ∼0.80
Alikaniotis, Yannakoudakis, and Rei | 2016 | Style & content | SSWE + two-layer Bi-LSTM | Yes (∼8,000) | No | ∼0.91 (Spearman), ∼0.96 (Pearson)
Taghipour and Ng | 2016 | Style & content | Adopted LSTM | Yes (∼7,786) | No | QWK ∼0.761
Dong and Zhang | 2016 | Syntactic and semantic features | Word embeddings + two-layer CNN | Yes (∼1,500 to ∼1,800) | No | Average kappa ∼0.734 vs. 0.754 for human raters
Dasgupta, Naskar, Dey, & Saha | 2018 | Style, content, linguistic and psychological | Deep Convolution Recurrent Neural Network | Yes (∼8,000 to 10,000) | No | Pearson's 0.94 and Spearman's 0.97

In general, there are three primary challenges to AES systems. First, they are not able to assess essays as human raters do, because they only do what they have been programmed to do ( Page, 2003 ). They eliminate the human element in writing assessment and lack the sense of the rater as a person ( Hamp-Lyons, 2001 ). This shortcoming has been partly offset by the high correlations obtained between computer and human raters ( Page, 2003 ), although it remains a challenge.

The second challenge is whether the computer can be fooled by students ( Dikli, 2006 ). It is possible to "trick" the system, for example by writing a longer essay to obtain a higher score ( Kukich, 2000 ). Studies, such as the GRE study in 2001, examined whether a computer could be deceived into assigning an essay a lower or higher score than it deserves. The results revealed that it might reward a poor essay ( Dikli, 2006 ). The developers of AES systems have been utilizing algorithms to detect students who try to cheat.

Although automatic-learning AES systems are based on Neural Network algorithms, handcrafted AES systems surpass them in one important respect. Handcrafted systems are closely tied to the scoring rubrics designed as the criteria for assessing a specific essay, and human raters use these same rubrics to score essays as well. The objectivity of human raters is measured by their commitment to the scoring rubrics. In contrast, automatic-learning systems extract the scoring criteria using machine learning and neural networks, which may include factors that are not part of the scoring rubric and are therefore reminiscent of raters' subjectivity (e.g., mood or the nature of a rater's character). Considering this point, handcrafted AES systems may be considered more objective and fairer to students from the viewpoint of educational assessment.

The third challenge is measuring the creativity of human writing. Assessing the creativity of ideas and propositions and evaluating their practicality remains a pending challenge for both categories of AES systems and still needs further research.

Funding Statement

The authors received no funding for this work.

Additional Information and Declarations

The authors declare there are no competing interests.

Mohamed Abdellatif Hussein conceived and designed the experiments, performed the experiments, analyzed the data, contributed reagents/materials/analysis tools, prepared figures and/or tables.

Hesham Hassan and Mohammad Nassef authored or reviewed drafts of the paper, approved the final draft.
