CALPER Language Assessment

Center for Advanced Language Proficiency Education and Research at The Pennsylvania State University

Assessment Terminology

Like many fields, assessment has its own specific vocabulary. Below you will find an explanation of many key terms in assessment.

Evaluation, Assessment, and Testing are often used interchangeably in everyday conversations, but each has particular meanings among assessment professionals and researchers. (Note that in addition to its particular meaning, assessment can also encompass all three terms, as when we refer to assessment terminology or assessment professionals.)

  • Evaluation is the broadest of the three concepts because it concerns the programmatic level. Thus, evaluation is the systematic process of gathering and interpreting information about an entire program, including learners’ and teachers’ attitudes, the curriculum being followed and materials being used, teachers’ professional development opportunities, how effectively the program is meeting its stated objectives, etc.
  • Assessment is narrower in scope than evaluation as it focuses on learner achievement and the processes of learner development. The term assessment comes from the Latin assidere, to sit beside, and thus involves observing and gathering information about learners. This observation and information gathering can be accomplished through a variety of approaches, but the most well known form of assessment today is testing.
  • Testing involves the use of a formal assessment instrument to measure learners’ knowledge and abilities. In many cases, tests themselves are standardized so that all learners are asked precisely the same questions in precisely the same order and under specified conditions (including, for example, time limitations and restrictions on supporting materials like dictionaries or grammar references that may or may not be used during the test).

Assessment purpose refers to the reason for conducting the assessment and suggests how the information obtained through the assessment process will be used. Shepard (2000) has identified three major categories of assessment purposes: administrative, instructional, and research.

  • Administrative assessments are used to screen applicants for acceptance into a program, to place learners at appropriate levels of study, to certify competencies or mastery, and to promote individuals.
  • Instructional assessments are used to diagnose learners’ strengths and weaknesses, to provide evidence of learner progress, to offer feedback to teachers and students, and to evaluate the curriculum.
  • Research assessments involve experimentation designed to understand processes of language learning and language use.

Classroom assessment and external assessment are two ways of categorizing instructional assessments and refer to the agents who design and/or administer the assessment. Classroom assessments are used by individual teachers with their students while external assessments are created by private companies or government agencies and administered on a very large scale (e.g., the TOEFL, the SAT).


Standards are descriptive statements of what learners must know or be able to do in order to demonstrate competence or proficiency at various levels within a domain of study. Standards are intended to represent up-to-date theory and knowledge within the domain and often serve as the basis for educational programs. Assessments are then employed to determine whether the standards are being met.

  • Content or curriculum standards define the essential knowledge that all students must master.
  • Performance standards specify the quality of performance that learners must display in order to demonstrate mastery or proficiency.
  • Opportunity-to-learn or School-delivery standards describe the resources needed to meet the content and performance standards.

Alternative or Complementary Assessment are terms that were introduced as various approaches to assessment became more popular. These terms were meant to set these approaches apart from testing, which has traditionally been the dominant form of assessment. Examples of alternative assessments include projects, portfolios, games, debates, interviews, and learner presentations. Alternative assessments are often described as authentic and performance-based.


Authentic assessment is a term associated with alternative assessments because the kinds of tasks learners are asked to perform are intended to better reflect the demands and contexts of everyday life than do the questions that make up traditional tests.


Performance assessment means that learners are asked to demonstrate their language knowledge or abilities in some way other than answering traditional test questions; that is, they are asked to do something using the language. This might involve creating a product or performing a certain language function.


Assessment Program means that classroom-based assessment is best conceived of as an ongoing process rather than a one-off testing episode. An effective assessment program will make use of multiple assessment instruments and will include a focus on both product (what learners are able to do) and process (how learners orient to assessment tasks, strategies they employ) throughout the course of study.


Achievement testing/assessment involves assessing what students have learned in a particular course or program of study.

Proficiency testing/assessment refers to the assessment of knowledge or ability within a given domain but not restricted to any course or program (e.g., general or overall language ability).

Summative assessment occurs at the end of the program of study.

Formative assessment occurs during the program of study and is used to inform subsequent teaching and learning.


Rubrics include various dimensions of a task presented as hierarchical descriptors in order to inform assessment decisions. For example, one dimension of a writing task might be the coherence and organization of the piece. A rubric would explain to students what constitutes excellent (good, average, poor, etc.) coherence and organization.

Rating Scales are similar to rubrics in that they provide criteria for determining the quality of performance.

  • Holistic rating scales assign a global rating to the entire performance.
  • Analytic rating scales focus on specific components of performance, such as fluency or accuracy.

Validity involves whether an assessment actually assesses the knowledge or abilities it is intended to assess.

  • Content validity is concerned with the extent to which the assessment represents the content that one wishes to assess. Content validity is particularly relevant to the classroom because it examines how well an assessment is connected to a curriculum or set of standards.
  • Construct validity has to do with whether the assessment is in line with a current theory or understanding of the ability being assessed.
  • Concurrent validity reveals how well performance on one assessment matches performance on another assessment thought to measure the same knowledge or abilities.
  • Predictive validity examines whether performance on an assessment can be used to accurately predict an individual’s performance on given tasks or in given contexts.
  • Face validity simply refers to whether the assessment looks like it assesses what it is intended to assess.

Reliability refers to the consistency of an assessment score or grade. Reliability recognizes that the view of abilities that emerges from an assessment procedure may also reflect other variables. In testing theory, these other variables are considered sources of measurement error because they obscure individuals’ true abilities. For this reason, an observed test score is understood to be composed of a true score and error. Assessment procedures typically try to control for or minimize error.
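
In classical test theory this relationship is often summarized with the simple formula X = T + E, where X is the observed score, T is the true score, and E is measurement error; the smaller the error component, the more consistently a score reflects the test taker’s true ability.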

  • Inter-rater reliability is the degree of agreement among different raters applying given criteria to the same performances.
  • Intra-rater reliability is the degree to which the same rater consistently evaluates performances.
  • Test-retest reliability concerns the stability of test scores from one administration of the test to the next.

Internal consistency is the extent to which individual test items are thought to measure the same ability.

Test item analysis helps to determine the quality of individual items on a test. Each test item should reveal some feature of the knowledge or abilities under assessment. Items are described according to their level of difficulty and their discriminating power. This kind of analysis is especially important when an assessment is intended to group or classify test takers.

Level of difficulty of a test item is determined by how many test takers answered the item correctly. If most or all test takers answered an item correctly, it may be too easy (or, conversely, an item may be too difficult if few or no test takers got it right).
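
For instance, using invented numbers purely for illustration: if 45 of 50 test takers answer an item correctly, its difficulty index is 45 / 50 = 0.90, suggesting a very easy item; if only 5 of 50 answer it correctly, the index is 5 / 50 = 0.10, suggesting a very difficult one.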

Discrimination index is calculated for each test item to reveal whether individual items distinguish well between weaker and stronger learners. An item that has good discriminating power will be answered correctly by test takers who attain high scores on the test but not by those who score poorly.
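
To make these two indices concrete, the short Python sketch below computes them for a small set of invented scored responses (1 = correct, 0 = incorrect). The upper/lower-group method used for the discrimination index is one common approach rather than the only one, and the data, function names, and 27% group size are illustrative assumptions, not anything prescribed above.

    # Minimal sketch of classic item analysis: difficulty and discrimination.
    # All data below are invented for illustration; rows are test takers,
    # columns are items, 1 = correct, 0 = incorrect.

    def item_difficulty(item_responses):
        """Proportion of test takers who answered the item correctly."""
        return sum(item_responses) / len(item_responses)

    def discrimination_index(item_responses, total_scores, group_fraction=0.27):
        """Difference in item difficulty between high- and low-scoring groups.

        Sort test takers by total test score, take the top and bottom fractions
        (27% is a conventional choice), and subtract the item's difficulty in
        the low group from its difficulty in the high group. Values near +1
        indicate strong discrimination between weaker and stronger test takers.
        """
        ranked = sorted(zip(total_scores, item_responses), key=lambda pair: pair[0])
        n = max(1, round(group_fraction * len(ranked)))
        low_group = [resp for _, resp in ranked[:n]]
        high_group = [resp for _, resp in ranked[-n:]]
        return item_difficulty(high_group) - item_difficulty(low_group)

    if __name__ == "__main__":
        # Invented responses for 6 test takers on 3 items.
        responses = [
            [1, 1, 0],
            [1, 0, 0],
            [1, 1, 1],
            [0, 0, 0],
            [1, 1, 1],
            [1, 0, 1],
        ]
        totals = [sum(row) for row in responses]
        for i in range(3):
            item = [row[i] for row in responses]
            print(f"Item {i + 1}: difficulty = {item_difficulty(item):.2f}, "
                  f"discrimination = {discrimination_index(item, totals):.2f}")

In such an analysis, an item whose difficulty is very high or very low, or whose discrimination index is near zero or negative, would normally be flagged for review or revision.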
