Validity

Validity has long been one of the major deities in the pantheon of the psychometrician. It is universally praised, but the good works done in its name are remarkably few.
— Robert Ebel

As noted by Ebel [1961], validity is considered the most important feature of a testing program. Yet, despite its status, it often does not receive the attention it deserves. As a result, tests can end up being misaligned or unrelated to what they are intended to measure, with scores that have limited meaning or usefulness.

Consider studying diligently for an introductory measurement exam that then assesses topics like validity only superficially, with recall questions rather than essays requiring a deep evaluation of competing ideas. Or consider a screening test for a job in customer service that measures the extraversion of candidates but not their agreeableness. In each of these examples, the results may not support their intended interpretations and uses.

Validity encompasses all design considerations, administration procedures, and other aspects of the testing process that make score inferences useful and meaningful. Test results that are consistent and based on items written according to specified content standards, with appropriate levels of difficulty and discrimination, are more useful and meaningful than scores that do not have these qualities. Correct scaling, sound test construction, and rigorous statistical analysis are thus all prerequisites for validity.

This chapter begins with an overview of validity, including definitions of key terms and concepts as well as some historical perspective. Common sources of validity evidence are then discussed in detail, including:

  1. test content,
  2. response processes,
  3. dimensionality and internal structure,
  4. relationships with other variables, and
  5. consequences of test use.

Note that much of our discussion around these five sources will incorporate information presented in previous chapters. Test content and response processes will draw from Chapters 3 and 4. Dimensionality and internal structure will elaborate on information from Chapters 5 through 8. Relationships with other variables and consequences of test use are specific to this chapter and involve some new concepts.

These sources of validity evidence are discussed within what is referred to as a unified view of validity.

Learning objectives

  1. Define validity in terms of test score interpretation and use, and identify and describe examples of this definition in context.
  2. Compare and contrast three main sources of validity evidence [content, criterion, and construct], with examples of how each type is established, including the validation process involved with each.
  3. Explain the structure and function of a test outline, and how it is used to provide evidence of content validity.
  4. Calculate and interpret a validity coefficient, describing what it represents and how it supports criterion validity.
  5. Describe how unreliability can attenuate a correlation, and how to correct for attenuation in a validity coefficient.
  6. Identify appropriate sources of validity evidence for given testing applications and describe how certain sources are more appropriate than others for certain applications.
  7. Describe the unified view of validity and how it differs from and improves upon the traditional view of validity.
  8. Identify threats to validity, including features of a test, testing process, or score interpretation or use, that impact validity. Consider, for example, the issues of content underrepresentation and misrepresentation, and construct-irrelevant variance.

R analysis in this chapter is minimal. We’ll run correlations and make adjustments to them using the base R functions, and we’ll simulate scores using epmr.

# R setup for this chapter
library("epmr")
# Functions we'll use
# cor() from the base package
# subset() from the base package for subsetting data
# rsim() from epmr to simulate scores
# rstudy() from epmr for finding alpha

Overview of validity

Some context

Suppose you are conducting a research study on the efficacy of a reading intervention. Scores on a reading test will be compared for a treatment group who participated in the intervention and a control group who did not. A statistically significant difference in mean reading scores for the two groups will be taken as evidence of an effective intervention, suggesting its utility with the broader population. This is an inferential use of statistics, as defined in Chapter 1.

In measurement, we step back and evaluate the extent to which our mean scores for each group accurately reflect what they are intended to measure. On the surface, the means themselves may differ above and beyond statistical error. But if neither mean actually captures average reading ability for our target population, our results are misleading, and our intervention may not actually be as effective as we expect. Instead, it may appear effective, or not, because of error in our measurements.

Reliability, from Chapter 5, focuses on the consistency of measurement. With reliability, we estimate the amount of variability in scores that can be attributed to a reliable source, and, conversely, the amount of variability that can be attributed to an unreliable source, that is, random error. As consistency increases, we gain confidence in the stability and replicability of our measurement.

While reliability is essential, it does not tell us whether a reliable source of variability is the source we hope it is. This is the job of validity. With validity, we more comprehensively examine the quality of our items as individual components of the target construct. We examine other sources of variability in our scores, such as item and test bias. We also examine relationships between scores on our items and other measures of the target construct.

Within the context presented above, strong reliability evidence may indicate that repeated administrations of our reading test would produce similar means for our treatment and control groups. Strong validity evidence could then indicate, for example, that our reading test sufficiently covers the intended domain of content and skill while not disadvantaging certain populations of test takers. Validity evidence could also demonstrate that our test aligns with existing standards of instruction, and that it is sufficiently distinct from measures of broader cognitive ability and aptitude. As validity evidence increases, we gain confidence in the overall meaningfulness of our measurement and its effectiveness in informing decisions, such as in an intervention study.

Definitions

With some context for understanding how validity relates to other measurement concepts, we’re now ready for some formal definitions of terms. The Standards for Educational and Psychological Testing [AERA, APA, and NCME 2014] define validity as “the degree to which evidence and theory support the interpretations of test scores entailed by the proposed uses of a test.” This definition is simple, but very broad, potentially encompassing a wide range of evidence and theory. We’ll focus, as the Standards do, on three specific sources of validity information: information based on test content, other measures, and theoretical models.

Note that validity theory has evolved considerably since the early 1900s. The three main sources of validity evidence that we cover here were traditionally treated as separate “types” of validity, where a given instrument could be validated using one type but not another. This evolution is due in part to the increasing complexity of available psychometric methods, starting with correlation and an emphasis on relationships with criterion variables.

Recent literature on validity theory has clarified that tests and even test scores themselves are not valid or invalid. Instead, only score inferences and interpretations are valid or invalid [e.g., Kane 2013]. Tests are then described as being valid only for a particular use. This is a simple distinction in the definition of validity, but one that authors continue to highlight. Referring to a test or test score as valid implies that it is valid for any use, even though this is likely not the case. Shorthand is sometimes used to refer to tests themselves as valid, because it is simpler than distinguishing between tests, uses, and interpretations. However, the assumption is always that validity only applies to a specific test use and not broadly to the test itself.

Finally, Kane [2013] and others also clarify that validity is a matter of degree. It is established incrementally through an accumulation of supporting evidence. Validity is not inherent in a test, and it is not simply declared to exist by a test developer. Instead, data are collected and research is conducted to establish evidence supporting a test for a particular use. As this evidence builds, so does our confidence that test scores can be used for their intended purpose.

Validity examples

To evaluate the proposed score interpretations and uses for a test, and the extent to which they are valid, we should first examine the purpose of the test itself. As discussed in Chapters 2 and 3, a good test purpose articulates key information about the test, including what it measures [the construct], for whom [the intended population], and why [for what reason]. The question then becomes, given the quality of its contents, how they were constructed, and how they are implemented, is the test valid for this purpose?

As a first example, let’s return to the test of early literacy introduced in Chapter 2. Documentation for the test [www.myigdis.com] claims that,

myIGDIs are a comprehensive set of assessments for monitoring the growth and development of young children. myIGDIs are easy to collect, sensitive to small changes in children’s achievement, and mark progress toward a long-term desired outcome. For these reasons, myIGDIs are an excellent choice for monitoring English Language Learners and making more informed Special Education evaluations.

Different types of validity evidence would be needed to support the claims made for the IGDI measures. The comprehensiveness of the measures could be documented via test outlines that are based on a broad but well-defined content domain, and that are vetted by content experts, including teachers. Multiple test forms would be needed to monitor growth, and the quality and equivalence of these forms could be established using appropriate reliability estimates and measurement scaling techniques, such as Rasch modeling. Ease of data collection could be documented by the simplicity and clarity of the test manual and administration instructions, which could be evaluated by users, and the length and complexity of the measures. The sensitivity of the measures to small changes in achievement and their relevance to long-term desired outcomes could be documented using statistical relationships between IGDI scores and other measures of growth and achievement within a longitudinal study. Finally, all of these sources of validity evidence would need to be gathered both for English Language Learners and other target groups in special education. These various forms of information all fit into the sources of validity evidence discussed below.

As a second example, consider a test construct that interests you. What construct are you interested in measuring? Perhaps it is one construct measured within a larger research study? How could you measure this construct? What type of test are you going to use? And what types of score[s] from the test will be used to support decision making? Next, consider who is going to take this test. Be as specific as possible when identifying your target population, the individuals that your work or research focuses on. Finally, consider why these people are taking your test. What are you going to do with the test scores? What are your proposed score interpretations and uses? Having defined your test purpose, consider what type of evidence would prove that the test is doing what you intend it to do, or that the score interpretations and uses are what you intend them to be. What information would support your test purpose?

Sources of validity evidence

The information gathered to support a test purpose, and establish validity evidence for the intended uses of a test, is often categorized into three main areas of validity evidence. These are content, criterion, and construct validity. Nowadays, these are referred to as sources of validity evidence, where

  • content focuses on the test content and procedures for developing the test,
  • criterion focuses on external measures of the same target construct, and
  • construct focuses on the theory underlying the construct and includes relationships with other measures.

In certain testing situations, one source of validity evidence may be more relevant than another. However, all three are often used together to argue that the evidence supporting a test is “adequate.”

We will review each source of validity evidence in detail, and go over some practical examples of when one is more relevant than another. In this discussion, consider your own example, and examples of other tests you’ve encountered, and what type of validity evidence would support their use.

Content validity

According to Haynes, Richard, and Kubany [1995], content validity is “the degree to which elements of an assessment instrument are relevant to and representative of the targeted construct for a particular assessment purpose.” Note that this definition of content validity is very similar to our original definition of validity. The difference is that content validity focuses on elements of the construct and how well they are represented in our test. Thus, content validity assumes the target construct can be broken down into elements, and that we can obtain a representative sample of these elements.

Having defined the purpose of our test and the construct we are measuring, there are three main steps to establishing content validity evidence:

  1. Define the content domain based on relevant standards, skills, tasks, behaviors, facets, factors, etc. that represent the construct. The idea here is that our construct can be represented in terms of specific identifiable dimensions or components, some of which may be more relevant to the construct than others.
  2. Use the defined content domain to create a blueprint or outline for our test. The blueprint organizes the test based on the relevant components of the content domain, and describes how each of these components will be represented within the test.
  3. Subject matter experts evaluate the extent to which our test blueprint adequately captures the content domain, and the extent to which our test items will adequately sample from the content domain.

Here is an overview of how content validity could be established for the IGDI measures of early literacy. Again, the purpose of the test is to identify preschoolers in need of additional support in developing early literacy skills.

1. Define the content domain

The early literacy content domain is broken down into a variety of content areas, including alphabet principles [e.g., knowledge of the names and sounds of letters], phonemic awareness [e.g., awareness of the sounds that make up words], and oral language [e.g., definitional vocabulary]. The literature on early literacy has identified other important skills, but we’ll focus here on these three. Note that the content domain for a construct should be established both by research and practice.

2. Outline the test

Next, we map the portions of our test that will address each area of the content domain. The test outline can include information about the type of items used, the cognitive skills required, and the difficulty levels that are targeted, among other things. Review Chapter 4 for additional details on test outlines or blueprints.

Table 9.1 contains an example of a test outline for the IGDI measures. The three content areas listed above are shown in the first column. These are then broken down further into cognitive processes or skills. Theory and practical constraints determine reasonable numbers and types of test items or tasks devoted to each cognitive process in the test itself. The final column shows the percentage of the total test that is devoted to each area.

Table 9.1: Example Test Outline for a Measure of Early Literacy

  Content area             Cognitive process          Items   Weight
  Alphabet principles      Letter naming                 20      13%
                           Sound identification          20      13%
  Phonological awareness   Rhyming                       15      10%
                           Alliteration                  15      10%
                           Sound blending                10       7%
  Oral language            Picture naming                30      20%
                           Which one doesn't belong      20      13%
                           Sentence completion           20      13%
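
The weight column is just each row’s item count as a percentage of the total test length. As a quick check, the following base R lines reproduce the percentages using the item counts from Table 9.1:

# Item counts for each row of Table 9.1
items <- c(20, 20, 15, 15, 10, 30, 20, 20)
# Each weight is the row's count as a percentage of total test length
round(items / sum(items) * 100)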

3. Evaluate

Validity evidence requires that the test outline be representative of the content domain and appropriate for the construct and test purpose. The appropriateness of an outline is typically evaluated by content experts. In the case of the IGDI measures, these experts could be researchers in the area of early literacy, and teachers who work directly with students from the target population.

Licensure testing

Here is an example of content validity from the area of licensure/certification testing. I have consulted with an organization that develops and administers tests of medical imaging, including knowledge assessments taken by candidates for certification in radiography. This area provides a unique example of content validity, because the test itself measures a construct that is directly tied to professional practice. If practicing radiographers utilize a certain procedure, that procedure, or the knowledge required to perform it, should be included in the test.

The domain for a licensure/certification test such as this is defined using what is referred to as a job analysis or practice analysis [Raymond 2001]. A job analysis is a research study, the central feature of which is a survey sent to practitioners that lists a wide range of procedures and skills potentially used in the field. Respondents indicate how often they perform each procedure or use each skill on the survey. Procedures and skills performed by a high percentage of professionals are then included in the test outline. As in the previous examples, the final step in establishing content validity is having a select group of experts review the procedures and skills and their distribution across the test, as organized in the test outline.

Psychological measures

Content validity is relevant in non-cognitive psychological testing as well. Suppose the purpose of a test is to measure client experience with panic attacks so as to determine the efficacy of treatment. The domain for this test could be defined using criteria listed in the DSM-V [www.dsm5.org], reports about panic attack frequency, and secondary effects of panic attacks. The test outline would organize the number and types of items written to address all relevant criteria from the DSM-V. Finally, experts who work directly in clinical settings would evaluate the test outline to determine its quality, and their evaluation would provide evidence supporting the content validity of the test for this purpose.

Threats to content validity

When considering the appropriateness of our test content, we must also be aware of how content validity evidence can be compromised. What does content invalidity look like? For example, if our panic attack scores were not valid for a particular use, how would this lack of validity manifest itself in the process of establishing content validity?

Here are two main sources of content invalidity. First, if items reflecting domain elements that are important to the construct are omitted from our test outline, the construct will be underrepresented in the test. In our panic attack example, if the test does not include items addressing “nausea or abdominal distress,” other criteria, such as “fear of dying,” may have too much sway in determining an individual’s score. Second, if unnecessary items measuring irrelevant or tangential material are included, the construct will be misrepresented in the test. For example, if items measuring depression are included in the scoring process, the score itself is less valid as a measure of the target construct.

Together, these two threats to content validity lead to unsupported score inferences. Some worst-case-scenario consequences include misdiagnoses, failure to provide needed treatment, or the provision of treatment that is not needed. In licensure testing, the result can be the licensing of candidates who lack the knowledge, skills, and abilities required for safe and effective practice.

Criterion validity

Definition

Criterion validity is the degree to which test scores correlate with, predict, or inform decisions regarding another measure or outcome. If you think of content validity as the extent to which a test correlates with or corresponds to the content domain, criterion validity is similar in that it is the extent to which a test correlates with or corresponds to another test. So, in content validity we compare our test to the content domain, and hope for a strong relationship, and in criterion validity we compare our test to a criterion variable, and again hope for a strong relationship.

Validity by association

The keyword in this definition of criterion validity is correlate, which is synonymous with relate or predict. The assumption here is that the construct we are hoping to measure with our test is known to be measured well by another test or observed variable. This other test or variable is often referred to as a “gold standard,” a label presumably given to it because it is based on strong validity evidence. So, in a way, criterion validity is a form of validity by association. If our test correlates with a known measure of the construct, we can be more confident that our test measures the same construct.

The equation for a validity coefficient is the same as the equations for correlation that we encountered in previous chapters. Here we denote our test as X and the criterion variable as Y. The validity coefficient is the correlation between the two, which can be obtained as the covariance divided by the product of the individual standard deviations.

\[\begin{equation} \rho_{XY} = \frac{\sigma_{XY}}{\sigma_{X}\sigma_{Y}} \tag{9.1} \end{equation}\]
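
In R, Equation 9.1 is just the familiar Pearson correlation. As a quick illustration with two hypothetical score vectors, the covariance divided by the product of the standard deviations matches what cor() returns:

# Hypothetical test scores (x) and criterion scores (y)
x <- c(10, 12, 9, 14, 11, 13)
y <- c(52, 60, 45, 68, 55, 61)
# Validity coefficient via Equation 9.1
cov(x, y) / (sd(x) * sd(y))
# Same result using cor() directly
cor(x, y)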

Criterion validity is sometimes distinguished further as concurrent validity, where our test and the criterion are administered concurrently, or predictive validity, where our test is measured first and can then be used to predict the future criterion. The distinction is based on the intended use of scores from our test for predictive purposes.

Criterion validity is limited because it does not actually require that our test be a reasonable measure of the construct, only that it relate strongly with another measure of the construct. Nunnally and Bernstein [1994] clarify this point with a hypothetical example:

If it were found that accuracy in horseshoe pitching correlated highly with success in college, horseshoe pitching would be a valid measure for predicting success in college.

The scenario is silly, but it highlights the fact that, on its own, criterion validity is insufficient. The take-home message is that you should never use or trust a criterion relationship as your sole source of validity evidence.

There are two other challenges associated with criterion validity. First, finding a suitable criterion can be difficult, especially if your test targets a new or not well defined construct. Second, a correlation coefficient is attenuated, or reduced in strength, by any unreliability present in the two measures being correlated. So, if your test and the criterion test are unreliable, a low validity coefficient [the correlation between the two tests] may not necessarily represent a lack of relationship between the two tests. It may instead represent a lack of reliable information with which to estimate the criterion validity coefficient.
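
The standard correction for attenuation estimates the correlation we would expect between the two measures if both were perfectly reliable, by dividing the observed validity coefficient by the square root of the product of the two reliabilities:

\[\begin{equation} \rho_{T_X T_Y} = \frac{\rho_{XY}}{\sqrt{\rho_{XX'}\rho_{YY'}}} \end{equation}\]

Here, the terms in the denominator are reliability estimates for our test and for the criterion. Because each reliability is at most 1, the corrected coefficient is always at least as large as the observed one.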

Attenuation

Here’s a demonstration of how attenuation works, based on PISA. Suppose we find a gold standard criterion measure of reading ability, and administer it to students in the US who took the reading items in PISA09. First, we calculate a total score on the PISA reading items, then we compare it to some simulated test scores for our criterion test. Scores have been simulated to correlate at 0.80.

# Get the vector of reading item names
ritems
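
The remainder of this demonstration is cut off here. As a minimal sketch of the idea, the following base R code [illustrative only, not the chapter’s actual PISA09 analysis] simulates true scores that correlate at 0.80, adds measurement error to create two unreliable observed measures, and shows how the correction for attenuation recovers the underlying relationship.

# Illustrative sketch, not the chapter's PISA09 analysis
set.seed(1)
n <- 1000
# True scores on our test and the criterion, simulated to correlate at 0.80
true_x <- rnorm(n)
true_y <- 0.8 * true_x + rnorm(n, sd = sqrt(1 - 0.8^2))
# Observed scores add random error, so each measure is unreliable
x <- true_x + rnorm(n, sd = 0.6)
y <- true_y + rnorm(n, sd = 0.6)
# Observed validity coefficient, attenuated by unreliability in x and y
r_xy <- cor(x, y)
# Reliability of each measure: proportion of observed variance due to true score
r_xx <- var(true_x) / var(x)
r_yy <- var(true_y) / var(y)
# Corrected coefficient is close to the simulated true correlation of 0.80
r_xy / sqrt(r_xx * r_yy)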
