Who would be able to complete this discussion?

Chapter 4:

Course text:
Cascio, W. F., & Aguinis, H. (2019). Applied psychology in talent management (8th ed.). Retrieved from https://www.vitalsource.com

4 CRITERIA: DEFINITIONS, MEASURES, AND EVALUATION

Wayne F. Cascio, Herman Aguinis

Learning Goals

By the end of this chapter, you will be able to do the following:
· 4.1 Define and identify criteria—yardsticks used to assess employee success
· 4.2 Distinguish the static, dynamic, and individual dimensions of criteria and their implications
· 4.3 Define and measure contextual, task, and counterproductive behaviors
· 4.4 Develop criteria that address challenges such as job performance unreliability, unreliability in the observation of performance, and the multidimensionality of performance
· 4.5 Consider the importance of situational determinants of performance and develop and evaluate criteria using standards such as relevance, sensitivity, and practicality
· 4.6 Develop criteria that will minimize the detrimental impact of criterion deficiency and contamination and choose whether to use composite or multiple criteria
· 4.7 Distinguish observed from unobserved criteria and their antecedents
· 4.8 Consider nonnormal distributions of performance and their implications in terms of the presence and production of star performers
The development of criteria that are adequate and appropriate is at once a stumbling block and a challenge to the HR specialist. Behavioral scientists have bemoaned the “criterion problem” through the years. The term refers to the difficulties involved in the process of conceptualizing and measuring performance constructs that are multidimensional, dynamic, and appropriate for different purposes (Austin & Villanova, 1992). Yet the effectiveness and future progress of knowledge with respect to most HR policies and interventions depend fundamentally on our ability to resolve this baffling question.
The challenge is to develop theories, concepts, and measurements that will achieve the twin objectives of enhancing the utility of available procedures and programs and deepening our understanding of the psychological and behavioral processes involved in job performance. Ultimately, our goal is to develop a comprehensive theory of the performance of men and women at work (Campbell & Wiernik, 2015; Viswesvaran & Ones, 2000).
In the early days of applied psychology, according to Jenkins (1946), most researchers and practitioners tended to accept the tacit assumption that criteria were either given by God or just to be found lying about. It is regrettable that even today we often resort to the most readily available or most expedient criteria when, with a little more effort and thought, we could probably develop much better ones. Nevertheless, progress has been made as the field has come to recognize that criterion measures are samples of a larger performance universe and that as much effort should be devoted to understanding and validating criteria as is devoted to identifying predictors of them (Campbell & Wiernik, 2015). Wallace (1965) expressed the matter aptly when he said that the answer to the question “Criteria for what?” must certainly include “for understanding” (p. 417). Let’s begin by defining our terms.

Definition

Criteria have been defined from more than one point of view. From one perspective, criteria are standards that can be used as yardsticks for measuring employees’ degree of success on the job (Bass & Barrett, 1981; Guion, 1965; Landy & Conte, 2016). This definition is quite adequate within the context of personnel selection, placement, promotion, succession planning, and performance management. It is useful when prediction is involved—that is, in the establishment of a functional relationship between one variable, the predictor, and another variable, the criterion. However, there are times when we simply wish to evaluate without necessarily predicting. Suppose, for example, that the HR department is concerned with evaluating the effectiveness of a recruitment campaign aimed at attracting members of underrepresented groups (e.g., women for science, technology, engineering, and mathematics—STEM—positions). Various criteria must be used to evaluate the program adequately. The goal in this case is not prediction but rather evaluation. Fundamentally, one distinction between predictors and criteria is time (Mullins & Ratliff, 1979). For example, if evaluative standards such as written or performance tests are administered before an employment decision is made (i.e., to hire or to promote), the standards are labeled predictors. If evaluative standards are administered after an employment decision has been made (i.e., to evaluate performance effectiveness), the standards are labeled criteria.
This discussion leads to the conclusion that a more comprehensive definition is required, regardless of whether we are predicting or evaluating. As such, a more general definition is that a criterion represents something important or desirable. It is an operational statement of the goals or desired outcomes of the program under study (Astin, 1964). It is an evaluative standard that can be used to measure a person’s performance, attitude, motivation, and so forth (Blum & Naylor, 1968). Examples of criteria are presented in Table 4.1, which has been modified from those offered by Dunnette and Kirchner (1965), Guion (1965), and others (e.g., Aguinis, O’Boyle, Gonzalez-Mulé, & Joo, 2016; Brock, Martin, & Buckley, 2013). Although many of these measures often would fall short as adequate criteria, each of them deserves careful study in order to develop a comprehensive sampling of job or program performance. There are several other requirements of criteria in addition to desirability and importance, but before examining them, we must first consider the use of job performance as a criterion.

Table 4.1 Possible Measures of Criteria

Output Measures

Commission earnings
Dollar volume of sales
Number of candidates attracted (recruitment program)
Number of items sold
Number of letters typed
Number of new patents (or creative/innovative inventions and projects)
Number of publications in scientific journals
Readership of an advertisement
Units produced

Quality Measures

Cost of spoiled or rejected work
Number of complaints and dissatisfied persons (clients, customers, subordinates, colleagues)
Number of errors (coding, filing, bookkeeping, typing, diagnosing)
Number of errors detected (inspector, troubleshooter, service person)
Number of policy renewals (insurance sales)
Rate of scrap, reworks, or breakage

Lost Time

Employee turnover (individual-, team-, and unit-level turnover)
Frequency of cyberloafing
Frequency of non-work-related e-mail sent and received at work
Frequency of long coffee or smoke breaks taken without approval
Length and frequency of unauthorized pauses
Length of service
Number of discharges for cause
Number of occasions (or days) absent
Number of times tardy
Number of transfers due to unsatisfactory performance
Number of voluntary quits

Employability, Trainability, and Promotability

Length of time between promotions
Level of proficiency reached in a given time
Number of promotions in a specified time period
Number of times considered for promotion
Rate of salary increase
Time to reach standard performance

Ratings of Performance

Ratings of behavioral expectations
Ratings of performance in simulations and role-playing exercises
Ratings of performance in work samples
Ratings of personal traits or characteristics
Ratings of skills

Counterproductive Behaviors

Abuse toward others (e.g., bullying)
Disciplinary transgressions
Military desertion
Personal aggression
Political deviance
Property damage
Sabotage
Substance abuse
Theft

Job Performance as a Criterion
Based on the measures of criteria in Table 4.1, we see that performance can be defined as what people do or what people produce. Interestingly, although performance is one of the most central constructs in applied psychology, it may be defined in terms of behavior or results (DeNisi & Smith, 2014). For example, Campbell and Wiernik (2015), Aguinis (2019), and Beck, Beatty, and Sackett (2014) defined performance based on employee behaviors and actions—particularly those that are relevant to organizational goals. In fact, Campbell and Wiernik (2015) were quite forceful in their conclusion that “performance should be specified in behavioral terms as things that people do” (p. 67). By contrast, Minbashian and Luppino (2014), O’Boyle and Aguinis (2012), and Aguinis et al. (2016) defined performance in terms of results—what people produce. Given the coexistence of these definitions, it is not surprising that some researchers such as Viswesvaran and Ones (2000) defined performance as both behavior and results, as follows: “scalable actions, behavior and outcomes that employees engage in or bring about that are linked with and contribute to organizational goals” (p. 216).
Although there are different proponents for the behavior- and results-based definitions, the two are clearly related. For example, behaviors such as exerting more effort at work (i.e., behavior-based performance) are likely to result in more and better outcomes (i.e., results-based performance). In fact, the empirical evidence shows that these two types of performance are distinct but also related at nontrivial levels (e.g., Beal, Cohen, Burke, & McLendon, 2003; Bommer, Johnson, Rich, Podsakoff, & MacKenzie, 1995). So, the question is not whether to define performance as behaviors or results, but when and why to define performance one way or another (or both).
Some of the proponents of the behavior-based definition of performance believe that there is, nevertheless, value in defining it as results. For example, although Beck et al. (2014) adopted the behavior-based approach, they clarified that a results-based definition “may indeed serve many useful organizational and research purposes” (p. 534). Specifically, Aguinis and O’Boyle (2014) chose the results-based definition of performance because “a focus on results rather than behaviors is most appropriate when (a) workers are skilled in the needed behaviors, (b) behaviors and results are obviously related, and (c) there are many ways to do the job right” (p. 316). These researchers and others have adopted a results-based definition also because it plays a central role regarding organizational-level outcomes (Boudreau & Jesuthasan, 2011; Cascio & Boudreau, 2011a). In other words, in terms of assessing firm performance, we are more interested in what employees produce than in how they produce these results.
The term ultimate criterion (Thorndike, 1949) describes the full domain of performance and includes everything—all behaviors and results—that ultimately define success on the job. Such a criterion is ultimate in the sense that one cannot look beyond it for any further standard by which to judge performance. The ultimate criterion of a salesperson’s performance must include, for example, the quality of customer interactions; time spent with customers; knowledge of products; total sales volume; total number of new accounts brought in during a particular time period; amount of customer loyalty built up by the salesperson; total amount of his or her influence on the morale or sales records of other company salespersons; and overall effectiveness in planning activities and calls, controlling expenses, and handling necessary reports and records. In short, the ultimate criterion is a concept that is strictly conceptual and, therefore, cannot be measured or observed; it embodies the notion of “true,” “total,” “long-term,” or “ultimate worth” to the employing organization.
Although the ultimate criterion is stated in broad terms that often are not susceptible to quantitative evaluation, it is an important construct because the relevance of any operational criterion measure and the factors underlying its selection are better understood if the conceptual stage is clearly and thoroughly documented (Astin, 1964).
Dimensionality of Criteria
Operational measures of the conceptual criterion may vary along several dimensions. In a classic article, Ghiselli (1956) identified three different types of criterion dimensionality: static, dynamic, and individual dimensionality. We examine each of these three types of dimensionality next.
Static Dimensionality
If we observe job performance at any single point in time, we find that it is multidimensional in nature. This type of multidimensionality refers to two issues: (1) Individuals may be high on one performance facet and simultaneously low on another, and (2) a distinction is needed between maximum and typical performance.
Regarding the various performance facets, Rush (1953) found that a number of relatively independent skills are involved in selling. Thus, a salesperson’s learning aptitude (as measured by sales school grades and technical knowledge) is unrelated to objective measures of his or her achievement (such as average monthly volume of sales or percentage of quota achieved), which, in turn, is independent of the salesperson’s general reputation (e.g., planning of work, rated potential value to the firm), which, in turn, is independent of his or her sales techniques (e.g., sales approaches, interest and enthusiasm).
In broader terms, we can consider two general facets of performance: task performance and contextual performance (Borman & Motowidlo, 1997). Contextual performance has also been labeled pro-social behaviors or organizational citizenship performance (Borman, Brantley, & Hanson, 2014). Task performance and contextual performance do not necessarily go hand in hand (Bergman, Donovan, Drasgow, Overton, & Henning, 2008). An employee can be highly proficient at her task, but be an underperformer with regard to contextual performance (Bergeron, 2007). Task performance is defined as (a) activities that transform raw materials into the goods and services that are produced by the organization and (b) activities that help with the transformation process by replenishing the supply of raw materials; distributing its finished products; or providing important planning, coordination, supervising, or staff functions that enable it to function effectively and efficiently (Cascio & Aguinis, 2001). Contextual performance is defined as those behaviors that contribute to the organization’s effectiveness by providing a good environment in which task performance can occur. Contextual performance includes behaviors such as the following:
· Persisting with enthusiasm and exerting extra effort as necessary to complete one’s own task activities successfully (e.g., being punctual and rarely absent, expending extra effort on the job)
· Volunteering to carry out task activities that are not formally part of the job (e.g., suggesting organizational improvements, making constructive suggestions)
· Helping and cooperating with others (e.g., assisting and helping coworkers and customers)
· Following organizational rules and procedures (e.g., following orders and regulations, respecting authority, complying with organizational values and policies)
· Endorsing, supporting, and defending organizational objectives (e.g., exhibiting organizational loyalty, representing the organization favorably to outsiders)
Researchers have more recently identified what some consider to be the “dark side” of contextual performance, often labeled workplace deviance or counterproductive behaviors (Marcus, Taylor, Hastings, Sturm, & Weigelt, 2016; Spector et al., 2006). Although contextual performance and workplace deviance are seemingly at the opposite ends of the same continuum, evidence suggests that they are distinct from each other (Judge, LePine, & Rich, 2006; Kelloway, Loughlin, Barling, & Nault, 2002). In general, workplace deviance is defined as voluntary behavior that violates organizational norms and thus threatens the well-being of the organization, its members, or both (Robinson & Bennett, 1995). Vardi and Weitz (2004) identified over 100 such “organizational misbehaviors” (e.g., alcohol/drug abuse, belittling opinions, breach of confidentiality), and several scales are available to measure workplace deviance based on self- and other reports (Bennett & Robinson, 2000; Blau & Andersson, 2005; Hakstian, Farrell, & Tweed, 2002; Kelloway et al., 2002; Marcus, Schuler, Quell, & Hümpfner, 2002; Spector et al., 2006; Stewart, Bing, Davison, Woehr, & McIntyre, 2009). Some of the self-reported deviant behaviors measured by these scales are the following:
· Exaggerating hours worked
· Falsifying a receipt to get reimbursed for more money than was spent on business expenses
· Starting negative rumors about the company
· Gossiping about coworkers
· Covering up one’s mistakes
· Competing with coworkers in an unproductive way
· Gossiping about one’s supervisor
· Staying out of sight to avoid work
· Taking company equipment or merchandise
· Blaming one’s coworkers for one’s mistakes
· Intentionally working slowly or carelessly
· Being intoxicated during working hours
· Seeking revenge on coworkers
· Presenting colleagues’ ideas as if they were one’s own
Regarding the typical versus maximum performance distinction, typical performance refers to the average level of an employee’s performance, whereas maximum performance refers to the peak level of performance an employee can achieve (DuBois, Sackett, Zedeck, & Fogli, 1993; Sackett, Zedeck, & Fogli, 1988). Employees are more likely to perform at maximum levels when they understand they are being evaluated, when they accept instructions to maximize performance on the task, and when the task is of short duration. A meta-analysis that included 42 studies and a total sample size of 4,129 workers found that the average observed correlation between measures of maximum performance (i.e., what employees can do) with measures of typical performance (i.e., what employees will do) is only .33 (Beus & Whitman, 2012). The distinction between maximum and typical performance is a fairly new development, so research regarding this topic is still nascent. Nevertheless, based on the few studies available, the meta-analytic evidence shows that general mental ability is more strongly correlated to maximum performance (r = .25) than typical performance (r = .16).
Unfortunately, research and HR practices on criteria frequently ignore the fact that job performance often includes many facets that are relatively independent, such as task and contextual performance and the important distinction between typical and maximum performance. Because of this, employee performance is often not captured and described adequately. In addition, to capture the performance domain in a more exhaustive manner, researchers should also pay attention to the temporal dimensionality of criteria.
Dynamic or Temporal Dimensionality
Once we have defined clearly our conceptual criterion, we must then specify and refine operational measures of criterion performance (i.e., the measures actually to be used). Regardless of the operational form of the criterion measure, it must be taken at some point in time. When is the best time for criterion measurement? Optimum times vary greatly from situation to situation, and conclusions therefore need to be couched in terms of when criterion measurements were taken. Far different results may occur depending on when criterion measurements were taken (Weitz, 1961), and failure to consider the temporal dimension may lead to misinterpretations.
In predicting the short- and long-term success and survival of life insurance agents, for example, ability as measured by standardized tests is significant in determining early sales success, but interests and personality factors play a more important role later on (Ferguson, 1960). The same is true for accountants (Bass & Barrett, 1981). Thus, after two years as a staff accountant with one of the major accounting firms, interpersonal skills with colleagues and clients are more important than pure technical expertise for continued success. In short, criterion measurements are not independent of time.
Temporal dimensionality is a broad concept because criteria may be “dynamic” in three distinct ways: (1) changes over time in average levels of group performance, (2) changes over time in validity coefficients, and (3) changes over time in the rank ordering of scores on the criterion (Barrett, Caldwell, & Alexander, 1985).
Regarding changes in group performance over time, Ghiselli and Haire (1960) followed the progress of a group of investment salespeople for 10 years. During this period, they found a 650% improvement in average productivity, and still there was no evidence of leveling off! However, this increase was based only on those salespeople who survived on the job for the full 10 years; it was not true of all salespeople in the original sample. To be able to compare the productivity of the salespeople, their experience must be the same, or else it must be equalized in some manner (Ghiselli & Brown, 1955). Indeed, a considerable amount of other research evidence cited by Barrett, Caldwell, and Alexander (1985) does not indicate that average productivity improves significantly over lengthy time spans.
Criteria also might be dynamic if the relationship between predictor (e.g., preemployment test scores) and criterion scores (e.g., supervisory ratings) fluctuates over time (e.g., Jansen & Vinkenburg, 2006). Bass (1962) found this to be the case in a 42-month investigation of salespeople’s rated performance. He collected scores on three ability tests, as well as peer ratings on three dimensions, for a sample of 99 salespeople. Semiannual supervisory merit ratings served as criteria. The results showed patterns of validity coefficients for both the tests and the peer ratings that appeared to fluctuate erratically over time. However, he reached a much different conclusion when he tested the validity coefficients statistically. He found no significant differences for the validities of the ability tests, and when peer ratings were used as predictors, only 16 out of 84 pairs of validity coefficients (roughly 20%) showed a statistically significant difference (Barrett et al., 1985).
Researchers have suggested two hypotheses to explain why validities might change over time. One, the changing task model, suggests that although the relative amounts of ability possessed by individuals remain stable over time, criteria for effective performance might change in importance. Hence, the validity of predictors of performance also might change. The second model, known as the changing subjects model, suggests that although specific abilities required for effective performance remain constant over time, each individual’s level of skills and ability changes over time, and that is why validities might fluctuate (Henry & Hulin, 1987). Neither model has received unqualified support.
The third type of criteria dynamism addresses possible changes in the rank ordering of scores over time. This form of dynamic criteria has attracted substantial attention (e.g., Hofmann, Jacobs, & Baratta, 1993; Hulin, Henry, & Noon, 1990) because of the implications for the conduct of validation studies and personnel selection in general. If the rank ordering of individuals on a criterion changes over time, future performance becomes a moving target. Under those circumstances, it becomes progressively more difficult to predict performance accurately the farther out in time from the original assessment. Do performance levels show systematic fluctuations across individuals? The answer seems to be yes because the preponderance of evidence suggests that prediction deteriorates over time (Keil & Cortina, 2001). Overall, correlations among performance measures collected over time show what is called a “simplex” pattern of higher correlations among adjacent pairs and lower correlations among measures taken at greater time intervals (e.g., the correlation between month 1 and month 2 is greater than the correlation between month 1 and month 5) (Steele-Johnson, Osburn, & Pieper, 2000).
Deadrick and Madigan (1990) collected weekly performance data from three samples of sewing machine operators (i.e., a routine job in a stable work environment). Results showed the simplex pattern such that correlations between performance measures over time were smaller when the time lags increased. Deadrick and Madigan concluded that relative performance is not stable over time. A similar conclusion was reached by Hulin et al. (1990), Hofmann et al. (1993), and Keil and Cortina (2001): Individuals seem to change their rank order of performance over time (see Figure 4.1). In other words, there are meaningful differences in intraindividual patterns of changes in performance across individuals, and these differences are also likely to be reflected in how individuals evaluate the performance of others (Reb & Cropanzano, 2007).
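As a concrete illustration of the simplex pattern, the following sketch simulates monthly performance scores as a first-order autoregressive process and prints the month-by-month correlation matrix. The data and the stability parameter are hypothetical assumptions for illustration only, not estimates from the studies cited above.

```python
import numpy as np

rng = np.random.default_rng(42)
n_workers, n_months, rho = 500, 6, 0.8  # rho: assumed month-to-month stability

# Simulate performance as a first-order autoregressive process: each month's
# score carries over part of the prior month's score plus new, transient variance.
perf = np.zeros((n_workers, n_months))
perf[:, 0] = rng.standard_normal(n_workers)
for t in range(1, n_months):
    perf[:, t] = rho * perf[:, t - 1] + np.sqrt(1 - rho**2) * rng.standard_normal(n_workers)

r = np.corrcoef(perf, rowvar=False)
print(np.round(r, 2))
# Adjacent months (e.g., r[0, 1]) correlate more highly than distant months
# (e.g., r[0, 5]) -- the "simplex" pattern described in the text.
```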

Figure 4.1 Regression Lines for Three Ordinary Least Squares Clusters of Insurance Agents—Low, Moderate, and High Performers—Over Three Years

Source: Hofmann, D. A., Jacobs, R., & Baratta, J. E. (1993). Dynamic criteria and the measurement of change. Journal of Applied Psychology, 78, 194–204.
The recent development of wearable sensors and other technological advancements allows applied researchers to capture individuals’ fluctuations in performance over time more precisely (Chaffin et al., 2017; Tomczak, Lanzo, & Aguinis, 2018). Wearable sensors are mobile devices containing electronic components that are able to gather real-time data on the device-bearing person and his or her context (similar to devices such as Fitbit and Jawbone). For example, individuals can carry a smartphone fitted with a microphone and Bluetooth modules that can generate data including ambient sound and proximity to other devices and interactions with other people (e.g., customers, coworkers). A wearable sensor can also record an employee’s location via GPS. Also, individuals can receive a text message asking them to answer questions about where they are and what they are doing using their smartphones—and that information can be correlated with physiological markers (e.g., blood pressure, heart rate) and affective and attitudinal variables (e.g., satisfaction, emotions) (Beal, 2015).
Taken together, these technologies allow organizations to implement employee monitoring systems that capture employee performance on an ongoing basis and can capture performance fluctuations, and possible reasons and outcomes of those fluctuations, on a monthly, weekly, daily, and even hourly basis. Although still in their infancy, these technological advances are opening up entire new research avenues regarding what is labeled intraindividual performance fluctuations, or a within-person performance analysis. Wearable sensors allow for the collection of big data (i.e., as has been done in other fields such as computer science and genomics) that was unthinkable just a few years ago (Harlow & Oswald, 2016). These new types of measures can help us understand whether fluctuations, for example, in daily organizational citizenship behaviors, are related to factors related to the work per se or to the job’s social environment (e.g., Spence, Ferris, Brown, & Heller, 2012).
Overall, a major conclusion from recent research efforts is that within-person variability in performance is not necessarily the result of faulty measures (Dalal, Bhave, & Fiset, 2014). Accordingly, an important question posed by these findings is, How can we possibly predict performance if it is a moving target? The answer is that, in predicting performance, it is necessary to take the time dimension into account. Specifically, the goal then is to predict performance within a prespecified time span. For example, there is a need to understand which types of measures predict short- versus mid- versus long-term performance.
Individual Dimensionality
It is possible that individuals performing the same job may be considered equally good, yet the nature of their contributions to the organization may be quite different. Thus, different criterion dimensions should be used to evaluate them. Kingsbury (1933) recognized this problem almost 90 years ago when he wrote:
Some executives are successful because they are good planners, although not successful directors. Others are splendid at coordinating and directing, but their plans and programs are defective. Few executives are equally competent in both directions. Failure to recognize and provide, in both testing and rating, for this obvious distinction is, I believe, one major reason for the unsatisfactory results of most attempts to study, rate, and test executives. Good tests of one kind of executive ability are not good tests of the other kind. (p. 123)
Although in the managerial context described by Kingsbury there is only one job, it might plausibly be argued that in reality there are two (i.e., directing and planning). The two jobs are qualitatively different only in a psychological sense. In fact, the study of individual criterion dimensionality is a useful means of determining whether the same job, as performed by different people, is psychologically the same or different.
Challenges in Criterion Development
Competent criterion research is one of the most pressing needs of personnel psychology today—as it has been in the past. Stuit and Wilson (1946) demonstrated that continuing attention to the development of better performance measures results in better predictions of performance. The validity of these results has not been dulled by time (Viswesvaran & Ones, 2000). In this section, therefore, we consider three types of challenges faced in the development of criteria, point out potential pitfalls in criterion research, and sketch a logical scheme for criterion development.
At the outset, it is important to set certain “chronological priorities.” First, criteria must be developed and analyzed, for only then can predictors be constructed or selected to predict relevant criteria. Far too often, unfortunately, predictors are selected carefully, followed by a hasty search for “predictable criteria.” To be sure, if we switch criteria, the validities of the predictors will change, but the reverse is hardly true. Pushing the argument to its logical extreme, if we use predictors with no criteria, we will never know whether or not we are selecting those individuals who are most likely to succeed. Observe the chronological priorities! At least in this process we know that the chicken comes first and then the egg follows.
In the sections that follow, we address four basic challenges in criterion development (Ronan & Prien, 1966, 1971): reliability of performance, reliability of performance observation, dimensionality of performance, and modification of performance by situational characteristics. Let’s consider the first three in turn. The fourth is the focus of the section “Performance and Situational Characteristics.”
Challenge #1: Job Performance (Un)Reliability
Job performance reliability is a fundamental consideration in HR research and practice, and its assumption is implicit in all predictive studies. Reliability in this context refers to the consistency or stability of job performance over time. Are the best (or worst) performers at time 1 also the best (or worst) performers at time 2? As noted in the previous section, the rank order of individuals based on job performance scores does not necessarily remain constant over time.
Thorndike (1949) identified two types of unreliability—intrinsic and extrinsic—that may shed some light on the issue. Intrinsic unreliability is due to personal inconsistency in performance, whereas extrinsic unreliability is due to sources of variability that are external to job demands or individual behavior. Examples of the latter include variations in weather conditions (e.g., for outside construction work); unreliability due to machine downtime; and, in the case of interdependent tasks, delays in supplies, assemblies, or information. Much extrinsic unreliability is due to careless observation or poor control.
Faced with all of these potential confounding factors, what can be done? One solution is to aggregate (average) behavior over situations or occasions, thereby canceling out the effects of incidental, uncontrollable factors. To illustrate this, Epstein (1979, 1980) conducted four studies, each of which sampled behavior on repeated occasions over a period of weeks. Data in the four studies consisted of self-ratings, ratings by others, objectively measured behaviors, responses to personality inventories, and psychophysiological measures such as heart rate. The results provided unequivocal support for the hypothesis that stability can be demonstrated over a wide range of variables so long as the behavior in question is averaged over a sufficient number of occurrences. Once adequate performance reliability was obtained, evidence for validity emerged in the form of statistically significant relationships among variables. Similarly, Martocchio, Harrison, and Berkson (2000) found that increasing aggregation time enhanced the size of the validity coefficient between the predictor, employee lower-back pain, and the criterion, absenteeism.
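The logic of aggregation can be illustrated with a brief simulation. The data, noise level, and numbers of occasions below are hypothetical assumptions; the point is simply that averaging over more occasions cancels out incidental variation, so the stability (split-half reliability) of the aggregated scores climbs toward 1.0.

```python
import numpy as np

rng = np.random.default_rng(0)
n_employees = 300
true_score = rng.standard_normal(n_employees)  # each employee's stable performance level

def split_half_reliability(n_days, noise_sd=2.0):
    """Correlate two independent averages, each based on n_days noisy observations."""
    half_a = true_score[:, None] + noise_sd * rng.standard_normal((n_employees, n_days))
    half_b = true_score[:, None] + noise_sd * rng.standard_normal((n_employees, n_days))
    return np.corrcoef(half_a.mean(axis=1), half_b.mean(axis=1))[0, 1]

for n_days in (1, 5, 20, 60):
    print(n_days, round(split_half_reliability(n_days), 2))
# Reliability rises as behavior is averaged over more occasions, mirroring the
# Spearman-Brown logic behind Epstein's findings.
```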
Two further points bear emphasis. One, there is no shortcut for aggregating over occasions or people. In both cases, it is necessary to sample adequately the domain over which one wishes to generalize. Two, whether aggregation is carried out within a single study or over a sample of studies, it is not a panacea. Certain systematic effects, such as sex, race, or attitudes of raters, may bias an entire group of studies (Rosenthal & Rosnow, 1991). Examining large samples of studies through the techniques of meta-analysis (see Chapter 7) is one way of detecting the existence of such variables.
It also seems logical to expect that broader levels of aggregation might be necessary in some situations but not in others. Specifically, Rambo, Chomiak, and Price (1983) examined what Thorndike (1949) labeled extrinsic unreliability and showed that the reliability of performance data is a function both of task complexity and of the constancy of the work environment. These factors, along with the general effectiveness of an incentive system (if one exists), interact to create the conditions that determine the extent to which performance is consistent over time. Rambo et al. (1983) obtained weekly production data over a three-and-a-half-year period from a group of women who were sewing machine operators and a group of women in folding and packaging jobs. Both groups of operators worked under a piece-rate payment plan. Median correlations in week-to-week (not day-to-day) output rates were sewing = .94; nonsewing = .98. Among weeks separated by one year, they were sewing = .69; nonsewing = .86. Finally, when output in week 1 was correlated with output in week 178, the correlations obtained were still high: sewing = .59; nonsewing = .80. These are extraordinary levels of consistency, indicating that the presence of a production-linked wage incentive, coupled with stable, narrowly routinized work tasks, can result in high levels of consistency in worker productivity. Those individuals who produced much (little) initially also tended to produce much (little) at a later time. More recent results for a sample of foundry chippers and grinders paid under an individual incentive plan over a six-year period were generally consistent with those of the Rambo et al. (1983) study (Vinchur, Schippmann, Smalley, & Rothe, 1991), although there may be considerable variation in long-term reliability as a function of job content.

Challenge #2: Reliability of Job Performance Observation
The issue of reliability of job performance observation is crucial in prediction because all evaluations of performance usually depend on observation of one sort or another, but different methods of observing performance may lead to markedly different conclusions, as was shown by Bray and Campbell (1968). In an attempt to validate assessment center predictions of future sales potential, 78 men were hired as salespeople, regardless of their performance at the assessment center (we discuss the topic of the assessment center in detail in Chapter 13). Predictions then were related to field performance six months later. Field performance was assessed in two ways. In the first method, a trained independent auditor accompanied each man in the field on as many visits as were necessary to determine whether he did or did not meet accepted standards in conducting his sales activities. The field reviewer was unaware of any judgments made of the candidates at the assessment center. In the second method, each individual was rated by his sales supervisor and his trainer from sales training school. Both the supervisor and the trainer also were unaware of the assessment center predictions.
Although assessment center predictions correlated .51 with field performance ratings, there were no significant relationships between assessment center predictions and either supervisors’ ratings or trainers’ ratings. Additionally, there were no significant relationships between the field performance ratings and the supervisors’ or trainers’ ratings! The lesson to be drawn from this study is obvious: The study of reliability of performance becomes possible only when the reliability of judging performance is adequate (Ryans & Fredericksen, 1951). Unfortunately, although we know that the problem exists, there is no silver bullet that will improve the reliability of judging performance (Borman & Hallam, 1991). We examine this issue in greater detail, including some promising new approaches, in Chapter 5.
Challenge #3: Dimensionality of Job Performance
Even the most cursory examination of HR practices reveals a great variety of predictors typically in use. Several reviews (Campbell & Wiernik, 2015; Ronan & Prien, 1966, 1971) concluded that the notion of a unidimensional measure of job performance (even for lower level jobs) is unrealistic. Analyses of even single measures of job performance (e.g., attitude toward the company, absenteeism) have shown that they are much more complex than surface appearance would suggest. Despite the problems associated with global criteria, they seem to “work” quite well in most personnel selection situations. However, to the extent that one needs to solve a specific problem (e.g., too many customer complaints about product quality), a more specific criterion is needed. If there is more than one specific problem, then more than one specific criterion is called for (Guion, 1987).

Performance and Situational Characteristics
Most people would agree readily that individual levels of performance may be affected by conditions surrounding the performance. Yet most research investigations are conducted without regard for possible effects of variables other than those measured by predictors. In this section, therefore, we examine six possible extraindividual influences on performance. Taken together, the discussion of these influences is part of what Cascio and Aguinis (2008b) defined as in situ performance: “the specification of the broad range of effects—situational, contextual, strategic, and environmental—that may affect individual, team, or organizational performance” (p. 146). A consideration of in situ performance involves context—situational opportunities and constraints that affect the occurrence and meaning of behavior in organizations—as well as functional relationships between variables.
Environmental and Organizational Characteristics
Both absenteeism and turnover have been related to a variety of environmental and organizational characteristics (Allen & Vardaman, 2017; Dineen, Noe, Shaw, Duffy, & Wiethoff, 2007; McEvoy & Cascio, 1987; Sun, Aryee, & Law, 2007). These include organizational factors (e.g., pay and promotion policies, human resources practices); interpersonal factors (e.g., group cohesiveness, friendship opportunities, satisfaction with peers or supervisors); job-related factors (e.g., role clarity, task repetitiveness, autonomy, responsibility); and personal factors (e.g., age, tenure, mood, family size). Shift work is another frequently overlooked variable (Barton, 1994; Staines & Pleck, 1984). Clearly, organizational characteristics can have wide-ranging effects on performance.
Environmental Safety
Injuries and loss of time may also affect job performance (Probst, Brubaker, & Barsotti, 2008). Factors such as a positive safety climate, a high management commitment, and a sound safety communications program that incorporates goal setting and knowledge of results tend to increase safe behavior on the job (Reber & Wallin, 1984) and conservation of scarce resources (cf. Siero, Boon, Kok, & Siero, 1989). These variables can be measured reliably (Zohar, 1980) and can then be related to individual performance. Overall, environmental safety is affected by factors originating at the individual and organizational levels (Hofmann, Burke, & Zohar, 2017).
Lifespace Variables
Lifespace variables measure important conditions that surround the employee both on and off the job. They describe the individual employee’s interactions with organizational factors, task demands, supervision, and conditions of the job. Vicino and Bass (1978) used four lifespace variables—task challenge on first job assignment, life stability, supervisor–subordinate personality match, and immediate supervisor’s success—to improve predictions of management success at Exxon. The four variables accounted for an additional 22% of the variance in success on the job over and above Exxon’s own prediction system based on aptitude and personality measures. The equivalent of a multiple R of .79 was obtained. Other lifespace variables, such as personal orientation, career confidence, cosmopolitan versus local orientation, and job stress, deserve further study (Cooke & Rousseau, 1983; Edwards & Van Harrison, 1993).
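A minimal sketch of the underlying hierarchical-regression logic follows. The data and regression weights are hypothetical, not Vicino and Bass's, but the sketch shows how the incremental variance explained by lifespace variables over baseline predictors would be computed.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400

# Hypothetical scores: two baseline predictors (e.g., aptitude, personality) and
# four lifespace variables; the criterion loads on both sets by construction.
baseline = rng.standard_normal((n, 2))
lifespace = rng.standard_normal((n, 4))
criterion = (baseline @ np.array([0.5, 0.3])
             + lifespace @ np.array([0.3, 0.2, 0.2, 0.1])
             + rng.standard_normal(n))

def r_squared(X, y):
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1 - resid.var() / y.var()

r2_base = r_squared(baseline, criterion)
r2_full = r_squared(np.column_stack([baseline, lifespace]), criterion)
print(f"baseline R2 = {r2_base:.2f}, full R2 = {r2_full:.2f}, increment = {r2_full - r2_base:.2f}")
```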

Job and Location
Schneider and Mitchel (1980) developed a comprehensive set of six behavioral job functions for the agency manager’s job in the life insurance industry. Using 1,282 managers from 50 companies, they examined the relationship of activity in these functions with five factors: origin of the agency (new versus established), type of agency (independent versus company controlled), number of agents, number of supervisors, and tenure of the agency manager. These five situational variables were chosen as correlates of managerial functions on the basis of their traditionally implied impact on managerial behavior in the life insurance industry. The most variance explained in a job function by a weighted composite of the five situational variables was 8.6% (i.e., for the general management function). Thus, over 90% of the variance in the six agency-management functions lies in sources other than the five variables used. Although situational variables have been found to influence managerial job functions across technological boundaries, the results of this study suggest that situational characteristics also may influence managerial job functions within a particular technology. Performance thus depends not only on job demands but also on other structural and contextual factors such as the policies and practices of particular companies.
Extraindividual Differences and Sales Performance
Cravens and Woodruff (1973) recognized the need to adjust criterion standards for influences beyond a salesperson’s control, and they attempted to determine the degree to which these factors explained variations in territory performance. In a multiple regression analysis using dollar volume of sales as the criterion, a curvilinear model yielded a corrected R2 of .83, with sales experience, average market share, and performance ratings providing the major portion of explained variation. This study is noteworthy because a purer estimate of individual job performance was generated by combining the effects of extraindividual influences (territory workload, market potential, company market share, and advertising effort) with two individual-difference variables (sales experience and rated sales effort).
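One way to approximate such a "purer" estimate is to residualize the criterion on the extraindividual factors before relating it to individual differences. The sketch below uses hypothetical data and is not Cravens and Woodruff's model; it simply illustrates the adjustment logic.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 250

# Hypothetical territory data: dollar sales driven partly by factors beyond the
# salesperson's control (market potential, workload) and partly by the person.
market_potential = rng.standard_normal(n)
workload = rng.standard_normal(n)
individual_effort = rng.standard_normal(n)
sales = 0.6 * market_potential + 0.4 * workload + 0.5 * individual_effort + 0.3 * rng.standard_normal(n)

# Regress sales on the extraindividual factors and keep the residual as an
# adjusted criterion that more purely reflects individual contribution.
X = np.column_stack([np.ones(n), market_potential, workload])
beta, *_ = np.linalg.lstsq(X, sales, rcond=None)
adjusted_criterion = sales - X @ beta

print(round(np.corrcoef(individual_effort, sales)[0, 1], 2))               # raw criterion
print(round(np.corrcoef(individual_effort, adjusted_criterion)[0, 1], 2))  # adjusted criterion
```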
Leadership
The effects of leadership and situational factors on morale and performance have been well documented (Detert, Treviño, Burris, & Andiappan, 2007; Srivastava, Bartol, & Locke, 2006). These studies, as well as those cited previously, demonstrate that variations in job performance are due to characteristics of individuals (age, sex, job experience, etc.), groups, and organizations (size, structure, management behavior, etc.). Until we can begin to partition the total variability in job performance into intraindividual and extraindividual components, we should not expect predictor variables measuring individual differences to correlate appreciably with measures of performance that are influenced by factors not under an individual’s control.
Steps in Criterion Development
Given the previous discussion, we can now describe a five-step procedure for criterion development as outlined by Guion (1961):
1. Analysis of job and/or organizational needs.
2. Development of measures of actual behavior relative to expected behavior as identified in job and need analysis. These measures should supplement objective measures of organizational outcomes such as turnover, absenteeism, and production.
3. Identification of criterion dimensions underlying such measures by factor analysis, cluster analysis, or pattern analysis.
4. Development of reliable measures, each with high construct validity, of the elements so identified.
5. Determination of the predictive validity of each independent variable (predictor) for each one of the criterion measures, taking them one at a time.
In step 2, behavior data are distinguished from result-of-behavior data or organizational outcomes and it is recommended that behavior data supplement result-of-behavior data. In step 4, construct-valid measures are advocated. Construct validity is essentially a judgment that a test or other predictive device does, in fact, measure a specified attribute or construct to a significant degree and that it can be used to promote the understanding or prediction of behavior (Landy & Conte, 2016; Messick, 1995). These two poles, utility (i.e., in which the researcher attempts to find the highest and therefore most useful validity coefficient) versus understanding (in which the researcher advocates construct validity), have formed part of the basis for an enduring controversy in psychology over the relative merits of the two approaches. We examine this controversy in greater detail in the section “Composite Criterion Versus Multiple Criteria.”
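Step 3 of Guion's procedure is often carried out with exploratory techniques such as factor or principal components analysis. The sketch below uses hypothetical criterion measures and loadings; it shows only the basic idea that correlated criterion measures can be reduced to a smaller number of underlying dimensions.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500

# Hypothetical criterion measures: three tap an "output" dimension, three a
# "contextual/citizenship" dimension.
output_factor = rng.standard_normal(n)
context_factor = rng.standard_normal(n)
measures = np.column_stack([
    output_factor + 0.5 * rng.standard_normal(n),   # units produced
    output_factor + 0.5 * rng.standard_normal(n),   # sales volume
    output_factor + 0.5 * rng.standard_normal(n),   # error rate (reverse scored)
    context_factor + 0.5 * rng.standard_normal(n),  # helping coworkers
    context_factor + 0.5 * rng.standard_normal(n),  # volunteering
    context_factor + 0.5 * rng.standard_normal(n),  # following procedures
])

corr = np.corrcoef(measures, rowvar=False)
eigenvalues, _ = np.linalg.eigh(corr)
print(np.round(eigenvalues[::-1], 2))
# Two eigenvalues greater than 1 suggest two underlying criterion dimensions.
```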
Evaluating Criteria
How can we evaluate the usefulness of a given criterion? Let’s discuss each of three different yardsticks: relevance, sensitivity or discriminability, and practicality.
Relevance
The principal requirement of any criterion is its judged relevance (i.e., it must be logically related to the performance domain in question). As noted in Principles for the Validation and Use of Personnel Selection Procedures (Society for Industrial and Organizational Psychology, 2018), “A relevant criterion is one that reflects the relative standing of employees with respect to an outcome critical to success in the focal work environment” (p. 14). Hence, it is essential that this domain be described clearly.
Indeed, the American Psychological Association (APA) Task Force on Employment Testing of Minority Groups (1969) specifically emphasized that the most appropriate (i.e., logically relevant) criterion for evaluating tests is a direct measure of the degree of job proficiency developed by an employee after an appropriate period of time on the job (e.g., six months to a year). To be sure, the most relevant criterion measure will not always be the most expedient or the cheapest. A well-designed work sample test or performance management system may require a great deal of ingenuity, effort, and expense to construct (e.g., Jackson, Harris, Ashton, McCarthy, & Tremblay, 2000).
Sensitivity or Discriminability
To be useful, any criterion measure also must be sensitive—that is, capable of discriminating between effective and ineffective employees. Suppose, for example, that quantity of goods produced is used as a criterion measure in a manufacturing operation. Such a criterion frequently is used inappropriately when, because of machine pacing, everyone doing a given job produces about the same number of goods. Under these circumstances, there is little justification for using quantity of goods produced as a performance criterion, since the most effective workers do not differ appreciably from the least effective workers. Perhaps the amount of scrap or the number of errors made by workers would be a more sensitive indicator of real differences in job performance. Thus, the use of a particular criterion measure is warranted only if it reveals discriminable differences in job performance.
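A quick way to screen candidate criteria for sensitivity is simply to inspect their variability across workers. The sketch below uses hypothetical data for a machine-paced job; the specific distributions are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200

# Hypothetical machine-paced job: units produced barely vary across workers,
# while error counts do. Only the second criterion can discriminate performers.
units_produced = rng.normal(loc=500, scale=2, size=n)
errors_made = rng.poisson(lam=8, size=n)

for name, crit in [("units produced", units_produced), ("errors made", errors_made)]:
    cv = crit.std() / crit.mean()  # coefficient of variation as a rough sensitivity check
    print(f"{name}: mean={crit.mean():.1f}, sd={crit.std():.1f}, cv={cv:.3f}")
```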
It is important to point out, however, that there is no necessary association between criterion variance and criterion relevance. A criterion element as measured may have low variance, but the implications in terms of a different scale of measurement, such as dollars, may be considerable (e.g., the dollar cost of industrial accidents). In other words, the utility to the organization of what a criterion measures may not be reflected in the way that criterion is measured. This highlights the distinction between operational measures and a conceptual formulation of what is important (i.e., has high utility and relevance) to the organization (Cascio & Valenzi, 1978).
Practicality
It is important that management be informed thoroughly of the real benefits of using carefully developed criteria. Management may or may not have the expertise to appraise the soundness of a criterion measure or a series of criterion measures, but objections will almost certainly arise if record keeping and data collection for criterion measures become impractical and interfere significantly with ongoing operations. Overzealous HR researchers sometimes view organizations as ongoing laboratories existing solely for their purposes. This should not be construed as an excuse for using inadequate or irrelevant criteria. Clearly a balance must be sought, for the HR department occupies a staff role, assisting through more effective use of human resources those who are concerned directly with achieving the organization’s primary goals of profit, growth, and/or service. Keep criterion measurement practical!
Criterion Deficiency
Criterion measures differ in the extent to which they cover the criterion domain. For example, the job of university professor includes tasks related to teaching, research, and service. If job performance is measured using indicators of teaching and service only, then the measures are deficient because they fail to include an important component of the job. Similarly, if we wish to measure a manager’s flexibility, adopting a trait approach only would be deficient because managerial flexibility is a higher order construct that reflects mastery of specific and opposing behaviors in two domains: social/interpersonal and functional/organizational (Kaiser, Lindberg, & Craig, 2007).
The importance of considering criterion deficiency was highlighted by a study examining the economic utility of companywide training programs addressing managerial and sales/technical skills (Morrow, Jarrett, & Rupinski, 1997). The economic utility of training programs may differ not because of differences in the effectiveness of the programs per se, but because the criterion measures may differ in breadth. In other words, the amount of change observed in an employee’s performance after she attends a training program will depend on the percentage of job tasks measured by the evaluation criteria. A measure including only a subset of the tasks learned during training will underestimate the value of the training program.
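A back-of-the-envelope illustration of this deficiency effect follows; the per-task gain and the numbers of tasks are hypothetical values chosen only to show the arithmetic.

```python
# Hypothetical illustration: training improves performance on all 10 tasks the
# program covers, but the evaluation criterion samples only 4 of them.
true_gain_per_task = 0.40
tasks_trained, tasks_measured = 10, 4

true_program_value = tasks_trained * true_gain_per_task
observed_program_value = tasks_measured * true_gain_per_task

print(true_program_value, observed_program_value)
# The deficient criterion credits the program with only 40% of its true value,
# so comparing programs evaluated with criteria of different breadth can mislead.
```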
Criterion Contamination
When criterion measures are gathered carelessly with no checks on their worth before use either for research purposes or in the development of HR policies, they are often contaminated. Maier (1988) demonstrated this in an evaluation of the aptitude tests used to make placement decisions about military recruits. The tests were validated against hands-on job performance tests for two Marine Corps jobs: radio repairer and auto mechanic. The job performance tests were administered by sergeants who were experienced in each specialty and who spent most of their time training and supervising junior personnel. The sergeants were not given any training on how to administer and score performance tests. In addition, they received little monitoring during the four months of actual data collection, and only a single administrator was used to evaluate each examinee. The data collected were filled with errors, although subsequent statistical checks and corrections made the data salvageable. Did the “clean” data make a difference in the decisions made? Certainly. The original data yielded validities of .09 and .17 for the two specialties. However, after the data were “cleaned up,” the validities rose to .49 and .37, thus changing the interpretation of how valid the aptitude tests actually were.
Criterion contamination occurs when the operational or actual criterion includes variance that is unrelated to the ultimate criterion. Contamination itself may be subdivided into two distinct parts, error and bias (Blum & Naylor, 1968). Error by definition is random variation (e.g., due to nonstandardized procedures in testing, individual fluctuations in feelings) and cannot correlate with anything except by chance alone. Bias, by contrast, represents systematic criterion contamination, and it can correlate with predictor measures.
Criterion bias is of great concern in HR research and practice because its potential influence is so pervasive. Brogden and Taylor (1950b) offered a concise definition:
A biasing factor may be defined as any variable, except errors of measurement and sampling error, producing a deviation of obtained criterion scores from a hypothetical “true” criterion score. (p. 161)
Because the direction of the deviation from the true criterion score is not specified, biasing factors may increase, decrease, or leave unchanged the obtained validity coefficient. Biasing factors vary widely in their distortive effect, but primarily this distortion is a function of the degree of their correlation with predictors. The magnitude of such effects must be estimated and their influence controlled either experimentally or statistically. Next, we discuss three important and likely sources of bias.
Bias Due to Knowledge of Predictor Information
One of the most serious contaminants of criterion data, especially when the data are in the form of ratings, is prior knowledge of or exposure to predictor scores. In the selection of executives, for example, the assessment center method (see Chapter 13) is a popular technique. If an individual’s immediate superior has access to the prediction of this individual’s future potential by the assessment center staff and if at a later date the superior is asked to rate the individual’s performance, the supervisor’s prior exposure to the assessment center prediction is likely to bias this rating. If the subordinate has been tagged as a “shooting star” by the assessment center staff and the supervisor values that judgment, he or she, too, may rate the subordinate as a “shooting star.” If the supervisor views the subordinate as a rival, dislikes him or her for that reason, and wants to impede his or her progress, the assessment center report could serve as a stimulus for a lower rating than is deserved. In either case—spuriously high or spuriously low ratings—bias is introduced and gives an unrealistic estimate of the validity of the predictor. Because this type of bias is by definition predictor correlated, it looks like the predictor is doing a better job of predicting than it actually is; yet the effect is illusory. The rule of thumb is this: Keep predictor information away from those who must provide criterion data!
Probably the best way to guard against this type of bias is to obtain all criterion data before any predictor data are released. Thus, in attempting to validate assessment center predictions, Bray and Grant (1966) collected data at an experimental assessment center, but these data had no bearing on subsequent promotion decisions. Eight years later the predictions were validated against a criterion of “promoted versus not promoted into middle management.” By carefully shielding the predictor information from those who had responsibility for making promotion decisions, a much “cleaner” validity estimate was obtained.
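A brief simulation can show how predictor-correlated bias inflates an observed validity coefficient. All values below are hypothetical; the "leakage" term simply stands in for the rater's exposure to predictor scores.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 300

predictor = rng.standard_normal(n)  # e.g., assessment center rating
true_performance = 0.3 * predictor + np.sqrt(1 - 0.3**2) * rng.standard_normal(n)

# Unbiased criterion: ratings reflect only true performance plus rating error.
clean_ratings = true_performance + 0.5 * rng.standard_normal(n)

# Contaminated criterion: the rater has seen the predictor scores, and they
# leak into the ratings (predictor-correlated bias).
contaminated_ratings = true_performance + 0.4 * predictor + 0.5 * rng.standard_normal(n)

print(round(np.corrcoef(predictor, clean_ratings)[0, 1], 2))
print(round(np.corrcoef(predictor, contaminated_ratings)[0, 1], 2))
# The second coefficient is inflated: the predictor appears to predict
# performance better than it actually does.
```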
Bias Due to Group Membership
Criterion bias may also result from the fact that individuals belong to certain groups. In fact, sometimes explicit or implicit policies govern the hiring or promotion of these individuals. For example, some organizations tend to hire engineering graduates predominantly (or only) from certain schools. Similarly, we know of an organization that tends to promote people internally who also receive promotions in their military reserve units.
Studies undertaken thereafter that attempt to relate these biographical characteristics to subsequent career success will necessarily be biased. The same effects also will occur when a group sets artificial limits on how much it will produce.
Bias in Ratings
Supervisory ratings, the most frequently employed criteria (Aguinis, 2019; Lent, Aurbach, & Levin, 1971; Murphy & Cleveland, 1995), are susceptible to all the sources of bias in objective indices, as well as to others that are peculiar to subjective judgments (Thorndike, 1920). We discuss this problem in much greater detail in Chapter 5, but, for the present, it is important to emphasize that bias in ratings may be due to spotty or inadequate observation by the rater, unequal opportunity on the part of subordinates to demonstrate proficiency, personal biases or prejudices on the part of the rater, or an inability to distinguish and reliably rate different dimensions of job performance.
Composite Criterion Versus Multiple Criteria
Applied psychologists generally agree that job performance is multidimensional in nature and that adequate measurement of job performance requires multidimensional criteria. The next question is what to do about it. Should we combine the various criterion measures into a composite score, or should we treat each criterion measure separately? If we choose to combine the elements, what rule should we use to do so? As with the issue of utility versus understanding, both sides have had their share of vigorous proponents over the years. Let’s consider some of the arguments.
Composite Criterion
The basic contention of Brogden and Taylor (1950a), Thorndike (1949), Toops (1944), and Nagle (1953), the strongest advocates of the composite criterion, is that the criterion should provide a yardstick or overall measure of “success” or “value to the organization” of each individual. Such a single index is indispensable in decision making and individual comparisons, and even if the criterion dimensions are treated separately in validation, they must somehow be combined into a composite when a decision is required. Although the combination of multiple criteria into a composite is often done subjectively, a quantitative weighting scheme makes objective the importance placed on each criterion used to form the composite.
If a decision is made to form a composite based on several criterion measures, then the question is whether all measures should be given the same weight or not (Bobko, Roth, & Buster, 2007). Consider the possible combination of two measures reflecting customer service, one collected from external customers (i.e., those purchasing the products offered by the organization) and the other from internal customers (i.e., individuals employed in other units within the same organization). Giving these measures equal weight implies that the organization values both external and internal customer service equally. However, the organization may make the strategic decision to form the composite by giving 70% weight to external customer service and 30% weight to internal customer service. This strategic decision is likely to affect the validity coefficients between predictors and criteria. Specifically, Murphy and Shiarella (1997) conducted a computer simulation and found that 34% of the variance in the validity of a battery of selection tests was explained by the way in which measures of task and contextual performance were combined to form a composite performance score. In short, forming a composite requires careful consideration of the relative importance of each criterion measure.
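A minimal sketch of this weighting decision, using hypothetical validities for the two customer service measures, shows how the choice of weights changes the validity coefficient of a selection test against the composite.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

# Hypothetical scores: a selection test and two customer service criteria.
test = rng.standard_normal(n)
external_cs = 0.40 * test + np.sqrt(1 - 0.40**2) * rng.standard_normal(n)
internal_cs = 0.10 * test + np.sqrt(1 - 0.10**2) * rng.standard_normal(n)

def zscore(x):
    return (x - x.mean()) / x.std()

# Equal weights versus a strategic 70/30 weighting of external vs. internal service.
equal_composite = 0.5 * zscore(external_cs) + 0.5 * zscore(internal_cs)
strategic_composite = 0.7 * zscore(external_cs) + 0.3 * zscore(internal_cs)

print(f"Validity against equal-weight composite: {np.corrcoef(test, equal_composite)[0, 1]:.2f}")
print(f"Validity against 70/30 composite:        {np.corrcoef(test, strategic_composite)[0, 1]:.2f}")
```

In this sketch the test predicts external service better than internal service, so shifting weight toward external service raises the composite validity; with different component validities, the same weighting shift could lower it.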
Multiple Criteria
Advocates of multiple criteria contend that measures of demonstrably different variables should not be combined. As Cattell (1957) put it, “Ten men and two bottles of beer cannot be added to give the same total as two men and ten bottles of beer” (p. 11). Consider a study of military recruiters (Pulakos, Borman, & Hough, 1988). In measuring the effectiveness of the recruiters, the researchers found that selling skills, human relations skills, and organizing skills all were important and related to success. They also found, however, that the three dimensions were unrelated to each other—that is, the recruiter with the best selling skills did not necessarily have the best human relations skills or the best organizing skills. Under these conditions, combining the measures leads to a composite that not only is ambiguous but also is psychologically nonsensical. Guion (1961) brought the issue clearly into focus:
The fallacy of the single criterion lies in its assumption that everything that is to be predicted is related to everything else that is to be predicted—that there is a general factor in all criteria accounting for virtually all of the important variance in behavior at work and its various consequences of value. (p. 145)
Schmidt and Kaplan (1971) subsequently pointed out that combining various criterion elements into a composite does imply that there is a single underlying dimension in job performance, but it does not, in and of itself, imply that this single underlying dimension is behavioral or psychological in nature. A composite criterion may well represent an underlying economic dimension while being essentially meaningless from a behavioral point of view. Thus, Brogden and Taylor (1950a) argued that when all of the criteria are relevant measures of economic variables (dollars and cents), they can be combined into a composite, regardless of their intercorrelations.
Differing Assumptions
As Schmidt and Kaplan (1971) and Binning and Barrett (1989) have noted, the two positions differ in terms of (a) the nature of the underlying constructs represented by the respective criterion measures and (b) what they regard to be the primary purpose of the validation process itself. Let’s consider the first set of assumptions. Underpinning the arguments for the composite criterion is the assumption that the criterion should represent an economic rather than a behavioral construct. The economic orientation is illustrated in Brogden and Taylor’s (1950a) “dollar criterion:” “The criterion should measure the overall contribution of the individual to the organization” (p. 139). Brogden and Taylor argued that overall efficiency should be measured in dollar terms by applying cost accounting concepts and procedures to the employee’s individual job behaviors: “The criterion problem centers primarily on the quantity, quality, and cost of the finished product” (p. 141).
In contrast, advocates of multiple criteria (Dunnette, 1963a; Pulakos et al., 1988) argued that the criterion should represent a behavioral or psychological construct, one that is behaviorally homogeneous. Pulakos et al. (1988) acknowledged that a composite criterion must be developed when actually making employment decisions, but they also emphasized that such composites are best formed when their components are well understood.
Resolving the Dilemma
Clearly there are numerous possible uses of job performance and program evaluation criteria. In general, they may be used for research purposes or operationally as an aid in managerial decision making. When criteria are used for research purposes, the emphasis is on the psychological understanding of the relationship between various predictors and separate criterion dimensions, where the dimensions themselves are behavioral in nature. When used for managerial decision-making purposes—such as job assignment; promotion; capital budgeting; or evaluation of the cost effectiveness of recruitment, training, or advertising programs—criterion dimensions must be combined into a composite representing overall (economic) worth to the organization.
The resolution of the composite criterion versus multiple criteria dilemma essentially depends on the objectives. Both methods are legitimate for their own purposes. If the goal is increased psychological understanding of predictor–criterion relationships, then the criterion elements are best kept separate. If managerial decision making is the objective, then the criterion elements should be weighted, regardless of their intercorrelations, into a composite representing an economic construct of overall worth to the organization.
Criterion measures with theoretical relevance should not replace those with practical relevance, but rather should supplement or be used along with them. The goal, therefore, is to enhance utility and understanding.

Research Design and Criterion Theory
Traditionally, personnel psychologists were guided by a simple prediction model that sought to relate performance on one or more predictors with a composite criterion. Implicit intervening variables usually were neglected.
A more complete criterion model that describes the inferences required for the rigorous development of criteria was presented by Binning and Barrett (1989). The model is shown in Figure 4.2. Managers involved in employment decisions are most concerned about the extent to which assessment information will allow accurate predictions about subsequent job performance (Inference 9 in Figure 4.2). One general approach to justifying Inference 9 would be to generate direct empirical evidence that assessment scores relate to valid measurements of job performance. Inference 5 shows this linkage, which traditionally has been the most pragmatic concern to personnel psychologists. Indeed, the term criterion related has been used to denote this type of evidence. However, to have complete confidence in Inference 9, Inferences 5 and 8 must be justified. That is, a predictor should be related to an operational criterion measure (Inference 5), and the operational criterion measure should be related to the performance domain it represents (Inference 8).

Figure 4.2 A Modified Framework That Identifies the Inferences for Criterion Development

Source: Binning, J. F., & Barrett, G. V. (1989). Validity of personnel decisions: A conceptual analysis of the inferential and evidential bases. Journal of Applied Psychology, 74, 478–494. Copyright © 1989 American Psychological Association.

Note: Linkages in the figure begin with number 5 because earlier figures in the article used numbers 1–4 to show critical linkages in the theory-building process.
Performance domains are composed of behavior–outcome units (Binning & Barrett, 1989). Outcomes (e.g., dollar volume of sales) are valued by an organization, and behaviors (e.g., selling skills) are the means to these valued ends. Thus, behaviors take on different values, depending on the value of the outcomes. This, in turn, implies that an optimal description of the performance domain for a given job requires careful and complete representation of valued outcomes and the behaviors that accompany them. As we noted earlier, composite criterion models focus on outcomes, whereas multiple criteria models focus on behaviors. As Figure 4.2 shows, together they form a performance domain. This is why both are necessary and should continue to be used.
Inference 8 represents the process of criterion development. Usually it is justified by rational evidence (in the form of job analysis data) showing that all major behavioral dimensions or job outcomes have been identified and are represented in the operational criterion measure. In fact, work analysis (see Chapter 9) provides the evidential basis for justifying Inferences 7, 8, 10, and 11.
What personnel psychologists have traditionally implied by the term construct validity is tied to Inferences 6 and 7. That is, if it can be shown that a test (e.g., of reading comprehension) measures a specific construct (Inference 6), such as reading comprehension, that has been determined to be critical for job performance (Inference 7), then inferences about job performance from test scores (Inference 9) are, by logical implication, justified. Constructs are simply labels for behavioral regularities that underlie behavior sampled by the predictor, and, in the performance domain, by the criterion.
In the context of understanding and validating criteria, Inferences 7, 8, 10, and 11 are critical. Inference 7 is typically justified by claims, based on job analysis, that the constructs underlying performance have been identified. This process is commonly referred to as deriving job specifications. Inference 10, by contrast, represents the extent to which actual job demands have been analyzed adequately, resulting in a valid description of the performance domain. This process is commonly referred to as developing a job description. Finally, Inference 11 represents the extent to which the links between job behaviors and job outcomes have been verified. Again, job analysis is the process used to discover and to specify these links.
The framework shown in Figure 4.2 helps identify possible locations for what we have referred to as the criterion problem. This problem results from a tendency to neglect the development of adequate evidence to support Inferences 7, 8, and 10 and fosters a very shortsighted view of the process of validating criteria. It also leads predictably to two interrelated consequences: (1) the development of criterion measures that are less rigorous psychometrically than are predictor measures and (2) the development of performance criteria that are less deeply or richly embedded in networks of theoretical relationships than are constructs on the predictor side. These consequences are unfortunate, for they limit the development of theories, the validation of constructs, and the generation of evidence to support important inferences about people and their behavior at work (Binning & Barrett, 1989). Conversely, the development of evidence to support the important linkages shown in Figure 4.2 will lead to better informed staffing decisions, better career development decisions, and, ultimately, more effective organizations.
Distribution of Performance and Star Performers
Although rarely stated explicitly, there is a long-standing assumption that the distribution of criteria, and of performance in particular, follows a normal, bell-shaped distribution. For example, as we discuss in detail in Chapter 14, calculations of utility (i.e., financial results of HR interventions) usually make this assumption. If the distribution of performance is normal, the majority of individuals are grouped toward the center, and only a small minority are very poor or very good performers (see Figure 4.3).

Figure 4.3 Normal and Heavy-Tailed Performance Distributions and Their Means (µ)

Source: Adapted from Aguinis, H., & Bradley, K. J. (2015). The secret sauce for organizational success: Managing and producing star performers. Organizational Dynamics, 44, 161–168.
Challenging this long-standing implicit assumption of normality, a recent stream of research has shown that, for many jobs and occupations, the performance distribution is heavy tailed (Aguinis et al., 2016; Joo, Aguinis, & Bradley, 2017; O’Boyle & Aguinis, 2012). Figure 4.3 includes a visual representation of the assumed normal (i.e., bell-shaped) performance distribution, in which the majority of scores fall close to the mean μ (i.e., the center of the distribution), with relatively few scores falling at either the low or the high extremes. Figure 4.3 also includes a heavy-tailed distribution, shown in the gray area, with its own mean.

Figure 4.3 shows a critical difference between these two types of distributions. Specifically, under a heavy-tailed distribution, we expect to see many “star performers” (i.e., those very far to the right of the mean). By contrast, under a normal distribution, the presence of such extreme scores is considered an anomaly. In fact, if one assumes that performance is distributed normally, the presence of such extreme scores is something that needs to be “corrected” through data transformations or outlier-management techniques that could even involve deleting these “abhorrent data points” (Aguinis, Gottfredson, & Joo, 2013b). Also, Figure 4.3 shows that output (i.e., area under the curves) is such that under a heavy-tailed distribution, a minority of individuals are responsible for producing a disproportionate quantity of results. This is not the case in normal distributions.
Do performance distributions follow normal or heavy-tailed distributions? Consider the evidence based on more than 600,000 individual workers, including publications authored by more than 25,000 researchers across more than 50 scientific fields, as well as productivity metrics collected from movie directors, writers, musicians, athletes, bank tellers, call-center employees, grocery checkers, electrical fixture assemblers, and wirers. Results suggest that at least 75% of distributions follow heavy-tailed and not normal distributions (Joo et al., 2017).
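The kind of comparison behind these findings can be sketched in a few lines. The figures below assume roughly 25,000 researchers and use purely illustrative distribution parameters; they show (a) how few individuals should exceed three standard deviations above the mean if productivity were normally distributed and (b) how differently output concentrates at the top under a normal versus a heavy-tailed (Pareto) distribution.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Expected number of researchers more than 3 SD above the mean if productivity
# were normally distributed across roughly 25,000 researchers.
n_researchers = 25_000
expected_stars = n_researchers * norm.sf(3)     # P(Z > 3) is about .00135
print(f"Expected count beyond +3 SD under normality: {expected_stars:.0f}")

# Share of total output produced by the top 10% under a normal versus a
# heavy-tailed (Pareto) productivity distribution (illustrative parameters only).
normal_output = np.clip(rng.normal(10, 2, n_researchers), 0, None)
pareto_output = rng.pareto(1.5, n_researchers) + 1

def top_decile_share(x):
    cutoff = np.quantile(x, 0.90)
    return x[x >= cutoff].sum() / x.sum()

print(f"Top 10% share, normal distribution:       {top_decile_share(normal_output):.0%}")
print(f"Top 10% share, heavy-tailed distribution: {top_decile_share(pareto_output):.0%}")
```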
Let’s illustrate these results more concretely, using Figure 4.3 as a reference. If research performance data followed a normal distribution, there should be approximately 35 researchers with about 10 publications or more each (three standard deviations above the mean). In contrast, results showed that 460 individuals produced that high number of scientific publications. This number is more than 13 times as many as we would expect if the distribution were normal. Now, consider results for about 3,300 artists who have been nominated for a Grammy award. Five of them would be expected to receive at least 10 nominations under a normal performance distribution. However, 64 artists have received more than 10 nominations. Although this is not true for all jobs and all occupations (Beck et al., 2014; Vancouver, Li, Weinhardt, Steel, & Purl, 2016), it seems that performance distributions are not normal in many, if not most, cases. Consequently, star performers should not be treated as extreme scores or data anomalies that should be “fixed.” Rather, the presence of heavy-tailed distributions has a number of important consequences for research in terms of understanding why and when star performers emerge and for practice in terms of how to produce star performers (Aguinis & Bradley, 2015; Aguinis & O’Boyle, 2014):
· It is important to minimize situational constraints (i.e., ceiling constraints) faced by workers to allow for the emergence of star performers. For example, what resources are needed to facilitate the emergence of stars?
· Allow star performers to rotate across teams because this widens their network and takes full advantage of knowledge transfer to rising stars.
· Invest sufficient resources in star performers who are making clear contributions to an organization’s core strategic objectives.
· Retain stars by paying attention to their developmental network (e.g., employment opportunities for spouses and long-term contracting with a star’s subordinates).
· In times of financial challenges and budget cuts, pay special attention to star performers because once they leave, it will be difficult for an organization to recover. In fact, the departure of a star can create a downward spiral of production when average or even mediocre performers replace stars.
· Give star performers preferential treatment, but clearly articulate these perks to all workers and apply them fairly. In other words, anyone can receive the perks if he or she achieves a high level of performance.
· Invest a disproportionate amount of resources in stars, which will likely generate greater overall output and create positive gains.
· Conversely, the easiest way to avoid producing star performers is to use non-performance-based incentives, encourage limited pay dispersion, and implement longevity-based promotion decisions. Doing so emphasizes homogeneous employee performance.
In conclusion, if the performance distribution is heavy tailed rather than normal, then a minority of employees are responsible for a very large and disproportionate amount of results—be it dollars, publications, or Grammy nominations. Obviously, it is in the best interest of organizations not to ignore the presence of star performers but rather to actively attempt to recruit, produce, and retain these stars.

Evidence-Based Implications for Practice

· The effectiveness and future progress of our knowledge of HR interventions depend fundamentally on careful, accurate criterion measurement.
· It is important to conceptualize the job performance domain broadly and to consider job performance as in situ performance (i.e., the specification of the broad range of effects—situational, contextual, strategic, and environmental—that may affect individual, team, or organizational performance).
· The notion of criterion relevance requires prior theorizing and development of the dimensions that comprise the domain of performance.
· Organizations must first formulate clear ultimate objectives and then develop appropriate criterion measures that represent economic or behavioral constructs—these could involve behaviors and results. Criterion measures must pass the tests of relevance, sensitivity, and practicality.
· Conclusions reached can depend on (1) the particular criterion measures used, (2) the time of measurement, (3) conditions outside an individual’s control, and (4) distortions and biases inherent in the situation or the measuring instrument (human or otherwise).
· Because there may be many paths to success, a broader, richer schematization of job performance is needed.
· Star performers are responsible for a disproportionate quantity of results.

Chapter 5:

5 PERFORMANCE APPRAISAL AND MANAGEMENT

Wayne F. Cascio, Herman Aguinis

Learning Goals

By the end of this chapter, you will be able to do the following:
· 5.1 Design a performance management system that attains multiple purposes (e.g., strategic, communication, and developmental; organizational diagnosis, maintenance, and development)
· 5.2 Distinguish important differences between performance appraisal and performance management (i.e., ongoing process of performance development aligned with strategic goals), and confront the multiple realities and challenges of performance management systems
· 5.3 Design a performance management system that meets requirements for success (e.g., congruence with strategy, practicality, meaningfulness)
· 5.4 Take advantage of the multiple benefits of state-of-the-science performance management systems, and design systems that consider (a) advantages and disadvantages of using different sources of performance information, (b) situations when performance should be assessed at individual or group levels, and (c) multiple types of biases in performance ratings and strategies for minimizing them
· 5.5 Distinguish between objective and subjective measures of performance and advantages of using each type, and design performance management systems that include relative and absolute rating systems
· 5.6 Create graphic rating scales for jobs and different types of performance dimensions, including behaviorally anchored ratings scales (BARS)
· 5.7 Consider rater and ratee personal and job-related factors that affect appraisals, and design performance management systems that consider individual as well as team performance, along with training programs to improve rating accuracy
· 5.8 Place performance appraisal and management systems within a broader social, emotional, and interpersonal context. Conduct a performance appraisal and goal-setting interview.

Performance management is a “continuous process of identifying, measuring, and developing the performance of individuals and teams and aligning performance with the strategic goals of the organization” (Aguinis, 2019, p. 4). It is not a one-time event that takes place during the annual performance-review period. Rather, performance is assessed at regular intervals, and feedback is provided so that performance is improved on an ongoing basis. Performance appraisal is the systematic description of job-relevant strengths and weaknesses within and between employees or groups. It is a critical component of all performance management systems. Researchers and practitioners have been fascinated by how to measure and improve performance for decades; yet their overall inability to resolve definitively the knotty technical and interpersonal problems of performance appraisal and management has led one reviewer to term it the “Achilles heel” of human resource management (Heneman, 1975). This statement still applies today (DeNisi & Murphy, 2017). Supervisors and subordinates alike are intensely aware of the political and practical implications of the ratings and, in many cases, are acutely ill at ease during performance appraisal interviews. Despite these shortcomings, surveys of managers from both large and small organizations consistently show that they are unwilling to abandon performance management. For example, a survey of performance management systems and practices in 278 organizations across 15 countries found that about 90% use a company-sanctioned performance management system (Cascio, 2011).
Many treatments of performance management scarcely contain a hint of the emotional overtones, the human problems, so intimately bound up with it (Aguinis, 2019). Traditionally, researchers have placed primary emphasis on technical issues—for example, the advantages and disadvantages of various rating systems, sources of error, and problems of unreliability in performance observation and measurement (Aguinis & Pierce, 2008). To be sure, these are vitally important concerns. No less important, however, are the human issues involved, for performance management is not merely a technique—it is a process, a dialogue involving both people and data, and this process also includes social, motivational, and interpersonal aspects (Fletcher, 2001). In addition, performance management needs to be placed within the broader context of the organization’s vision, mission, and strategic priorities. A performance management system will not be successful if it is not linked explicitly to broader work unit and organizational goals.
In this chapter, we focus on both the measurement and the social/motivational aspects of performance management. As HR specialists, our task is to make the formal process as meaningful and workable as present research and knowledge will allow.

Purposes Served

Performance management systems that are designed and implemented well can serve several important purposes:
· Performance management systems serve a strategic purpose because they help link employee activities with the organization’s mission and goals. Well-designed performance management systems identify the behaviors and results needed to carry out the organization’s strategic priorities and maximize the extent to which employees exhibit the desired behaviors and produce the intended results.
· Performance management systems serve an important communication purpose because they allow employees to know how they are doing and what the organizational expectations are regarding their performance. They convey the aspects of work the supervisor and other organization stakeholders believe are important.
· Performance management systems can serve as bases for employment decisions — decisions to promote outstanding performers; to terminate marginal or low performers; to train, transfer, or discipline others; and to award merit increases (or no increases). In short, information gathered by the performance management system can serve as predictors and, consequently, as key inputs for administering a formal organizational reward and punishment system (Cummings, 1973), including promotional decisions.
· Data regarding employee performance can serve as criteria in HR research (e.g., in test validation).
· Performance management systems also serve a developmental purpose because they can help establish objectives for training programs based on concrete feedback. To improve performance in the future, an employee needs to know what his or her weaknesses were in the past and how to correct them in the future. Pointing out strengths and weaknesses is a coaching function for the supervisor; receiving meaningful feedback and acting on it constitute a motivational experience for the subordinate. Thus, performance management systems can serve as vehicles for personal development.
· Performance management systems can facilitate organizational diagnosis, maintenance, and development. Proper specification of performance levels, in addition to suggesting training needs across units and indicating necessary skills to be considered when hiring, is important for HR planning and HR evaluation. It also establishes the more general organizational requirement of ability to discriminate effective from ineffective performers. Appraising employee performance, therefore, represents the beginning of a process rather than an end product (Jacobs, Kafry, & Zedeck, 1980).
· Finally, performance management systems allow organizations to keep proper records to document HR decisions and legal requirements.

Realities and Challenges of Performance Management Systems

Independent of any organizational context, the implementation of performance management systems at work confronts organizations with five realities (Ghorpade & Chen, 1995):
· This activity is inevitable in all organizations, large and small, public and private, and domestic and multinational. Organizations need to know if individuals are performing competently, and, in the current legal climate, appraisals are essential features of an organization’s defense against challenges to adverse employment actions, such as terminations or layoffs.
· Appraisal is fraught with consequences for individuals (rewards and punishments) and organizations (the need to provide appropriate rewards and punishments based on performance).
· As job complexity increases, it becomes progressively more difficult, even for well-meaning appraisers, to assign accurate, merit-based performance ratings.
· When evaluating coworkers, there is an ever-present danger of the parties being influenced by the political consequences of their actions—rewarding allies and punishing enemies or competitors (Longenecker & Gioia, 1994).
· The implementation of performance management systems takes time and effort, and participants (those who rate performance and those whose performance is rated) must be convinced the system is useful and fair. Otherwise, the system may carry numerous negative consequences. For example, high-performing employees may quit, time and money may be wasted, and adverse legal consequences may result.
Overall, these five realities involve several political and interpersonal challenges. Political challenges stem from deliberate attempts by raters to enhance or to protect their self-interests when conflicting courses of action are possible. Political considerations are facts of organizational life (Westphal & Clement, 2008). Appraisals take place in an organizational environment that is anything but completely rational, straightforward, or dispassionate. It appears that achieving accuracy in appraisal is less important to managers than motivating and rewarding their subordinates. Many managers will not allow excessively accurate ratings to cause problems for themselves, and they attempt to use the appraisal process to their own advantage. Interpersonal challenges arise from the actual face-to-face encounter between subordinate and superior. Because of a lack of communication, employees may think they are being judged according to one set of standards when their superiors actually use different ones. Furthermore, supervisors often delay or resist making face-to-face appraisals. Rather than confronting substandard performers with low ratings, negative feedback, and below-average salary increases, supervisors often find it easier to “damn with faint praise” by giving average or above-average ratings to inferior performers (Benedict & Levine, 1988). Finally, some managers complain that formal performance appraisal interviews tend to interfere with the more constructive coaching relationship that should exist between superior and subordinate. They claim that appraisal interviews emphasize the superior position of the supervisor by placing him or her in the role of judge, which conflicts with the supervisor’s equally important roles of teacher and coach (Meyer, 1991).
This, then, is the performance management dilemma: It is widely accepted as a potentially useful tool, but political and interpersonal barriers often thwart its successful implementation. There is currently an intense debate in both research and practitioner circles on how to solve this dilemma. In recent years, some large organizations including Accenture, Deloitte, Microsoft, Gap, Inc., and Eli Lilly chose to abandon or substantially curtail their use of performance appraisal (Adler et al., 2016), but most of them later realized that appraisals are critical given the purposes listed earlier (Hunt, 2016).
Much of the research on appraisals has focused on measurement issues. This is important, but HR professionals may contribute more by improving the attitudinal and interpersonal components of performance appraisal systems, as well as their technical aspects. Let’s begin by considering the fundamental requirements for a best-in-class performance management system.

Fundamental Requirements of Successful Performance Management Systems

For any performance management system to be used successfully, it should have the following nine characteristics (Aguinis, 2019; Aguinis, Joo, & Gottfredson, 2011):
· Congruence with strategy: The system should measure and encourage behaviors that will help achieve organizational goals.
· Thoroughness: All employees should be evaluated, all key job-related responsibilities should be measured, and evaluations should cover performance for the entire time period included in any specific review.
· Practicality: The system should be available, plausible, acceptable, and easy to use, and its benefits should outweigh its costs.
· Meaningfulness: Performance measurement should include only matters under the employee’s control, appraisals should occur at regular intervals, the system should provide for continuing skill development of raters and ratees, results should be used for important HR decisions, and implementation of the system should be seen as an important part of everyone’s job.
· Specificity: The system should provide specific guidance to both raters and ratees about what is expected of them and also how they can meet these expectations.
· Discriminability: The system should allow for clear differentiation between effective and ineffective performance and performers.
· Reliability and validity: Performance scores should be consistent over time and across raters observing the same behaviors (see Chapter 6) and should not be deficient or contaminated (see Chapter 4).
· Inclusiveness: Successful systems allow for the active participation of raters and ratees, including in the design of the system (Kleingeld, Van Tuijl, & Algera, 2004). This includes allowing ratees to provide their own performance evaluations and to assume an active role during the appraisal interview, and allowing both raters and ratees an opportunity to provide input in the design of the system.
· Fairness and acceptability: Participants should view the process and outcomes of the system as being just and equitable.
Several studies have investigated these characteristics, which dictate the success of performance management systems (Cascio, 1982). For example, regarding meaningfulness, a study including 176 Australian government workers indicated that the system’s meaningfulness (i.e., perceived consequences of implementing the system) was an important predictor of the decision to adopt or reject a system (Langan-Fox, Waycott, Morizzi, & McDonald, 1998). Regarding inclusiveness, a meta-analysis of 27 studies, including 32 individual samples, found that the overall correlation between employee participation and employee reactions to the system (corrected for unreliability) was .61 (Cawley, Keeping, & Levy, 1998). Specifically, the benefits of designing a system in which ratees are given a “voice” included increased satisfaction with the system, increased perceived utility of the system, increased motivation to improve performance, and increased perceived fairness of the system (Cawley et al., 1998).
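The phrase "corrected for unreliability" refers to the standard correction for attenuation. A minimal sketch is shown below; the observed correlation and reliabilities are hypothetical illustrations, not values reported by Cawley et al. (1998).

```python
# Correction for attenuation (unreliability), as used in meta-analyses such as
# Cawley et al. (1998). The inputs below are hypothetical.
def correct_for_attenuation(r_observed, rel_x, rel_y):
    """Estimate the correlation between true scores, given an observed
    correlation and the reliabilities of the two measures."""
    return r_observed / (rel_x * rel_y) ** 0.5

# Example: an observed correlation of .45, with assumed reliabilities of .80
# and .70, corrects to about .60.
print(round(correct_for_attenuation(0.45, 0.80, 0.70), 2))
```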
Taken together, the nine key characteristics indicate that performance appraisal should be embedded in the broader performance management system and that a lack of understanding of the context surrounding the appraisal is likely to result in a failed system. With that in mind, let’s consider the benefits of state-of-the-science performance management systems.

Benefits of State-of-the-Science Performance Management Systems
When performance management systems are implemented following the requirements described in the previous section, they can be a clear source of competitive advantage (Aguinis, Joo, & Gottfredson, 2011). Specifically, such state-of-the-science systems benefit employees, managers, and organizations. For example, as shown in Table 5.1, employees understand what is expected of them and learn about their own strengths and weaknesses, which is useful information for their own personal development. Similarly, managers obtain insights regarding their subordinates and are able to obtain more precise and differentiating information that is necessary for making administrative decisions (e.g., promotions, compensation decisions), as well as for creating personal development plans. Finally, organizations are able to implement policies that are fair, standardized, and acceptable. Overall, the way to solve the dilemma mentioned earlier is not to get rid of performance appraisal and management, but to implement systems following best-practice recommendations based on the available empirical evidence.

Table 5.1 Benefits From Implementing a State-of-the-Science Performance Management System

For employees

Increased self-esteem
Better understanding of the behaviors and results required of their positions
Better understanding of ways to maximize their strengths and minimize weaknesses

For managers

Development of a workforce with a heightened motivation to perform
Greater insight into their subordinates
Better differentiation between good and poor performers
Clearer communication to employees about employees’ performance

For organizations

Increased appropriateness of administrative actions
Reduction in employee misconduct
Better protection from lawsuits
Enhanced employee engagement

Source: Adapted from Aguinis, Joo, and Gottfredson (2011).
Who Shall Rate?
In view of the purposes served by performance management, who does the rating is important. In addition to being cooperative and trained in the techniques of rating, raters must have direct experience with, or firsthand knowledge of, the individual to be rated. In many jobs, individuals with varying perspectives have such firsthand knowledge. Following are descriptions of five of these perspectives that will help answer the question of who shall rate performance.
Immediate Supervisor
The supervisor is probably the person best able to evaluate each subordinate’s performance in light of the organization’s overall objectives. Since the supervisor is probably also responsible for reward (and punishment) decisions such as pay, promotion, and discipline, he or she must be able to tie effective (ineffective) performance to the employment actions taken. Inability to form such linkages between performance and punishment or reward is one of the most serious deficiencies of any performance management system.
However, in jobs such as teaching, law enforcement, or sales and in self-managed work teams, the supervisor may only rarely observe his or her subordinate’s performance directly. In addition, performance ratings provided by the supervisor may reflect not only whether an employee is helping advance organizational objectives but also whether the employee is contributing to goals valued by the supervisor, which may or may not be congruent with organizational goals (Hogan & Shelton, 1998). Moreover, if a supervisor has recently received a positive evaluation regarding his or her own performance, he or she is also likely to provide a positive evaluation regarding his or her subordinates (Latham, Budworth, Yanar, & Whyte, 2008). Fortunately, several other perspectives can be used to provide a fuller picture of the individual’s total performance.
Peers

Peer assessment refers to three of the more basic methods used by members of a well-defined group in judging each other’s job performance. These include peer nominations, most useful for identifying persons with extremely high or low levels of performance; peer rating, most useful for providing feedback; and peer ranking, best at discriminating various levels of performance from highest to lowest on each dimension.
Reviews of peer assessment methods reached favorable conclusions regarding the reliability, validity, and freedom from biases of this source of performance information (e.g., Kane & Lawler, 1978). However, some problems still remain. First, two characteristics of peer assessments appear to be related significantly and independently to user acceptance (McEvoy & Buller, 1987). Perceived friendship bias is related negatively to user acceptance, and use for developmental purposes is related positively to user acceptance. How do people react upon learning that they have been rated poorly (favorably) by their peers? Research in a controlled setting indicates that such knowledge has predictable effects on group behavior. Negative peer-rating feedback produces significantly lower perceived performance of the group, plus lower cohesiveness, satisfaction, and peer ratings on a subsequent task. Positive peer-rating feedback produces nonsignificantly higher values for these variables on a subsequent task (DeNisi, Randolph, & Blencoe, 1983). One possible solution that might simultaneously increase feedback value and decrease the perception of friendship bias is to specify clearly (e.g., using critical incidents) the performance criteria on which peer assessments are based. Results of the peer assessment may then be used in joint employee–supervisor reviews of each employee’s progress, prior to later administrative decisions concerning the employee.
A second problem with peer assessments is that they seem to include more common method variance than assessments provided by other sources. Method variance is the variance observed in a performance measure that is not relevant to the behaviors assessed, but instead is due to the method of measurement used (Brannick, Chan, Conway, Lance, & Spector, 2010; Conway, 2002). For example, Conway (1998) reanalyzed supervisor, peer, and self-ratings for three performance dimensions (i.e., altruism-local, conscientiousness, and altruism-distant) and found that the proportion of method variance for peers was .38, whereas the proportion of method variance for self-ratings was .22. This finding suggests that relationships among various performance dimensions, as rated by peers, can be inflated substantially due to common method variance (Conway, 1998).
Several data-analysis methods are available to estimate the amount of method variance present in a peer-assessment measure (Conway, 1998; Williams, Hartman, & Cavazotte, 2010). At the very least, the assessment of common method variance can provide HR researchers and practitioners with information regarding the extent of the problem. In addition, Podsakoff, MacKenzie, Lee, and Podsakoff (2003) proposed two types of remedies to address this problem:
· Procedural remedies: These include obtaining measures of the predictor and criterion variables from different sources; separating the measurement of the predictor and criterion variables (i.e., temporal, psychological, or methodological separation); protecting respondent anonymity, thereby reducing socially desirable responding; counterbalancing the question order; and improving scale items.
· Statistical remedies: These include utilizing Harman’s single-factor test (i.e., to determine whether all items load into one common underlying factor, as opposed to the various factors hypothesized); computing partial correlations (e.g., partialling out social desirability, general affectivity, or a general factor score); controlling for the effects of a directly measured latent methods factor; controlling for the effects of a single, unmeasured, latent method factor; implementing the correlated uniqueness model (i.e., where a researcher identifies the sources of method variance so that the appropriate pattern of measurement-error corrections can be estimated); and utilizing the direct-product model (i.e., which models trait-by-method interactions).
The overall recommendation is to follow all the procedural remedies listed here, but the statistical remedies to be implemented depend on the specific characteristics of the situation faced (Podsakoff et al., 2003).
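To illustrate the first statistical remedy named in the list above, the sketch below simulates peer ratings of four distinct performance dimensions that share a common method factor and then applies a common operationalization of Harman's single-factor test, checking how much variance an unrotated first factor accounts for. The simulated data and the principal-components shortcut are illustrative assumptions, not a prescribed procedure.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000

# Simulated peer ratings on four performance dimensions. Each rating reflects a
# distinct trait plus a shared "method" factor (e.g., a rater's general leniency).
method = rng.standard_normal(n)
ratings = np.column_stack([
    0.6 * rng.standard_normal(n) + 0.6 * method,
    0.6 * rng.standard_normal(n) + 0.6 * method,
    0.6 * rng.standard_normal(n) + 0.6 * method,
    0.6 * rng.standard_normal(n) + 0.6 * method,
])

# Harman's single-factor test is often operationalized as an unrotated
# single-factor (first-principal-component) solution: if one factor accounts for
# the majority of the variance, common method variance is a plausible concern.
corr = np.corrcoef(ratings, rowvar=False)
eigenvalues = np.linalg.eigvalsh(corr)[::-1]          # largest first
first_factor_share = eigenvalues[0] / eigenvalues.sum()
print(f"Variance accounted for by the first factor: {first_factor_share:.0%}")
```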
Given our discussion thus far, peer assessments are probably best considered as only one element in a system that includes input from all sources that have unique information or perspectives to offer. Thus, the behaviors and outcomes to be assessed should be considered in the context of the groups and situations in which peer assessments are to be applied. It is impossible to specify, for all situations, the kinds of characteristics that peers are able to rate best.
Subordinates
Subordinates offer a somewhat different perspective on a manager’s performance. They know directly the extent to which a manager does or does not delegate, the extent to which he or she plans and organizes, the type of leadership style(s) he or she is most comfortable with, and how well he or she communicates. This is why subordinate ratings often provide information that accounts for variance in performance measures over and above other sources (Conway, Lombardo, & Sanders, 2001). This approach is used regularly by universities (students evaluate faculty) and sometimes by large corporations, where a manager may have many subordinates. In small organizations, however, considerable trust and openness are necessary before subordinate appraisals can pay off.
They can pay off, though. For example, a study in a public institution with about 2,500 employees that performs research, development, tests, and evaluation in South Korea provided evidence of the benefits of upward appraisals—particularly long-term benefits (Jhun, Bae, & Rhee, 2012). Functional managers received upward feedback once a year during a period of seven years. For purposes of the analysis, they were divided into low, medium, and high performers. Results showed that those in the low-performing group benefited the most. Moreover, when upward feedback was used for administrative rather than developmental purposes, the impact on performance improvement was even larger.
Subordinate ratings have been found to be valid predictors of subsequent supervisory ratings over two-, four-, and seven-year periods (McEvoy & Beatty, 1989). One reason for this may have been that multiple ratings on each dimension were made for each manager, and the ratings were averaged to obtain the measure for the subordinate perspective. Averaging has several advantages. First, averaged ratings are more reliable than single ratings. Second, averaging helps to ensure the anonymity of the subordinate raters. Anonymity is important; subordinates may perceive the process to be threatening, since the supervisor can exert administrative controls (salary increases, promotions, etc.). In fact, when the identity of subordinates is disclosed, inflated ratings of managers’ performance tend to result (Antonioni, 1994).
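The reliability advantage of averaging can be quantified with the Spearman-Brown formula. The single-rater reliability of .40 below is a hypothetical value, not one taken from McEvoy and Beatty (1989).

```python
# Spearman-Brown estimate of the reliability of an average of k raters, given
# the reliability (here, the interrater correlation) of a single rater.
def spearman_brown(single_rater_reliability, k):
    r = single_rater_reliability
    return (k * r) / (1 + (k - 1) * r)

for k in (1, 3, 5, 10):
    print(f"{k:>2} raters averaged -> estimated reliability {spearman_brown(0.40, k):.2f}")
```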
Any organization contemplating use of subordinate ratings should pay careful attention to the intended purpose of the ratings. Evidence indicates that ratings used for salary administration or promotion purposes may be more lenient than those used for guided self-development (Zedeck & Cascio, 1982). In general, subordinate ratings are of significantly better quality when used for developmental purposes rather than administrative purposes (Greguras, Robie, Schleicher, & Goff, 2003).
Self
It seems reasonable to have each individual judge his or her own job performance. On the positive side, we can see that the opportunity to participate in performance appraisal, especially if it is combined with goal setting, should improve the individual’s motivation and reduce his or her defensiveness during an appraisal interview. Research to be described later in this chapter clearly supports this view. On the negative side, comparisons with appraisals by supervisors, peers, and subordinates suggest that self-appraisals tend to show more leniency, less variability, more bias, and less agreement with the judgments of others (Atkins & Wood, 2002; Harris & Schaubroeck, 1988). This seems to be the norm in Western cultures. In Taiwan, however, modesty bias (self-ratings lower than those of supervisors) has been found (Farh, Dobbins, & Cheng, 1991), although this may not be the norm in all Eastern cultures (Barron & Sackett, 2008).
To some extent, idiosyncratic aspects of self-ratings may stem from the tendency of raters to base their scores on different aspects of job performance or to weight facets of job performance differently. Self- and supervisor ratings agree much more closely when both parties have a thorough knowledge of the appraisal system or process (Williams & Levy, 1992). In addition, self-ratings are less lenient when done for self-development purposes rather than for administrative purposes (Meyer, 1991). Moreover, self-ratings of contextual performance are more lenient than peer ratings when individuals are high on self-monitoring (i.e., tending to control self-presentational behaviors) and social desirability (i.e., tending to attempt to make oneself look good) (Mersman & Donaldson, 2000). The situation is far from hopeless, however. To improve the validity of self-appraisals, consider four research-based suggestions (Campbell & Lee, 1988; Fox & Dinur, 1988; Mabe & West, 1982):
· Instead of asking individuals to rate themselves on an absolute scale (e.g., a scale ranging from “poor” to “average”), provide a relative scale that allows them to compare their performance with that of others (e.g., “below average,” “average,” “above average”). In addition, providing comparative information on the relative performance of coworkers promotes closer agreement between self-appraisal and supervisor rating (Farh & Dobbins, 1989).
· Provide multiple opportunities for self-appraisal, for the skill being evaluated may well be one that improves with practice.
· Provide reassurance of confidentiality—that is, that self-appraisals will not be “publicized.”
· Focus on the future—specifically on predicting future behavior.
Until the problems associated with self-appraisals can be resolved, however, they seem more appropriate for counseling and development than for employment decisions.
Clients Served
Another group that may offer a different perspective on individual performance in some situations is that of clients served. In jobs that require a high degree of interaction with the public or with particular individuals (e.g., purchasing managers, suppliers, and sales representatives), appraisal sometimes can be done by the consumers of the organization’s services. Although the clients served cannot be expected to identify completely with the organization’s objectives, they can, nevertheless, provide useful information. Such information may affect employment decisions (promotion, transfer, need for training), but it also can be used in HR research (e.g., as a criterion in validation studies or in the measurement of training outcomes on the job) or as a basis for self-development activities.
Appraising Performance: Individual Versus Group Tasks
So far, we have assumed that ratings are assigned on an individual basis. That is, each source—be it the supervisor, peer, subordinate, self, or client—makes the performance judgment individually and independently from other individuals. However, in practice, appraising performance is not strictly an individual task. A survey of 135 raters from six organizations indicated that 98.5% of raters reported using at least one secondhand (i.e., indirect) source of performance information (Raymark, Balzer, & De La Torre, 1999). In other words, supervisors often use information from outside sources in making performance judgments. Moreover, supervisors may change their own ratings in the presence of indirect information. For example, a study including participants with at least two years of supervisory experience revealed that supervisors are likely to change their ratings when the ratee’s peers provide information perceived as useful (Makiney & Levy, 1998). A follow-up study that included students from a Canadian university revealed that indirect information is perceived to be most useful when it is in agreement with the rater’s direct observation of the employee’s performance (Uggerslev & Sulsky, 2002). For example, when a supervisor’s judgment about a ratee’s performance is positive, positive indirect observation produced higher ratings than negative indirect information. In addition, it seems that the presence of indirect information is more likely to change ratings from positive to negative than from negative to positive (Uggerslev & Sulsky, 2002). In sum, although direct observation is the main influence on ratings, the presence of indirect information is likely to affect ratings.
If the process of assigning performance ratings is not entirely an individual task, might it pay off to formalize performance appraisals as a group task? One study found that groups are more effective than individuals at remembering specific behaviors over time, but that groups also demonstrate greater response bias (Martell & Borg, 1993). In a second related study, individuals observed a 14-minute military training videotape of five men attempting to build a bridge of rope and planks in an effort to get themselves and a box across a pool of water. Before observing the tape, study participants were given indirect information in the form of a positive or negative performance cue [i.e., “the group you will observe was judged to be in the top (bottom) quarter of all groups”]. Then ratings were provided individually or in the context of a four-person group (the group task required that the four group members reach consensus). Results showed that ratings provided individually were affected by the performance cue, but that ratings provided by the groups were not (Martell & Leavitt, 2002).
These results suggest that groups can be of help, but they are not a cure-all for the problems of rating accuracy. Groups can be a useful mechanism for improving the accuracy of performance appraisals under two conditions. First, the task needs to have an objectively correct answer. For example, is the behavior present or not? Second, the magnitude of the performance cue should not be too large. If the performance facet in question is subjective (e.g., “what is the management potential for this employee?”) and the magnitude of the performance cue is large, group ratings may amplify instead of attenuate individual biases (Martell & Leavitt, 2002).
In summary, there are several sources of appraisal information, and each provides a different perspective, a different piece of the puzzle. The various sources and their potential uses are shown in Table 5.2. Several studies indicate that data from multiple sources (e.g., self, supervisors, peers, subordinates) are desirable because they provide a complete picture of the individual’s effect on others (Borman, White, & Dorsey, 1995; Murphy & Cleveland, 1995; Wohlers & London, 1989).

Table 5.2 Sources and Uses of Appraisal Data

Use                     Supervisor   Peers   Subordinates   Self   Clients Served
Employment decisions        x          x                                  x
Self-development            x          x          x           x           x
HR research                 x          x                                  x

Putting It All Together: 360-Degree Systems
As is obvious by now, the different sources of performance information are not mutually exclusive. So-called 360-degree feedback systems broaden the base of appraisals by including input from self, peers, subordinates, and (in some cases) clients. Moreover, there are several advantages to using these systems compared to a single source of performance information (Campion, Campion, & Campion, 2015). First, 360-degree feedback systems result in improved reliability of performance information because it originates from multiple sources and not just one source. Second, they consider a broader range of performance information, which is particularly useful in terms of minimizing criterion deficiency (as discussed in Chapter 4). Third, they usually include information not only on task performance but also on contextual performance and counterproductive work behaviors, which are all important given the multidimensional nature of performance. Finally, because multiple sources and individuals are involved, 360-degree systems have great potential to decrease biases—particularly compared to systems involving a single source of information.
For such systems to be effective, however, it is important to consider the following issues (Bracken & Rose, 2011):
· Relevant content: The definition of success, regardless of the source providing ratings, needs to be clear and aligned with strategic organizational goals.
· Data credibility: Each source needs to be perceived as capable and able to assess the performance dimensions assigned to it.
· Accountability: Each participant in the system needs to be motivated to provide reliable and valid information—to the best of his or her ability.
· Participation: Successful systems are typically implemented organizationwide rather than in specific units. This type of implementation will also facilitate acceptance.
Agreement and Equivalence of Ratings Across Sources
To assess the degree of interrater agreement within rating dimensions (convergent validity) and to assess the ability of raters to make distinctions in performance across dimensions (discriminant validity), a matrix listing dimensions as rows and raters as columns might be prepared (Lawler, 1967). As we noted earlier, however, multiple raters for the same individual may be drawn from different organizational levels, and they probably observe different facets of a ratee’s job performance (Bozeman, 1997). This may explain, in part, why the overall correlation between subordinate and self-ratings (corrected for unreliability) is only .14, the correlation between subordinate and supervisor ratings (also corrected for unreliability) is .22 (Conway & Huffcutt, 1997), and the correlation between self and supervisory ratings is also only .22 (Heidemeier & Moser, 2009). Hence, having interrater agreement for ratings on all performance dimensions across organizational levels not only is an unduly severe expectation but also may be erroneous. Although we should not always expect agreement, we should expect that the construct underlying the measure used should be equivalent across raters. In other words, does the underlying trait measured across sources relate to observed rating scale scores in the same way across sources? In general, it does not make sense to assess the extent of interrater agreement without first establishing measurement equivalence (also called measurement invariance) because a lack of agreement may be due to a lack of measurement equivalence (Cheung, 1999). A lack of measurement equivalence means that the underlying characteristics being measured are not on the same psychological measurement scale, which implies that differences across sources are possibly artifactual, contaminated, or misleading (Maurer, Raju, & Collins, 1998).
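The phrase "corrected for unreliability" refers to the standard correction for attenuation, which divides an observed correlation by the square root of the product of the two sources' reliabilities. A minimal sketch of that calculation in Python follows; the reliability and correlation values are hypothetical, not taken from the studies cited.

    # Correction for attenuation: the observed correlation between two rating
    # sources is divided by the square root of the product of their reliabilities.
    def correct_for_attenuation(r_observed, reliability_a, reliability_b):
        return r_observed / (reliability_a * reliability_b) ** 0.5

    # Hypothetical values: an observed self-supervisor correlation of .15 and a
    # reliability of .70 for each source yields a corrected correlation of about .21.
    print(round(correct_for_attenuation(0.15, 0.70, 0.70), 2))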
Fortunately, there is evidence that measurement equivalence is present in many appraisal systems. Specifically, measurement equivalence was found in a measure of managers’ team-building skills as assessed by peers and subordinates (Maurer, Raju, & Collins, 1998). Equivalence was also found in a measure including 48 behaviorally oriented items designed to measure 10 dimensions of managerial performance as assessed by self, peers, supervisors, and subordinates (Facteau & Craig, 2001) and in a meta-analysis including measures of overall job performance, productivity, effort, job knowledge, quality, and leadership as rated by supervisors and peers (Viswesvaran, Schmidt, & Ones, 2002). However, lack of equivalence was found for measures of interpersonal competence, administrative competence, and compliance and acceptance of authority as assessed by supervisors and peers (Viswesvaran et al., 2002). At this point, it is not clear what may account for differential measurement equivalence across studies and constructs, and this is a fruitful avenue for future research. One possibility is that behaviorally based ratings provided for developmental purposes are more likely to be equivalent than those reflecting broader behavioral dimensions (e.g., interpersonal competence) and collected for research purposes (Facteau & Craig, 2001). One conclusion is clear, however: Measurement equivalence needs to be established before ratings can be assumed to be directly comparable. Several methods exist for this purpose, including those based on confirmatory factor analysis (CFA) and item response theory (Barr & Raju, 2003; Cheung & Rensvold, 1999, 2002; Maurer, Raju, & Collins, 1998; Vandenberg, 2002).
Once measurement equivalence has been established, we can assess the extent of agreement across raters. For this purpose, raters may use a hybrid multitrait–multirater analysis (see Figure 5.1), in which raters make evaluations only on those dimensions that they are in good position to rate (Borman, 1974) and that reflect measurement equivalence. In the hybrid analysis, within-level interrater agreement is taken as an index of convergent validity. The hybrid matrix provides an improved conceptual fit for analyzing performance ratings, and the probability of obtaining convergent and discriminant validity is probably higher for this method than for the traditional multitrait–multirater analysis.

Figure 5.1 Example of a Hybrid Matrix Analysis of Performance Ratings

Note: Level I rates only traits 1–4. Level II rates only traits 5–8.
Another approach for examining performance ratings from more than one source is based on CFA (Williams et al., 2010). CFA allows researchers to specify each performance dimension as a latent factor and assess the extent to which these factors are correlated with each other. In addition, CFA allows for an examination of the relationship between each latent factor and its measures, as provided by each source (e.g., supervisor, peer, self). One advantage of using a CFA approach to examine ratings from multiple sources is that it allows for a better understanding of source-specific method variance (i.e., the dimension-rating variance specific to a particular source).
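As an illustration of the CFA approach, the sketch below specifies two performance dimensions as latent factors, each indicated by supervisor, peer, and self ratings. It assumes the third-party semopy package for structural equation modeling in Python; the variable names, data file, and model are hypothetical, and the calls shown should be verified against that package's documentation.

    # CFA with ratings from three sources; each latent performance dimension is
    # indicated by the supervisor (sup_*), peer (peer_*), and self (self_*) rating
    # of that dimension (lavaan-style model syntax).
    import pandas as pd
    import semopy

    model_desc = """
    TaskPerformance =~ sup_task + peer_task + self_task
    ContextualPerformance =~ sup_ctx + peer_ctx + self_ctx
    """

    ratings = pd.read_csv("ratings.csv")   # hypothetical data set of ratings
    model = semopy.Model(model_desc)
    model.fit(ratings)
    print(model.inspect())                 # factor loadings and factor correlation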
Judgmental Biases in Rating
In the traditional view, judgmental biases result from some systematic measurement error on the part of a rater. As such, they are easier to deal with than errors that are unsystematic or random. However, each type of bias has been defined and measured in different ways in the literature. This may lead to diametrically opposite conclusions, even in the same study (Saal, Downey, & Lahey, 1980). In the minds of many managers, however, these behaviors are not errors at all. For example, in an organization in which a team-based culture exists, can we really say that if peers place more emphasis on contextual than task performance in evaluating others, this is an error that should be minimized or even eliminated (cf. Lievens, Conway, & De Corte, 2008)? Rather, this apparent error is really capturing an important contextual variable in this particular type of organization. With these considerations in mind, let’s consider some of the most commonly observed judgmental biases, along with ways of minimizing them.
Leniency and Severity
The use of ratings rests on the assumption that the human observer is capable of some degree of precision and some degree of objectivity (Guilford, 1954). His or her ratings are taken to mean something accurate about certain aspects of the person rated. “Objectivity” is the major hitch in these assumptions, and it is the one most often violated. Raters subscribe to their own sets of assumptions (that may or may not be valid), and most people have encountered raters who seemed either inordinately easy (lenient) or inordinately difficult (severe). Evidence also indicates that leniency is a stable response tendency across raters (Kane, Bernardin, Villanova, & Peyrefitte, 1995). Moreover, some raters are more lenient than others, even in situations where there is little or no contact between raters and ratees after the performance evaluation (Dewberry, Davies-Muir, & Newell, 2013).
Senior managers recognize that leniency is not to be taken lightly. Fully 77% of sampled Fortune 100 companies reported that lenient appraisals threaten the validity of their appraisal systems (Bretz, Milkovich, & Read, 1990). An important cause for lenient ratings is the perceived purpose served by the performance management system in place. A meta-analysis that included 22 studies and a total sample size of more than 57,000 individuals concluded that when ratings are to be used for administrative purposes, scores are one third of a standard deviation larger than those obtained when the main purpose is research (e.g., validation study) or employee development (Jawahar & Williams, 1997). This difference is even larger when ratings are made in field settings (as opposed to lab settings), provided by practicing managers (as opposed to students), and provided for subordinates (as opposed to superiors). In other words, ratings tend to be more lenient when they have real consequences in actual work environments.
Leniency and severity biases can be controlled or eliminated in several ways: (a) by allocating ratings into a forced distribution, in which ratees are apportioned according to an underlying distribution (e.g., 20% of As, 70% of Bs, and 10% of Cs); (b) by requiring supervisors to rank order their subordinates; (c) by encouraging raters to provide feedback on a regular basis, thereby reducing rater and ratee discomfort with the process; and (d) by increasing raters’ motivation to be accurate by holding them accountable for their ratings. For example, firms such as IBM, Pratt & Whitney, and Grumman implemented forced distributions because the extreme leniency in their ratings-based appraisal data hindered their ability to implement downsizing based on merit (Kane & Kane, 1993). Forced-distribution systems have their own disadvantages, however, as we describe later in this chapter.
Central Tendency
When political considerations predominate, raters may assign all their subordinates ratings that are neither too good nor too bad. They avoid using the high and low extremes of rating scales and tend to cluster all ratings about the center of all scales. “Everybody is average” is one way of expressing the central tendency bias. The unfortunate consequence, as with leniency or severity biases, is that most of the value of systematic performance appraisal is lost. The ratings fail to discriminate either within people over time or between people, and the ratings become virtually useless as managerial decision-making aids, as predictors, as criteria, or as a means of giving feedback.
Central tendency biases can be minimized by specifying clearly what the various anchors mean. In addition, raters must be convinced of the value and potential uses of merit ratings if they are to provide meaningful information.
Halo
Halo is perhaps the most actively researched bias in performance appraisal. A rater who is subject to the halo bias assigns ratings on the basis of a general impression of the ratee. An individual is rated either high or low on specific factors because of the rater’s general impression (good–poor) of the ratee’s overall performance (Lance, LaPointe, & Stewart, 1994). According to this theory, the rater fails to distinguish among levels of performance on different performance dimensions. Ratings subject to the halo bias show spuriously high positive intercorrelations (Cooper, 1981).
Two critical reviews of research in this area (Balzer & Sulsky, 1992; Murphy, Jako, & Anhalt, 1993) led to the following conclusions: (a) Halo is not as common as believed; (b) the presence of halo does not necessarily detract from the quality of ratings (i.e., halo measures are not strongly interrelated, and they are not related to measures of rating validity or accuracy); (c) it is impossible to separate true from illusory halo in most field settings; and (d) although halo may be a poor measure of rating quality, it may or may not be an important measure of the rating process. So, contrary to assumptions that have guided halo research since the 1920s, it is often difficult to determine whether halo has occurred, why it has occurred (whether it is due to the rater or to contextual factors unrelated to the rater’s judgment), or what to do about it. To address this problem, Solomonson and Lance (1997) designed a study in which true halo was manipulated as part of an experiment, and, in this way, they were able to examine the relationship between true halo and rater error halo. Results indicated that the effects of rater error halo were homogeneous across a number of distinct performance dimensions, although true halo varied widely. In other words, true halo and rater error halo are, in fact, independent. Therefore, the fact that performance dimensions are sometimes intercorrelated may not mean that there is rater bias but, rather, that there is a common, underlying general performance factor. Further research is needed to explore this potential generalized performance dimension.
As we noted earlier, judgmental biases may stem from a number of factors. One factor that has received considerable attention over the years has been the type of rating scale used. Each type attempts to reduce bias in some way. Although no single method is free of flaws, each has its own particular strengths and weaknesses. In the following section, we examine some of the most popular methods of evaluating individual job performance.
Types of Performance Measures
Objective Measures
Related to our discussion of performance as behaviors or results in Chapter 4, performance measures may be classified into two general types: objective and subjective. Objective performance measures include production data (dollar volume of sales, units produced, number of errors, amount of scrap) and employment data (accidents, turnover, absences, tardiness). Objective measures are usually, but not always, related to results. Although these variables are tied directly to the goals of the organization, they are often influenced by factors outside the employee’s control. For example, dollar volume of sales is affected by numerous factors beyond a particular salesperson’s control: territory location, number of accounts in the territory, nature of the competition, distances between accounts, price and quality of the product, and so forth. This is why general cognitive ability scores predict ratings of sales performance quite well (r = .40) but not objective sales performance (r = .04) (Vinchur, Schippmann, Switzer, & Roth, 1998).
Although objective measures of performance are intuitively attractive, they carry theoretical and practical limitations. Still, because objective and subjective measures are far from perfectly correlated (r = .39; Bommer, Johnson, Rich, Podsakoff, & Mackenzie, 1995), objective measures can offer useful information that subjective ratings do not capture.
Subjective Measures
The disadvantages of objective measures have led researchers and managers to place major emphasis on subjective measures of job performance, which depend on human judgment. Hence, they are prone to the kinds of biases that we discuss in Chapter 6. To be useful, they must be based on a careful analysis of the behaviors viewed as necessary and important for effective job performance.
There is enormous variation in the types of subjective performance measures used by organizations. Some organizations use a long list of elaborate rating scales, others use only a few simple scales, and still others require managers to write a paragraph or two concerning the performance of each of their subordinates. In addition, subjective measures of performance may be relative (in which comparisons are made among a group of ratees) or absolute (in which a ratee is described without reference to others). In the next section, we briefly describe alternative formats.

Rating Systems: Relative and Absolute
We can classify rating systems into two types: relative and absolute. Within this taxonomy, the following methods may be distinguished:

Relative (Employee Comparison)

Absolute

Rank ordering
Paired comparisons
Forced distribution

Essays
Behavioral checklists
Critical incidents
Graphic rating scales
Behaviorally anchored rating scales

Results of an experiment in which undergraduate students rated the videotaped performance of a lecturer suggest that no advantages are associated with the absolute methods (Wagner & Goffin, 1997). By contrast, relative ratings based on various rating dimensions (as opposed to a traditional global performance dimension) seem to be more accurate with respect to differential accuracy (i.e., accuracy in discriminating among ratees within each performance dimension) and stereotype accuracy (i.e., accuracy in discriminating among performance dimensions averaging across ratees). Because the affective, social, and political factors that influence performance management systems in organizations were absent from this laboratory experiment, however, these results should be viewed with caution. Moreover, a more recent study involving two separate samples found that absolute formats are perceived as fairer than relative formats (Roch, Sternburgh, & Caputo, 2007).
Because both relative and absolute methods are used pervasively in organizations, next we discuss each of these two types of rating systems in detail.
Relative Rating Systems (Employee Comparisons)
Employee comparison methods are easy to explain and are helpful in making employment decisions. They also provide useful criterion data in validation studies, for they effectively control leniency, severity, and central tendency bias. Like other systems, however, they suffer from several weaknesses that should be recognized.
Employees usually are compared only in terms of a single overall suitability category. The rankings, therefore, lack behavioral specificity and may be subject to legal challenge. In addition, employee comparisons yield only ordinal data—data that give no indication of the relative distance between individuals. Moreover, it is often impossible to compare rankings across work groups, departments, or locations. The last two problems can be alleviated, however, by converting the ranks to normalized standard scores that form an approximately normal distribution. An additional problem is related to reliability. Specifically, when asked to rerank all individuals at a later date, the extreme high or low rankings probably will remain stable, but the rankings in the middle of the scale may shift around considerably.
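The conversion of ranks to normalized standard scores mentioned above can be accomplished by transforming each rank to a percentile and then applying the inverse normal transformation. A minimal sketch follows; the group size and ranks are hypothetical.

    # Convert within-group ranks (1 = best) to normalized standard scores so that
    # rankings from groups of different sizes share a common scale.
    from scipy.stats import norm

    def ranks_to_z(ranks):
        n = len(ranks)
        # Midpoint percentile convention: rank r becomes the (r - 0.5)/n percentile.
        return [norm.ppf(1 - (rank - 0.5) / n) for rank in ranks]

    # Hypothetical five-person work group ranked from best (1) to worst (5).
    print([round(z, 2) for z in ranks_to_z([1, 2, 3, 4, 5])])
    # [1.28, 0.52, 0.0, -0.52, -1.28]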
Rank Ordering

Simple ranking requires only that a rater order all ratees from highest to lowest, from “best” employee to “worst” employee. Alternation ranking requires that the rater initially list all ratees on a sheet of paper. From this list, the rater first chooses the best ratee (#1), then the worst ratee (#n), then the second best (#2), then the second worst (#n−1), and so forth, alternating from the top to the bottom of the list until all ratees have been ranked.
Paired Comparisons
Both simple ranking and alternation ranking implicitly require a rater to compare each ratee with every other ratee, but systematic ratee-to-ratee comparison is not a built-in feature of these methods. For this, we need paired comparisons. The number of pairs of ratees to be compared may be calculated from the formula [n(n−1)]/2. Hence, if 10 individuals were being compared, [10(9)]/2 or 45 comparisons would be required. The rater’s task is simply to choose the better of each pair, and each individual’s rank is determined by counting the number of times he or she was rated superior.
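A short sketch of the paired-comparison procedure appears below; the ratee names and the rule used to simulate the rater's choices are hypothetical.

    # Paired comparisons: each of the n(n - 1)/2 pairs is judged, and a ratee's
    # rank is determined by the number of comparisons he or she wins.
    from itertools import combinations

    def paired_comparison_order(ratees, choose_better):
        wins = {ratee: 0 for ratee in ratees}
        for a, b in combinations(ratees, 2):     # n(n - 1)/2 pairs
            wins[choose_better(a, b)] += 1
        return sorted(ratees, key=lambda ratee: wins[ratee], reverse=True)

    # Hypothetical example: the rater's judgments are simulated from made-up scores.
    scores = {"Avery": 3, "Blake": 5, "Casey": 4, "Drew": 2, "Evan": 1}
    print(paired_comparison_order(list(scores), lambda a, b: a if scores[a] > scores[b] else b))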
Forced Distribution
In forced-distribution systems, raters must distribute a predetermined percentage of employees into categories based on their performance relative to other employees. This type of system results in a clear differentiation among groups of employees and became famous after legendary GE CEO Jack Welch implemented what he labeled the “vitality curve,” in which supervisors identified the “top 20%,” “vital 70%,” and “bottom 10%” of performers within each unit. A recent literature review of the effects of forced-distribution systems concluded that they are particularly beneficial for jobs that are very autonomous (i.e., employees perform their duties without much interdependence) (Moon, Scullen, & Latham, 2016). However, the risks of forced-distribution systems outweigh their benefits for jobs that involve task interdependence and social support from others. Overall, Moon et al. (2016) recommended using forced-distribution systems to rate a limited subset of activities—those that involve independent work effort and those that can be measured using objective performance measures.
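The sketch below illustrates the mechanics of a forced distribution, using the 20/70/10 split described above; the employee list and category labels are hypothetical.

    # Forced distribution: allocate ratees, already ranked from best to worst,
    # into categories in predetermined proportions (here 20%, 70%, and 10%).
    def forced_distribution(ranked_ratees, proportions=(("top", 0.20), ("vital", 0.70), ("bottom", 0.10))):
        allocation, start = {}, 0
        n = len(ranked_ratees)
        for label, proportion in proportions:
            end = start + round(n * proportion)
            for ratee in ranked_ratees[start:end]:
                allocation[ratee] = label
            start = end
        for ratee in ranked_ratees[start:]:      # anyone left over by rounding
            allocation[ratee] = proportions[-1][0]
        return allocation

    # Hypothetical example: ten employees ranked 1 (best) through 10 (worst).
    print(forced_distribution([f"employee_{i}" for i in range(1, 11)]))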
Absolute Rating Systems
Absolute rating systems enable a rater to describe a ratee without making direct reference to other ratees.
Essays
Perhaps the simplest absolute rating system is the narrative essay, in which the rater is asked to describe, in writing, an individual’s strengths, weaknesses, and potential and to make suggestions for improvement. The assumption underlying this approach is that a candid statement from a rater who is knowledgeable of a ratee’s performance is just as valid as more formal and more complicated appraisal methods.
The major advantage of narrative essays (when they are done well) is that they can provide detailed feedback to ratees regarding their performance. Drawbacks are that essays are almost totally unstructured, and they vary widely in length and content. Comparisons across individuals, groups, or departments are virtually impossible, since different essays touch on different aspects of ratee performance or personal qualifications. Finally, essays provide only qualitative information; yet, for the appraisals to serve as criteria or to be compared objectively and ranked for the purpose of an employment decision, some form of rating that can be quantified is essential. Behavioral checklists provide one such scheme.
Behavioral Checklists
When using a behavioral checklist, the rater is provided with a series of descriptive statements of job-related behavior. His or her task is simply to indicate (“check”) statements that describe the ratee in question. In this approach, raters are not so much evaluators as they are reporters of job behavior. Moreover, ratings that are descriptive are likely to be higher in reliability than ratings that are evaluative (Stockford & Bissell, 1949), and checklists reduce the cognitive demands placed on raters by giving useful structure to their information processing (Hennessy, Mabey, & Warr, 1998).
To be sure, some job behaviors are more desirable than others; checklist items can, therefore, be scaled by using attitude-scale construction methods. In one such method, the Likert method of summated ratings, a declarative statement (e.g., “she follows through on her sales”) is followed by several response categories, such as “always,” “very often,” “fairly often,” “occasionally,” and “never.” The rater simply checks the response category he or she feels best describes the ratee. Each response category is weighted—for example, from 5 (“always”) to 1 (“never”) if the statement describes desirable behavior—or vice versa if the statement describes undesirable behavior. An overall numerical rating for each individual then can be derived by summing the weights of the responses that were checked for each item, and scores for each performance dimension can be obtained by using item analysis procedures (cf. Anastasi, 1988).
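A minimal sketch of summated-ratings scoring follows; the category weights mirror the example above, and the specific items and responses are hypothetical.

    # Summated ratings: desirable items are weighted 5 ("always") to 1 ("never"),
    # undesirable items are reverse scored, and the weights are summed.
    WEIGHTS = {"always": 5, "very often": 4, "fairly often": 3, "occasionally": 2, "never": 1}

    def summated_score(responses, reverse_scored=()):
        total = 0
        for item, answer in responses.items():
            weight = WEIGHTS[answer]
            total += (6 - weight) if item in reverse_scored else weight
        return total

    # Hypothetical responses for one ratee; the last item describes undesirable behavior.
    responses = {
        "follows through on her sales": "very often",
        "submits accurate expense reports": "always",
        "misses scheduled client calls": "occasionally",
    }
    print(summated_score(responses, reverse_scored={"misses scheduled client calls"}))  # 13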
The selection of response categories for summated rating scales often is made arbitrarily, with equal intervals between scale points simply assumed. Scaled lists of adverbial modifiers of frequency and amount are available, however, together with statistically optimal four- to nine-point scales (Bass, Cascio, & O’Connor, 1974). Scaled values also are available for categories of agreement, evaluation, and frequency (Spector, 1976).
Checklists are easy to use and understand, but it is sometimes difficult for a rater to give diagnostic feedback based on checklist ratings, for they are not cast in terms of specific behaviors. On balance, however, the many advantages of checklists probably account for their widespread popularity in organizations today.
Forced-Choice System
A special type of behavioral checklist is known as the forced-choice system—a technique developed specifically to reduce leniency errors and establish objective standards of comparison between individuals (Sisson, 1948). To accomplish this, checklist statements are arranged in groups, from which the rater chooses statements that are most or least descriptive of the ratee. An overall rating (score) for each individual is then derived by applying a special scoring key to the rater descriptions.
Forced-choice scales are constructed according to two statistical properties of the checklist items: (1) discriminability, a measure of the degree to which an item differentiates effective from ineffective workers, and (2) preference, an index of the degree to which the quality expressed in an item is valued by (i.e., is socially desirable to) people. The rationale of the forced-choice system requires that items be paired so that they appear equally attractive (socially desirable) to the rater. Theoretically, then, the selection of any single item in a pair should be based solely on the item’s discriminating power, not on its social desirability.
As an example, consider the following pair of items:
1. Separates opinion from fact in written reports.
2. Includes only relevant information in written reports.
Both statements are approximately equal in preference value, but only item 1 was found to discriminate effective from ineffective performers in a police department. This is the defining characteristic of the forced-choice technique: Not all equally attractive behavioral statements are equally valid.
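Scoring a forced-choice form amounts to counting how many of the statements a rater selected are keyed as discriminating items. A minimal sketch, with a hypothetical scoring key:

    # Forced-choice scoring: only items previously shown to discriminate effective
    # from ineffective performers are keyed; the score is the number of keyed
    # statements the rater selected.
    def forced_choice_score(selected_statements, keyed_statements):
        return sum(1 for statement in selected_statements if statement in keyed_statements)

    # Hypothetical key: item 1 of the pair above discriminates, item 2 does not.
    keyed = {"separates opinion from fact in written reports"}
    selected = {"separates opinion from fact in written reports",
                "includes only relevant information in written reports"}
    print(forced_choice_score(selected, keyed))  # 1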
The main advantage claimed for forced-choice scales is that a rater cannot distort a person’s ratings higher or lower than is warranted, since he or she has no way of knowing which statements to check in order to do so. Hence, leniency should theoretically be reduced. Their major disadvantage is rater resistance. Since control is removed from the rater, he or she cannot be sure just how the subordinate was rated. Finally, forced-choice forms are of little use (and may even have a negative effect) in performance appraisal interviews, for the rater is unaware of the scale values of the items he or she chooses. Since rater cooperation and acceptability are crucial determinants of the success of any performance management system, forced-choice systems tend to be unpopular choices in many organizations.
Critical Incidents
This performance measurement method has generated a great deal of interest and several variations of the basic idea are currently in use. As described by Flanagan (1954a), the critical requirements of a job are those behaviors that make a crucial difference between doing a job effectively and doing it ineffectively. Critical incidents are simply reports by knowledgeable observers of things employees did that were especially effective or ineffective in accomplishing parts of their jobs. Supervisors record critical incidents for each employee as they occur. Thus, they provide a behaviorally based starting point for appraising performance. For example, in observing a police officer chasing an armed robbery suspect down a busy street, a supervisor recorded the following:
June 22, officer Mitchell withheld fire in a situation calling for the use of weapons where gunfire would endanger innocent bystanders.
These little anecdotes force attention on the situational determinants of job behavior and on ways of doing a job successfully that may be unique to the person described. The critical incidents method looks like a natural for performance management interviews because supervisors can focus on actual job behavior rather than on vaguely defined traits. Ratees receive meaningful feedback to which they can relate in a direct and concrete manner, and they can see what changes in their job behavior will be necessary in order for them to improve. In addition, when a large number of critical incidents are collected, abstracted, and categorized, they can provide a rich storehouse of information about job and organizational problems in general and are particularly well suited for establishing objectives for training programs (Flanagan & Burns, 1955).
As with other approaches to performance appraisal, the critical incidents method also has drawbacks. First, it is time consuming and burdensome for supervisors to record incidents for all of their subordinates on a daily or even weekly basis. Feedback may, therefore, be delayed. Nevertheless, incidents recorded in diaries allow raters to impose organization on unorganized information (DeNisi, Robbins, & Cafferty, 1989). Second, in their narrative form, incidents do not readily lend themselves to quantification, which, as we noted earlier, poses problems in between-individual and between-group comparisons, as well as in statistical analyses. For these reasons, a modification has been the development of behaviorally anchored rating scales, an approach we consider shortly.
Graphic Rating Scales
Probably the most widely used method of performance rating is the graphic rating scale, examples of which are presented in Figure 5.2. In terms of the amount of structure provided, the scales differ in three ways: (1) the degree to which the meaning of the response categories is defined, (2) the degree to which the individual who is interpreting the ratings (e.g., an HR manager or researcher) can tell clearly what response was intended, and (3) the degree to which the performance dimension being rated is defined for the rater.

Figure 5.2 Examples of Graphic Rating Scales
On a graphic rating scale, each point is defined on a continuum. Hence, to make meaningful distinctions in performance within dimensions, scale points must be defined unambiguously for the rater. This process is called anchoring. Scale (a) uses qualitative end anchors only. Scales (b) and (e) include numerical and verbal anchors, while scales (c), (d), and (f) use verbal anchors only. These anchors are almost worthless, however, since what constitutes high and low quality or “outstanding” and “unsatisfactory” is left completely up to the rater. A “commendable” for one rater may be only a “competent” for another. Scale (e) is better, for the numerical anchors are described in terms of what “quality” means in that context.
The scales also differ in terms of the relative ease with which a person interpreting the ratings can tell exactly what response was intended by the rater. In scale (a), for example, the particular value that the rater had in mind is a mystery. Scale (e) is less ambiguous in this respect.
Finally, the scales differ in terms of the clarity of the definition of the performance dimension in question. In terms of Figure 5.2, what does quality mean? Is quality for a nurse the same as quality for a cashier? Scales (a) and (c) offer almost no help in defining quality, scale (b) combines quantity and quality together into a single dimension (although typically they are independent), and scales (d) and (e) define quality in different terms altogether (thoroughness, dependability, and neatness versus accuracy, effectiveness, and freedom from error). Scale (f) is an improvement in the sense that, although quality is taken to represent accuracy, effectiveness, initiative, and neatness (a combination of scale (d) and (e) definitions), at least separate ratings are required for each aspect of quality.
Graphic rating scales may not yield the depth of information that narrative essays or critical incidents do, but they (a) are less time consuming to develop and administer, (b) permit quantitative results to be determined, (c) promote consideration of more than one performance dimension, and (d) are standardized and, therefore, comparable across individuals. A drawback is that graphic rating scales give maximum control to the rater, thereby exercising no control over leniency, severity, central tendency, or halo. For this reason, they have been criticized. However, when simple graphic rating scales have been compared against more sophisticated forced-choice ratings, the graphic scales consistently proved just as reliable and valid (King, Hunter, & Schmidt, 1980) and were more acceptable to raters (Bernardin & Beatty, 1991).
Behaviorally Anchored Rating Scales
How can graphic rating scales be improved? According to Smith and Kendall (1963):

Better ratings can be obtained, in our opinion, not by trying to trick the rater (as in forced-choice scales) but by helping him to rate. We should ask him questions which he can honestly answer about behaviors which he can observe. We should reassure him that his answers will not be misinterpreted, and we should provide a basis by which he and others can check his answers. (p. 151)
Their procedure is as follows. At an initial conference, a group of workers and/or supervisors attempts to identify and define all of the important dimensions of effective performance for a particular job. A second group then generates, for each dimension, critical incidents illustrating effective, average, and ineffective performance. A third group is then given a list of dimensions and their definitions, along with a randomized list of the critical incidents generated by the second group. Their task is to sort or locate incidents into the dimensions they best represent (Hauenstein, Brown, & Sinclair, 2010).
This procedure is known as retranslation, since it resembles the quality control check used to ensure the adequacy of translations from one language into another. Material is translated into a foreign language by one translator and then retranslated back into the original by an independent translator. In the context of performance appraisal, this procedure ensures that the meanings of both the job dimensions and the behavioral incidents chosen to illustrate them are specific and clear. Incidents are eliminated if there is not clear agreement among judges (usually 60–80%) regarding the dimension to which each incident belongs. Dimensions are eliminated if incidents are not allocated to them. Conversely, dimensions may be added if many incidents are allocated to the “other” category.
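The retranslation criterion can be expressed as a simple agreement check: an incident is retained only if the share of judges assigning it to the same dimension meets a chosen threshold. In the sketch below, the judges' assignments and the 60% threshold are illustrative.

    # Retranslation check: keep an incident only if enough judges (commonly 60-80%)
    # assign it to the same performance dimension.
    from collections import Counter

    def passes_retranslation(judge_assignments, threshold=0.60):
        dimension, votes = Counter(judge_assignments).most_common(1)[0]
        agreement = votes / len(judge_assignments)
        return dimension, agreement, agreement >= threshold

    # Hypothetical example: ten judges sort one critical incident.
    assignments = ["planning"] * 8 + ["communication"] * 2
    print(passes_retranslation(assignments))  # ('planning', 0.8, True)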
Each of the items within the dimensions that survived the retranslation procedure is then presented to a fourth group of judges, whose task is to place a scale value on each incident (e.g., in terms of a seven- or nine-point scale from “highly effective behavior” to “grossly ineffective behavior”). The end product looks like that in Figure 5.3.

Figure 5.3 Scaled Expectations Rating for the Effectiveness With Which the Department Manager Supervises His or Her Sales Personnel

Source: Campbell, J. P., Dunnette, M. D., Arvey, R. D., & Hellervik, L. V. (1973). The development and evaluation of behaviorally based rating scales. Journal of Applied Psychology, 57, 15–22. Copyright 1973 by the American Psychological Association.
As you can see, behaviorally anchored rating scales (BARS) development is a long, painstaking process that may require many individuals. Moreover, separate BARS must be developed for dissimilar jobs. Nevertheless, they are used quite frequently. For example, results of a survey involving hotels showed that about 40% used BARS (Woods, Sciarini, & Breiter, 1998).
How have BARS worked in practice? An enormous amount of research on BARS has been published (e.g., Maurer, 2002). At the risk of oversimplification, major known effects of BARS are summarized in Table 5.3 (cf. Bernardin & Beatty, 1991). A perusal of this table suggests that little empirical evidence supports the superiority of BARS over other performance rating systems.

Table 5.3 Known Effects of BARS
· Participation: Participation seems to enhance the validity of ratings, but no more so for BARS than for simple graphic rating scales.
· Leniency, central tendency, halo, reliability: BARS not superior to other methods (reliabilities across dimensions in published studies range from about .52 to .76).
· External validity: Moderate (R2 values of .21 to .47; Shapira & Shirom, 1980) relative to the upper limits of validity in performance ratings (Borman, 1978; Weekley & Gier, 1989).
· Comparisons with other formats: BARS no better or worse than other methods.
· Variance in dependent variables associated with differences in rating systems: Less than 5%. Rating systems affect neither the level of ratings (Harris & Schaubroeck, 1988) nor subordinates’ satisfaction with feedback (Russell & Goode, 1988).
· Convergent/discriminant validity: Low convergent validity, extremely low discriminant validity.
· Specific content of behavioral anchors: Anchors depicting behaviors observed by raters, but unrepresentative of true performance levels, produce ratings biased in the direction of the anchors (Murphy & Constans, 1987). This is unlikely to have a major impact on ratings collected in the field (Murphy & Pardaffy, 1989).

Summary Comments on Rating Formats and Rating Process
For several million workers today, especially those in the insurance, communications, transportation, and banking industries, being monitored on the job by a computer is a fact of life (Kurtzberg, Naquin, & Belkin, 2005; Tomczak, Lanzo, & Aguinis, 2018). In most jobs, though, human judgment about individual job performance is inevitable, no matter what format is used. This is the major problem with all formats.
Unless observation of ratees is extensive and representative, it is not possible for judgments to represent a ratee’s true performance. Since the rater must often make inferences about performance, the appraisal is subject to all the biases that have been linked to rating scales. Raters are free to distort their appraisals to suit their purposes. This can undo all of the painstaking work that went into scale development and probably explains why no single rating format has been shown to be superior to others.
What can be done? Both Banks and Roberson (1985) and Härtel (1993) suggest two strategies: First, build in as much structure as possible in order to minimize the amount of discretion exercised by a rater. For example, use job analysis to specify what is really relevant to effective job performance, and use critical incidents to specify levels of performance effectiveness in terms of actual job behavior. Second, don’t require raters to make judgments that they are not competent to make; don’t tax their abilities beyond what they can do accurately. For example, for formats that require judgments of frequency, make sure that raters have had sufficient opportunity to observe ratees so that their judgments are accurate.

Factors Affecting Subjective Appraisals

As we discussed earlier, performance appraisal is a complex process that may be affected by many factors, including organizational, political, and interpersonal barriers. In fact, idiosyncratic variance (i.e., variance due to the rater) has been found to be a larger component of variance in performance ratings than the variance attributable to actual ratee performance (Greguras & Robie, 1998; Scullen, Mount, & Goff, 2000). For example, rater variance was found to be 1.21 times larger than ratee variance for supervisory ratings, 2.08 times larger for peer ratings, and 1.86 times larger for subordinate ratings (Scullen et al., 2000). In addition, raters may be motivated to inflate ratings for reasons completely unrelated to the true nature of an employee’s performance, such as the desire to avoid confrontation with subordinates, promote a problem employee out of the unit, look like a competent manager, and procure resources (Spence & Keeping, 2011). In this section, we consider individual differences in raters and in ratees (and their interaction) and how these factors affect performance ratings. Findings in each of these areas are summarized in Tables 5.4, 5.5, and 5.6. For each variable listed in the tables, an illustrative reference is provided for those who wish to find more specific information.

Table 5.4 Summary of Findings on Rater Characteristics and Performance Ratings

Personal Characteristics
· Gender: No general effect (Landy & Farr, 1980).
· Race: African American raters rate whites slightly higher than they rate African Americans. White and African American raters differ very little in their ratings of white ratees (Sackett & DuBois, 1991).
· Age: No consistent effects (Schwab & Heneman, 1978).
· Education: Statistically significant, but extremely weak effect (Cascio & Valenzi, 1977).
· Low self-confidence; increased psychological distance: More critical, negative ratings (Rothaus, Morton, & Hanson, 1965).
· Interests, social insight, intelligence: No consistent effect (Zedeck & Kafry, 1977).
· Personality characteristics: Raters high on agreeableness (r = .25), extraversion (r = .12), and emotional stability (r = .12) are more likely to provide higher ratings, and the Big Five personality traits alone account for between 6% and 22% of the variance in performance ratings (Cheng, Hui, & Cascio, 2017; Harari, Rudolph, & Laginess, 2015). Raters high on conscientiousness are more likely to give higher ratings to older workers than to younger workers (Kmicinska, Zaniboni, Truxillo, Fraccaroli, & Wang, 2016). Raters high on self-monitoring are more likely to provide more accurate ratings (Jawahar, 2001). Attitudes toward performance appraisal affect rating behavior more strongly for raters low on conscientiousness (Tziner, Murphy, & Cleveland, 2002).

Job-Related Variables
· Accountability: Raters who are accountable for their ratings provide more accurate ratings than those who are not accountable (Mero & Motowidlo, 1995).
· Job experience: Statistically significant, but weak positive effect on quality of ratings (Cascio & Valenzi, 1977).
· Performance level: Effective performers tend to produce more reliable and valid ratings (Kirchner & Reisberg, 1962).
· Leadership style: Supervisors who provide little structure to subordinates’ work activities tend to avoid formal appraisals (Fried, Tiegs, & Bellamy, 1992).
· Organizational position: (See earlier discussion of “Who Shall Rate?”)
· Rater knowledge of ratee and job: Relevance of contact to the dimensions rated is critical. Ratings are less accurate when delayed rather than immediate and when observations are based on limited data (Heneman & Wexley, 1983).
· Prior expectations and information: Disconfirmation of expectations (higher or lower than expected) lowers ratings (Hogan, 1987). Prior information may bias ratings in the short run. Over time, ratings reflect actual behavior (Hanges, Braverman, & Rentch, 1991).
· Stress: Raters under stress rely more heavily on first impressions and make fewer distinctions among performance dimensions (Srinivas & Motowidlo, 1987).

Table 5.5 Summary of Findings on Ratee Characteristics and Performance Ratings

Personal Characteristics
· Gender: Females tend to receive lower ratings than males when they make up less than 20% of a work group, but higher ratings than males when they make up more than 50% of a work group (Sackett, DuBois, & Noe, 1991). Female ratees received more accurate ratings than male ratees (Sundvik & Lindeman, 1998). Female employees in line jobs tend to receive lower performance ratings than female employees in staff jobs or men in either line or staff jobs (Lyness & Heilman, 2006).
· Race: Race of the ratee accounts for between 1% and 5% of the variance in ratings (Borman, White, Pulakos, & Oppler, 1991; Oppler, Campbell, Pulakos, & Borman, 1992).
· Age: Older subordinates were rated lower than younger subordinates (Ferris, Yates, Gilmore, & Rowland, 1985) by both African American and white raters (Crew, 1984).
· Education: No statistically significant effects (Cascio & Valenzi, 1977).
· Emotional disability: Workers with emotional disabilities received higher ratings than warranted, but such positive bias disappears when clear standards are used (Czajka & DeNisi, 1988).

Job-Related Variables
· Performance level: Actual performance level and ability have the strongest effect on ratings (Borman et al., 1991; Borman et al., 1995; Vance, Winne, & Wright, 1983). More weight is given to negative than to positive attributes of ratees (Ganzach, 1995).
· Group composition: Ratings tend to be higher for satisfactory workers in groups with a large proportion of unsatisfactory workers (Grey & Kipnis, 1976), but these findings may not generalize to all occupational groups (Ivancevich, 1983).
· Tenure: Although age and tenure are highly related, evidence indicates no relationship between ratings and either ratee tenure in general or ratee tenure working for the same supervisor (Ferris et al., 1985).
· Job satisfaction: Knowledge of a ratee’s job satisfaction may bias ratings in the same direction (+ or –) as the ratee’s satisfaction (Smither, Collins, & Buda, 1989).
· Personality characteristics: Both peers and supervisors rate dependability highly. However, obnoxiousness affects peer raters much more than supervisors (Borman et al., 1995).

Table 5.6 Summary of Findings on Interaction of Rater–Ratee Characteristics and Performance Ratings
· Gender: In the context of merit pay and promotions, female ratees receive less favorable scores with greater negative bias by raters who hold traditional stereotypes about women (Dobbins, Cardy, & Truxillo, 1988).
· Race: Both white and African American raters consistently assign lower ratings to African American ratees than to white ratees. White and African American raters differ very little in their ratings of white ratees (Oppler et al., 1992; Sackett & DuBois, 1991). Race effects may disappear when cognitive ability, education, and experience are taken into account (Waldman & Avolio, 1991).
· Actual versus perceived similarity: Actual similarity (agreement between supervisor–subordinate work-related self-descriptions) is a weak predictor of performance ratings (Wexley, Alexander, Greenawalt, & Couch, 1980), but perceived similarity is a strong predictor (Turban & Jones, 1988; Wayne & Liden, 1995).
· Performance attributions: Age and job performance are generally unrelated (McEvoy & Cascio, 1989).
· Citizenship behaviors: Dimension ratings of ratees with high levels of citizenship behaviors show high halo effects (Werner, 1994). Task performance and contextual performance interact in affecting reward decisions (Kiker & Motowidlo, 1999).
· Length of relationship: Longer relationships resulted in more accurate ratings (Sundvik & Lindeman, 1998).
· Personality characteristics: Similarity regarding conscientiousness increases ratings of contextual work behaviors, but there is no relationship for agreeableness, extraversion, neuroticism, or openness to experience (Antonioni & Park, 2001).

As the tables demonstrate, we now know a great deal about the effects of selected individual differences variables on ratings of job performance. However, there is a great deal more that we do not know. Accordingly, there is ongoing research proposing new formats and procedures (e.g., Hoffman et al., 2012). Above all, however, recognize that the process of performance appraisal, including the social and emotional context, and not just the mechanics of collecting performance data, determines the overall effectiveness of this essential component of all performance management systems (Djurdjevic & Wheeler, 2014).
Evaluating the Performance of Teams
Our discussion thus far has focused on the assessment and improvement of individual performance, that is, on employees working independently rather than in groups. However, numerous organizations are structured around teams (Hollenbeck, Beersma, & Schouten, 2012). Team-based organizations do not necessarily outperform organizations that are not structured around teams (Hackman, 1998). However, the interest in, and implementation of, team-based structures does not seem to be subsiding; on the contrary, there seems to be an increased interest in organizing how work is done around teams (Naquin & Tynan, 2003). Therefore, given the popularity of teams, it makes sense for performance management systems to target not only individual performance but also an individual’s contribution to the performance of his or her team(s), as well as the performance of teams as a whole (Aguinis, Gottfredson, & Joo, 2013a; Li, Zheng, Harris, Liu, & Kirkman, 2016).
The assessment of team performance does not imply that individual contributions should be ignored. On the contrary, if individual performance is not assessed and recognized, social loafing may occur (Scott & Einstein, 2001). Even worse, when other team members see there is a “free rider,” they are likely to withdraw their effort in support of team performance (Heneman & von Hippel, 1995). Assessing overall team performance based on team-based processes and team-based results should therefore be seen as complementary to the assessment and recognition of (a) individual performance (as we have discussed thus far), and (b) individuals’ behaviors and skills that contribute to team performance (e.g., self-management, communication, decision making, collaboration; Aguinis, Gottfredson, & Joo, 2013a).
Meta-analysis results provide evidence to support the need to assess and reward both individual and team performance because they have complementary effects (Garbers & Konradt, 2014). Specifically, the average effect size of using individual incentives on individual performance is g = 0.32 (based on 116 separate studies), and the average effect size of using team incentives on team performance is g = 0.34 (based on 30 studies). (The effect size g is similar to d, a standardized mean difference between two groups.)
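For readers unfamiliar with g, the sketch below shows the usual computation: a standardized mean difference (like d) multiplied by a small-sample correction. The group means, standard deviations, and sample sizes are hypothetical.

    # Hedges' g: the difference between two group means divided by the pooled
    # standard deviation, multiplied by a small-sample bias correction.
    def hedges_g(mean_1, mean_2, sd_1, sd_2, n_1, n_2):
        pooled_sd = (((n_1 - 1) * sd_1 ** 2 + (n_2 - 1) * sd_2 ** 2) / (n_1 + n_2 - 2)) ** 0.5
        d = (mean_1 - mean_2) / pooled_sd
        correction = 1 - 3 / (4 * (n_1 + n_2) - 9)
        return d * correction

    # Hypothetical example: incentive group versus control group performance scores.
    print(round(hedges_g(78, 74, 12, 13, 40, 40), 2))  # about 0.32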
Not all teams are created equally, however. Different types of teams require different emphases on performance measurement at the individual and team levels. Depending on the complexity of the task (from routine to nonroutine) and the membership configuration (from static to dynamic), we can identify three different types of teams (Scott & Einstein, 2001):
· Work or service teams: Intact teams engaged in routine tasks (e.g., manufacturing or service tasks)
· Project teams: Teams assembled for a specific purpose and expected to disband once their task is complete; their tasks are outside the core production or service of the organization and, therefore, less routine than those of work or service teams.
· Network teams: Teams whose membership is not constrained by time or space or limited by organizational boundaries (i.e., they are typically geographically dispersed and stay in touch via telecommunications technology); their work is extremely nonroutine.

Table 5.7 summarizes recommended measurement methods for each of the three types of teams. For example, regarding project teams, the duration of a particular project limits the utility of outcome-based assessments. Specifically, end-of-project outcome measures may not benefit the team’s development because the team is likely to disband once the project is over. Instead, measurements taken during the project can be implemented, so that corrective action can be taken if necessary before the project is over. This is what Hewlett-Packard uses with its product development teams (Scott & Einstein, 2001). Irrespective of the type of team that is evaluated, the interpersonal relationships among the team members play a central role in the resulting ratings (Greguras, Robie, Born, & Koenigs, 2007). For example, self-ratings are related to how one rates, and how one is rated by, others. Particularly for performance dimensions related to interpersonal issues, team members are likely to reciprocate the type of rating they receive.

Table 5.7 Performance Appraisal Methods for Different Types of Teams
· Work or service team: Individual team members are rated by the manager, other team members, customers, and themselves; the entire team is rated by the manager, other teams, customers, and the team itself.
· Project team: Individual team members are rated by the manager, project leaders, other team members, customers, and themselves; the entire team is rated by customers and the team itself.
· Network team: Individual team members are rated by the manager, team leaders, coworkers, other team members, customers, and themselves; the entire team is rated by customers.
The table also distinguishes, for each team type, what is rated (outcome, behavior, competency) and how the rating is used (development, evaluation, self-regulation).

Source: Republished with permission of Academy of Management, from Scott, S. G., & Einstein, W. O. (2001). Strategic performance appraisal in team-based organizations: One size does not fit all. Academy of Management Executive, 15, 111; permission conveyed through Copyright Clearance Center, Inc.
Regardless of whether performance is measured at the individual level or at the individual and team levels, raters are likely to make intentional or unintentional mistakes in assigning performance scores (Naquin & Tynan, 2003). They can be trained to minimize such biases, as the next section demonstrates.

Rater Training

The first step in the design of any training program is to specify objectives. In the context of rater training, there are three broad objectives: (1) to improve the observational skills of raters by teaching them what to attend to, (2) to reduce or eliminate judgmental biases, and (3) to improve the ability of raters to communicate performance information to ratees in an objective and constructive manner.
Traditionally, rater training has focused on teaching raters to eliminate judgmental biases such as leniency, central tendency, and halo effects (Bernardin & Buckley, 1981). This approach assumes that certain rating distributions are more desirable than others (e.g., normal distributions, variability in ratings across dimensions for a single person). Raters may learn a new response set that results in lower average ratings (less leniency) and greater variability in ratings across dimensions (less halo), but their accuracy tends to decrease (Hedge & Kavanagh, 1988; Murphy & Balzer, 1989). Note, however, that accuracy in appraisal has been defined in different ways by researchers, and relations among different operational definitions of accuracy are generally weak (Sulsky & Balzer, 1988). In addition, rater training programs that attempt to eliminate systematic errors typically have only short-term effects (Fay & Latham, 1982). Regarding unintentional errors, rater error training (RET) exposes raters to the different errors and their causes, but this awareness does not necessarily lead to the elimination of such errors (London, Mone, & Scott, 2004). Awareness is certainly a good first step, but we need to go further if we want to minimize unintentional errors. One fruitful possibility is the implementation of frame-of-reference (FOR) training.
Meta-analytic evidence has demonstrated that FOR training is effective in improving the accuracy of performance appraisals, with an average effect size of d = .50 (Roch, Woehr, Mishra, & Kieszczynska, 2012). Additional evidence suggests that other types of training in combination with FOR training do not improve rating accuracy beyond the effects of FOR training alone (Noonan & Sulsky, 2001). Following procedures developed by Pulakos (1984, 1986), FOR training proceeds as follows:
1. Participants are told that they will evaluate the performance of three ratees on three separate performance dimensions.
2. They are given rating scales and instructed to read them as the trainer reads the dimension definitions and scale anchors aloud.
3. The trainer then discusses ratee behaviors that illustrate different performance levels for each scale. The goal is to create a common performance theory (frame of reference) among raters such that they will agree on the appropriate performance dimension and effectiveness level for different behaviors.
4. Participants are shown a videotape of a practice vignette and are asked to evaluate the manager using the scales provided.
5. Ratings are then written on a blackboard and discussed by the group of participants. The trainer seeks to identify which behaviors participants used to decide on their assigned ratings and to clarify any discrepancies among the ratings.
6. The trainer provides feedback to participants, explaining why the ratee should receive a certain rating (target score) on a given dimension.
FOR training provides trainees with a “theory of performance” that allows them to understand the various performance dimensions, how to match these performance dimensions to ratee behaviors, how to judge the effectiveness of various ratee behaviors, and how to integrate these judgments into an overall rating of performance (Sulsky & Day, 1992). In addition, the provision of rating standards and behavioral examples appears to be responsible for the improvements in rating accuracy. The use of target scores in performance examples and accuracy feedback on practice ratings allows raters to learn, through direct experience, how to use the different rating standards. In essence, FOR training is a microcosm that includes an efficient model of the process by which performance-dimension standards are acquired (Stamoulis & Hauenstein, 1993).
Nevertheless, the approach described here assumes a single frame of reference for all raters. Research has shown that different sources of performance data (peers, supervisors, subordinates) demonstrate distinctly different FORs and that they disagree about the importance of poor performance incidents (Hauenstein & Foti, 1989). Therefore, training should highlight these differences and focus both on the content of the raters’ performance theories and on the process by which judgments are made (Schleicher & Day, 1998). Finally, the training process should identify idiosyncratic raters so that their performance in training can be monitored to assess improvement.
Rater training is clearly worth the effort, and the kind of approach advocated here is especially effective in improving the accuracy of ratings for individual ratees on separate performance dimensions (Day & Sulsky, 1995). In addition, trained managers are more effective in formulating development plans for subordinates (Davis & Mount, 1984). The technical and interpersonal problems associated with performance appraisal are neither insurmountable nor inscrutable; they simply require the competent and systematic application of sound psychological principles.

The Social, Emotional, and Interpersonal Context of Performance Management Systems

Throughout this chapter, we have emphasized that performance management systems encompass measurement issues, as well as attitudinal and behavioral issues. Traditionally, we have tended to focus our research efforts on measurement issues per se; yet any measurement instrument or rating format probably has only a limited impact on performance appraisal scores (Banks & Roberson, 1985). Broader issues in performance management must be addressed, since appraisal outcomes are likely to represent an interaction among organizational contextual variables, rating formats, and rater and ratee motivation.
Several studies have assessed the attitudinal implications of various types of performance management systems (e.g., Kinicki, Prussia, Wu, & McKee-Ryan, 2004). This body of literature focuses on different types of reactions, including satisfaction, fairness, perceived utility, and perceived accuracy (see Keeping & Levy, 2000, for a review of measures used to assess each type of reaction). The reactions of participants to a performance management system are important because they are linked to system acceptance and success (Björkman, Ehrnrooth, Mäkelä, Smale, & Sumelius, 2013). In addition, there is evidence regarding the existence of an overall multidimensional reaction construct (Keeping & Levy, 2000). The various types of reactions can be conceptualized as separate yet related entities.
As an example of one type of reaction, consider some of the evidence gathered regarding the perceived fairness of the system. Fairness, as conceptualized in terms of due process, includes two types of facets: (1) process facets or interactional justice—interpersonal exchanges between supervisor and employees; and (2) system facets or procedural justice—structure, procedures, and policies of the system (Findley, Giles, & Mossholder, 2000; Masterson, Lewis, Goldman, & Taylor, 2000). Results of a selective set of studies indicate the following:
· Process facets explain variance in contextual performance beyond that accounted for by system facets (Findley et al., 2000).
· Managers who have perceived unfairness in their own most recent performance evaluations are more likely to react favorably to the implementation of a procedurally just system than are those who did not perceive unfairness in their own evaluations (Taylor, Masterson, Renard, & Tracy, 1998).
· Appraisers are more likely to engage in interactionally fair behavior when interacting with an assertive appraisee than with an unassertive appraisee (Korsgaard, Roberson, & Rymph, 1998).
This kind of knowledge illustrates the importance of the social and motivational aspects of performance management systems (Fletcher, 2001). In implementing a system, this type of information is no less important than the knowledge that a new system results in, for example, less halo, leniency, and central tendency. Both types of information are meaningful and useful; both must be considered in the wider context of performance management. In support of this view, a review of 295 U.S. circuit court decisions rendered from 1980 to 1995 regarding performance appraisal concluded that issues relevant to fairness and due process were most salient in making the judicial decisions (Werner & Bolino, 1997).
Finally, to reinforce the view that context must be taken into account and that performance management must be tackled as both a technical and an interpersonal issue, Aguinis and Pierce (2008) offered the following recommendations regarding issues that should be explored further:
· Social power, influence, and leadership: A supervisor’s social power refers to his or her ability, as perceived by others, to influence behaviors and outcomes (Farmer & Aguinis, 2005). If an employee believes that his or her supervisor has the ability to influence important tangible and intangible outcomes (e.g., financial rewards, recognition), then the performance management system is likely to be more meaningful. Thus, future research could attempt to identify the conditions under which supervisors are likely to be perceived as more powerful and the impact of these power perceptions on the meaningfulness and effectiveness of performance management systems.
· Trust: The “collective trust” of all stakeholders in the performance management process is crucial for the system to be effective (Farr & Jacobs, 2006). Given the current business reality of downsizing and restructuring efforts, how can trust be created so that organizations can implement successful performance management systems? Stated differently, future research could attempt to understand conditions under which dyadic, group, and organizational factors are likely to enhance trust and, consequently, enhance the effectiveness of performance management systems.
· Social exchange: The relationship between individuals (and groups) and organizations can be conceptualized within a social exchange framework. Specifically, individuals and groups display behaviors and produce results that are valued by the organization, which in turn provides tangible and intangible outcomes in exchange for those behaviors and results. Thus, future research using a social exchange framework could inform the design of performance management systems by providing a better understanding of the perceived fairness of various types of exchange relationships and the conditions under which the same types of relationships are likely to be perceived as being more or less fair.
· Group dynamics and close interpersonal relationships: It is virtually impossible to think of an organization that does not organize its functions, at least in part, based on teams. Consequently, many organizations include a team component in their performance management system (Aguinis, 2019). Such systems usually target individual performance and also an individual’s contribution to the performance of his or her team(s) and the performance of teams as a whole. Within the context of such performance management systems, future research could investigate how group dynamics affect who measures performance and how performance is measured. Future research could also attempt to understand how close personal relationships such as supervisor–subordinate workplace romances (Pierce, Aguinis, & Adams, 2000; Pierce, Broberg, McClure, & Aguinis, 2004), which often involve conflicts of interest, may affect the successful implementation of performance management systems.
Performance Feedback: Appraisal and Goal-Setting Interviews
One of the central purposes of performance management systems is to serve as a personal development tool. To improve, there must be some feedback regarding present performance. However, the mere presence of performance feedback does not guarantee a positive effect on future performance. In fact, a meta-analysis of 131 studies showed that, overall, feedback has a positive effect on performance (less than one half of one standard deviation improvement in performance), but 38% of the feedback interventions reviewed had a negative effect on performance (Kluger & DeNisi, 1996). Thus, in many cases, feedback does not have a positive effect; in fact, it can have a harmful effect on future performance. For instance, if feedback results in an employee’s focusing attention on himself or herself instead of the task at hand, then feedback is likely to have a negative effect. Consider the example of a woman who has made many personal sacrifices to reach the top echelons of her organization’s hierarchy. She might be devastated to learn she has failed to keep a valued client and then may begin to question her life choices instead of focusing on how not to lose valued clients in the future (DeNisi & Kluger, 2000).
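To connect these verbal summaries to the statistic behind them: both the d = .50 reported earlier for FOR training and the feedback effect of less than one half of one standard deviation are standardized mean differences (Cohen's d). In its most basic form (meta-analyses use several computational variants, so this is only the generic definition, not the exact formula used in either study):

d = (M_treatment − M_comparison) / SD_pooled

where M_treatment and M_comparison are the mean outcome scores (e.g., rating accuracy or post-feedback performance) of the two groups and SD_pooled is their pooled standard deviation. A d of .50 therefore indicates that the average member of the treatment group scores half a pooled standard deviation above the average member of the comparison group; values below .50 indicate a smaller advantage.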
Although performance information may be gathered from several sources, responsibility for communicating such feedback from multiple sources by means of an appraisal interview often rests with the immediate supervisor (Ghorpade & Chen, 1995). A formal system for giving feedback should be implemented because, in the absence of such a system, some employees are more likely than others to seek and benefit from feedback. For example, consider the relationship between stereotype threat (i.e., a fear of confirming a negative stereotype about one’s group through one’s own behavior; Farr, 2003) and the willingness to seek feedback. A study of 166 African American managers in the utilities industry found that being the only African American in the workplace was related to stereotype threat and that stereotype threat was negatively related to feedback seeking (Roberson, Deitch, Brief, & Block, 2003). Thus, if no formal performance feedback system is in place, employees who do not perceive a stereotype threat will be more likely to seek feedback from their supervisors and benefit from it. This, combined with the fact that people generally are apprehensive about both receiving and giving performance information, reinforces the notion that the implementation of formal job feedback systems is necessary (London, 2003).
Ideally, a continuous feedback process should exist between superior and subordinate so that both may be guided. This can be facilitated by the fact that in many organizations electronic performance monitoring (EPM) is common practice (e.g., number or duration of phone calls with clients, duration of log-in time). EPM is qualitatively different from more traditional methods of collecting performance data (e.g., direct observation) because it can occur continuously and produces voluminous data on multiple performance dimensions (Tomczak et al., 2018). However, the availability of data resulting from EPM, often stored online and easily retrievable by the employees, does not diminish the need for face-to-face interaction with the supervisor, who is responsible not only for providing the information but also for interpreting it and helping guide future performance. In practice, however, supervisors frequently “save up” performance-related information for a formal appraisal interview, the conduct of which is an extremely trying experience for both parties. Most supervisors resist “playing God” (playing the role of judge) and then communicating their judgments to subordinates (McGregor, 1957). Hence, supervisors may avoid confronting uncomfortable issues, but even if they do, subordinates may only deny or rationalize them in an effort to maintain self-esteem (Larson, 1989). Thus, the process is self-defeating for both groups. Fortunately, this need not always be the case. Based on findings from appraisal interview research, Table 5.8 presents several activities that supervisors should engage in before, during, and after appraisal interviews. Let’s briefly consider each of them.
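To make the point about data volume concrete, the sketch below shows one way continuously collected EPM events might be condensed into per-dimension summaries that a supervisor could review before interpreting them with the employee. It is a minimal illustration only; the record fields (employee_id, dimension, value) and the summary statistics are assumptions made for this example, not features of any particular EPM system discussed here.

from collections import defaultdict
from statistics import mean

# Hypothetical EPM event log; each record is one monitored event.
# Field names and values are illustrative assumptions only.
epm_log = [
    {"employee_id": "E042", "dimension": "call_duration_min", "value": 7.5},
    {"employee_id": "E042", "dimension": "call_duration_min", "value": 12.0},
    {"employee_id": "E042", "dimension": "calls_handled", "value": 1},
    {"employee_id": "E042", "dimension": "calls_handled", "value": 1},
]

def summarize_epm(records, employee_id):
    # Condense raw, continuously collected events into per-dimension
    # summaries (event count, mean, total) for a feedback conversation.
    by_dimension = defaultdict(list)
    for rec in records:
        if rec["employee_id"] == employee_id:
            by_dimension[rec["dimension"]].append(rec["value"])
    return {
        dim: {"n_events": len(values), "mean": mean(values), "total": sum(values)}
        for dim, values in by_dimension.items()
    }

print(summarize_epm(epm_log, "E042"))

A summary of this kind does not replace the face-to-face discussion described above; it simply gives both parties a shared, dimension-level starting point for that conversation.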

Table 5.8 Supervisory Activities Before, During, and After the Appraisal Interview

Before
· Communicate frequently with subordinates about their performance
· Get training in performance appraisal
· Judge your own performance first before judging others
· Encourage subordinates to prepare for appraisal interviews
· Be exposed to priming information to help retrieve information from memory

During
· Warm up and encourage subordinate participation
· Judge performance, not personality, mannerisms, or self-concept
· Be specific
· Be an active listener
· Avoid destructive criticism and threats to the employee’s ego
· Set mutually agreeable and formal goals for future improvement

After
· Communicate frequently with subordinates about their performance
· Periodically assess progress toward goals
· Make organizational rewards contingent on performance

Communicate Frequently
Two of the clearest results from research on the appraisal interview are that once-a-year performance appraisals are of questionable value and that coaching should be done much more frequently—particularly for poor performers and with new employees (Cederblom, 1982; Meyer, 1991). Feedback has maximum impact when it is given as close as possible to the action. If a subordinate behaves effectively, tell him or her immediately; if he or she behaves ineffectively, also tell him or her immediately. Do not file these incidents away so that they can be discussed in six to nine months.
Get Training in Appraisal
As we noted earlier, increased emphasis should be placed on training raters to observe behavior more accurately and fairly rather than on providing specific illustrations of “how to” or “how not to” rate. Training managers on how to provide evaluative information and to give feedback should focus on characteristics that are difficult to rate and on characteristics that people think are easy to rate, but that generally result in disagreements. Such factors include risk taking and development (Wohlers & London, 1989).
Judge Your Own Performance First
We often use ourselves as the norm or standard by which to judge others. Although this tendency may be difficult to overcome, research findings in the area of interpersonal perception can help us improve the process (Kraiger & Aguinis, 2001). A selective list of such findings includes the following:
· Self-protection mechanisms like denial, giving up, self-promotion, and fear of failure have a negative influence on self-awareness.
· Knowing oneself makes it easier to see others accurately and is itself a managerial ability.
· One’s own characteristics affect the characteristics one is likely to see in others.
· The person who accepts himself or herself is more likely to be able to see favorable aspects of other people.
· Accuracy in perceiving others is not a single skill (Wohlers & London, 1989; Zalkind & Costello, 1962).
Encourage Subordinate Preparation
Research conducted in a large Midwestern hospital indicated that the more time employees spent prior to appraisal interviews analyzing their job duties and responsibilities, the problems being encountered on the job, and the quality of their performance, the more likely they were to be satisfied with the appraisal process, to be motivated to improve their own performance, and actually to improve their performance (Burke, Weitzel, & Weir, 1978). To foster such preparation, (a) a BARS form could be developed for this purpose, and subordinates could be encouraged or required to use it (Silverman & Wexley, 1984); (b) employees could be provided with the supervisor’s review prior to the appraisal interview and encouraged to react to it in specific terms; and (c) employees could be encouraged or required to appraise their own performance on the same criteria or forms their supervisor uses (Farh, Werbel, & Bedeian, 1988).
Self-review has at least four advantages: (1) It enhances the subordinate’s dignity and self-respect; (2) it places the manager in the role of counselor, not judge; (3) it is more likely to promote employee commitment to plans or goals formulated during the discussion; and (4) it is likely to be more satisfying and productive for both parties than is the more traditional manager-to-subordinate review (Meyer, 1991).
Use “Priming” Information
A prime is a stimulus given to the rater to trigger information stored in long-term memory. There are numerous ways to help a rater retrieve information about a ratee’s performance from memory before the performance-feedback session. For example, an examination of documentation regarding each performance dimension and behaviors associated with each dimension can help improve the effectiveness of the feedback session (cf. Jelley & Goffin, 2001).
Warm Up and Encourage Participation
Research shows generally that the more a subordinate feels he or she participated in the interview by presenting his or her own ideas and feelings, the more likely the subordinate is to feel that the supervisor was helpful and constructive, that some current job problems were cleared up, and that future goals were set. However, these conclusions are true only as long as the appraisal interview represents a low threat to the subordinate, he or she previously has received an appraisal interview from the superior, he or she is accustomed to participating with the superior, and he or she is knowledgeable about issues to be discussed in the interview (Cederblom, 1982).
Judge Performance, Not Personality or Self-Concept
The more a supervisor focuses on the personality and mannerisms of his or her subordinate rather than on aspects of job-related behavior, the lower the satisfaction of both supervisor and subordinate and the less likely the subordinate is to be motivated to improve his or her performance (Burke et al., 1978). Also, an emphasis on the employee as a person or on his or her self-concept, as opposed to the task and task performance only, is likely to lead to lower levels of future performance (DeNisi & Kluger, 2000).
Be Specific
Appraisal interviews are more likely to be successful to the extent that supervisors are perceived as constructive and helpful (Russell & Goode, 1988). By being candid and specific, the supervisor offers clear feedback to the subordinate concerning past actions. He or she also demonstrates knowledge of the subordinate’s level of performance and job duties. One can be specific about positive as well as negative behaviors on a job. Data show that the acceptance and perception of accuracy of feedback by a subordinate are strongly affected by the order in which positive or negative information is presented. Begin the appraisal interview with positive feedback associated with minor issues, and then proceed to discuss feedback regarding major issues. Praise concerning minor aspects of behavior should put the individual at ease and reduce the dysfunctional blocking effect associated with criticisms (Stone, Gueutal, & McIntosh, 1984). It is helpful to maximize information relating to performance improvements and minimize information concerning the relative performance of other employees (DeNisi & Kluger, 2000).
Be an Active Listener
Have you ever seen two people in a heated argument who are so intent on making their own points that each one has no idea what the other person is saying? That is the opposite of “active” listening, where the objective is to empathize, to stand in the other person’s shoes and try to see things from her or his point of view (Itzchakov, Kluger, & Castro, 2017).
For example, during an interview with her boss, a member of a project team says: “I don’t want to work with Sally anymore. She’s lazy and snooty and complains about the rest of us not helping her as much as we should. She thinks she’s above this kind of work and too good to work with the rest of us and I’m sick of being around her.” The supervisor replies, “Sally’s attitude makes the work unpleasant.”
By reflecting what the woman said, the supervisor is encouraging her to confront her feelings and letting her know that she understands them. Active listeners are attentive to verbal as well as nonverbal cues, and, above all, they accept what the other person is saying without argument or criticism. Listen to and treat each individual with the same amount of dignity and respect that you yourself demand.
Avoid Destructive Criticism and Threats to the Employee’s Ego
Destructive criticism is general in nature; is frequently delivered in a biting, sarcastic tone; and often attributes poor performance to internal causes (e.g., lack of motivation or ability). Evidence indicates that employees are strongly predisposed to attribute performance problems to factors beyond their control (e.g., inadequate materials, equipment, instructions, or time) as a mechanism to maintain their self-esteem (Larson, 1989). Not surprisingly, therefore, destructive criticism leads to three predictable consequences: (1) It produces negative feelings among recipients and can initiate or intensify conflict among individuals, (2) it reduces the preference of recipients for handling future disagreements with the giver of the feedback in a conciliatory manner (e.g., compromise, collaboration), and (3) it has negative effects on self-set goals and feelings of self-efficacy (Baron, 1988). Needless to say, this is one type of communication that managers and others would do well to avoid.
Set Mutually Agreeable and Formal Goals
It is important that a formal goal-setting plan be established during the appraisal interview (DeNisi & Kluger, 2000). There are three related reasons why goal setting affects performance. First, it has the effect of providing direction—that is, it focuses activity in one particular direction rather than others. Second, given that a goal is accepted, people tend to exert effort in proportion to the difficulty of the goal. Third, difficult goals lead to more persistence (i.e., directed effort over time) than do easy goals. These three dimensions—direction (choice), effort, and persistence—are central to the motivation/appraisal process (Katzell, 1994).
Research findings from goal-setting programs in organizations can be summed up as follows: Use participation to set specific goals, for they clarify for the individual precisely what is expected. Better yet, use participation to set specific, but difficult goals, for this leads to higher acceptance and performance than setting specific, but easily achievable, goals (Erez, Earley, & Hulin, 1985). These findings seem to hold across cultures, not just in the United States (Erez & Earley, 1987), and they hold for groups or teams, as well as for individuals (Matsui, Kakuyama, & Onglatco, 1987). It is the future-oriented emphasis in appraisal interviews that seems to have the most beneficial effects on subsequent performance. Top-management commitment is also crucial, as a meta-analysis of management-by-objectives programs revealed. When top-management commitment was high, the average gain in productivity was 56%. When such commitment was low, the average gain in productivity was only 6% (Rodgers & Hunter, 1991).
As an illustration of the implementation of these principles, Microsoft Corporation has developed a goal-setting system using the label SMART (Shaw, 2004). SMART goals are specific, measurable, achievable, results based, and time specific.
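As a purely illustrative sketch (the goal content and field names below are invented for this example and are not drawn from Shaw, 2004), a single SMART goal can be written out so that each of the five attributes is stated explicitly:

# Hypothetical example of one SMART goal, with one entry per attribute.
smart_goal = {
    "specific": "Reduce the call-back rate for billing inquiries handled by the accounts team",
    "measurable": "Call-back rate per 100 resolved inquiries, tracked weekly from the service log",
    "achievable": "A 10% reduction relative to last quarter's baseline",
    "results_based": "Success is judged on the call-back rate itself, not on hours spent",
    "time_specific": "To be reached by the end of the next quarter",
}

Writing goals in this form during the appraisal interview makes it straightforward to check later whether each attribute has actually been met.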
Continue to Communicate and Assess Progress Toward Goals Regularly
When coaching is a day-to-day activity, rather than a once-a-year ritual, the appraisal interview can be put in proper perspective: It merely formalizes a process that should be occurring regularly anyway. Periodic tracking of progress toward goals helps keep the subordinate’s behavior on target, provides the subordinate with a better understanding of the reasons why his or her performance is judged to be at a given level, and enhances the subordinate’s commitment to effective performance.
Make Organizational Rewards Contingent on Performance
Research results are clear-cut on this issue. Subordinates who see a link between appraisal results and employment decisions are more likely to prepare for appraisal interviews, more likely to take part actively in them, and more likely to be satisfied with the appraisal system (Burke et al., 1978). Managers, in turn, should pay careful attention to these results. In short, linking performance with rewards in a clear and open manner is critical for future performance improvements (Aguinis, Joo, & Gottfredson, 2013; Ambrose & Schminke, 2009).
