
Developing Assessments: Who Makes Up These Questions Anyway?

In this blog, we provide a guide to developing assessments: the different types of assessments and what distinguishes a good one.

A lot can depend on an assessment. How you perform on an assessment can determine which school you are able to attend. Personality assessment results can influence whether you get hired for a job. Scoring well or poorly on an assessment can mean the difference between being licensed to operate a vehicle and not. But, frustratingly enough, we have all encountered assessments that made us wonder whether they are really accurate or fair. We may think, “Who makes up these questions?” Television shows on how things are made fascinate some people. There are shows on how various foods are prepared, how furniture is made, how clothes are designed, and even on the making of romantic relationships. But, despite their impact on our lives, I have never come across a show on how assessments are properly made. So, let’s take a look at what makes a good assessment.

Types of assessment

Before we begin talking about how a good assessment is made, we need to distinguish between two types of assessments. First, there are the “for fun” quizzes. These are what you commonly see on social media. Generally, you should not expect much scientific rigor to have gone into developing these. Some, with titles such as “What kind of animal would you be?”, clearly let you know that you should not take the results too seriously. Others may masquerade as serious assessments with titles such as “What does your choice in [fill in the current fad here] reveal about your personality?” Be cautious of these “for fun” quizzes posing as serious assessments! Second, there are “high stakes” assessments whose results can have a significant impact on our lives. These tests demand scientific rigor in their development. A good high-stakes assessment has five characteristics. We will talk about these more in the next section.

What makes a good assessment

1. First, an assessment should measure what it says it measures. This is construct validity. Test developers must decide what it is that they intend an assessment to measure. For example, if an assessment is designed to measure sales focus, the developers must determine what behaviors, thoughts, attitudes, feelings, and reactions represent sales focus. This defining process may involve reviewing past research, interviewing individuals successful in sales, and reading trade articles discussing successful trends in sales.

For more objective characteristics such as knowledge of a programming language, this process is called content validation. Subject matter experts may be interviewed, or textbooks/technical documents reviewed to determine the topics to cover. After the construct to be measured has been defined, questions that reflect particular expressions of the construct are written, reviewed, and refined. Once the questions are tried out on the target population, the relationship between the assessment scores and scores on other assessments should be examined.

If scores on the new assessment are related to other characteristics that we would expect them to be related to (e.g., sales focus is related to a drive to achieve) and are not related to characteristics we would not expect them to be related to (e.g., sales focus not being related to math skills), then there is evidence that the new assessment measures what it is intended to measure. If unexpected relationships are found, then the assessment questions should be refined.
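This convergent/discriminant check is, at its core, a correlation analysis. As a minimal sketch, using entirely hypothetical pilot scores and the article's own examples (sales focus, drive to achieve, math skills), it might look like this:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical pilot scores for 8 test-takers on three measures.
sales_focus   = [12, 15, 9, 20, 14, 18, 7, 16]
drive_achieve = [10, 14, 8, 19, 13, 17, 6, 15]  # expected to relate to sales focus
math_skills   = [10, 9, 12, 11, 9, 12, 10, 11]  # expected NOT to relate

convergent   = pearson(sales_focus, drive_achieve)  # should be strong
discriminant = pearson(sales_focus, math_skills)    # should be near zero
```

A strong correlation with the related trait and a weak one with the unrelated trait is the pattern of evidence described above; in practice, test developers would use much larger samples and formal statistical tests rather than eyeballing two coefficients.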

2. Second, an assessment should give the same or highly similar results when people take it more than once. This is reliability. Reliability can be built into an assessment in a couple of ways. Within an assessment, each question score should have a relationship (i.e., correlation) with the overall score. If the score on an item is always low when the overall score is high, or high when the overall score is low, that item should be removed. Additionally, the assessment can be given to individuals on two separate occasions that are close enough together that you would not expect any change in the construct measured. If many people change their response to a question between the two administrations, that question should be removed from the assessment.
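The within-assessment check described above is known as an item-total correlation. As a rough sketch with made-up response data (6 test-takers, 4 items on a 1–5 scale, where a hypothetical fourth item runs against the rest):

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical pilot responses: 6 test-takers x 4 items, each scored 1-5.
responses = [
    [4, 5, 4, 2],
    [5, 4, 5, 1],
    [2, 2, 1, 4],
    [3, 3, 3, 3],
    [5, 5, 4, 1],
    [1, 2, 2, 5],
]

totals = [sum(row) for row in responses]
item_total = {}
for i in range(len(responses[0])):
    item_scores = [row[i] for row in responses]
    # Correlate each item with the total of the *other* items, so the item
    # does not inflate its own correlation (corrected item-total correlation).
    rest = [t - s for t, s in zip(totals, item_scores)]
    item_total[i] = pearson(item_scores, rest)

# Items with a low or negative item-total correlation are candidates for removal.
```

In this toy data the fourth item correlates negatively with the rest of the assessment, which is exactly the "low when the overall score is high" pattern the article says should get an item removed.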


3. Third, an assessment should produce scores that are suitable for their intended use or interpretation. This is called criterion validity. If an assessment is designed to predict success in a sales job role, then it should be shown that people scoring high on the assessment also have higher sales than those who score lower. If scores on a question are not related to higher sales, the question should be removed from the assessment. It should be noted that an assessment in itself is not “valid.” It is the use or interpretation of the scores that needs to be valid. Scores on a sales focus assessment may be valid for predicting sales performance, but not for predicting extroverted behavior.
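One simple way to see the claim above in data is to split pilot test-takers into high and low scorers and compare their outcomes on the criterion. A minimal sketch, with hypothetical assessment scores paired with hypothetical quarterly sales figures:

```python
# Hypothetical pilot data: (assessment score, quarterly sales in $k) per person.
pilot = [(62, 210), (45, 150), (78, 260), (51, 180),
         (85, 300), (39, 120), (70, 240), (58, 190)]

pilot.sort()  # order by assessment score
half = len(pilot) // 2
low_scorers  = [sales for _, sales in pilot[:half]]
high_scorers = [sales for _, sales in pilot[half:]]

mean_low  = sum(low_scorers) / len(low_scorers)
mean_high = sum(high_scorers) / len(high_scorers)
# Evidence of criterion validity: high scorers sell more on average.
```

A median split like this is only a first look; an actual validation study would report a validity coefficient (a correlation between scores and the criterion) with an appropriate sample size and significance test.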

4. Fourth, an assessment should not give an unfair advantage or disadvantage to a group of people based on a characteristic the assessment is not intended to measure. Questions should not contain references to situations, cultural norms, knowledge, or experiences that one or more subgroups (e.g., race, gender, culture) generally have not had access to and that are not directly related to the construct being measured. For example, an assessment measuring a preference for outdoor work could include a question asking if the individual likes to garden. However, this could disadvantage a group who disproportionately live in apartment complexes. Before final inclusion in an assessment, questions should be tested to see if any group consistently scores better or worse on the question than other groups.
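The crudest version of that group check is comparing how often each group endorses a question. A sketch using the article's gardening example with invented response data (real fairness analyses use formal differential item functioning methods, such as Mantel-Haenszel, that control for overall ability; this is only a first screen):

```python
# Hypothetical responses to "Do you like to garden?" (1 = yes) by subgroup.
group_a = [1, 1, 0, 1, 1, 1, 0, 1]  # e.g., respondents with yard access
group_b = [0, 0, 1, 0, 0, 0, 0, 1]  # e.g., apartment dwellers

rate_a = sum(group_a) / len(group_a)
rate_b = sum(group_b) / len(group_b)
gap = abs(rate_a - rate_b)

# Arbitrary screening threshold for illustration; flagged items get
# reviewed for content unrelated to the construct being measured.
flag_for_review = gap > 0.2
```

Here the large endorsement gap would flag the gardening item for review, matching the article's point that the question reflects living situation rather than a genuine preference for outdoor work.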

5. The fifth characteristic could be considered a bonus to have. A good assessment has the appearance of measuring what it is designed to measure. This is called face validity. For example, if an assessment is designed to measure teamwork skills in the workplace but all the questions are about activities outside the workplace, the assessment may have low face validity. Face validity helps those taking the assessment to accept and believe the results.

To be properly used, a good assessment still needs to be administered correctly and the results interpreted and applied correctly. But that is a discussion for another day.

Developing assessments requires a proper scientific approach. SHL’s assessments are rigorously developed following the current scientific best practices. If you have any questions about our assessments, please contact us.



Eric Popp

Dr. Eric Popp is a managing research scientist at SHL. He directed the selection process for an international non-profit organization for 10 years prior to attending graduate school. He received his Ph.D. in Applied Psychology from the University of Georgia in 2004, after which he spent two years teaching at Eastern Kentucky University. In 2006 he joined SHL, where he has been involved in multiple areas, including validation and business outcome studies; development of cognitive and personality item content; localization of assessments; development of branching, animation-based SJTs; ROI estimations; and development of competency-model-based job-analysis content and competency proficiency levels.
