Evaluating gamification

Gamification is the application of game design principles and mechanics to systems that are not themselves games, with the aim of enhancing user engagement, motivation, and task performance. Evaluating whether a gamified system actually achieves those aims is not straightforward. The most direct method would be a longitudinal study tracking user behaviour over months or years, but this is rarely practical within the timeframe of an academic project.

A practical solution is to combine three complementary approaches: heuristic evaluation by domain experts, psychometric measurement of the user experience, and behavioural analytics drawn from event data. Together they build a detailed picture of how a gamified system affects its users, and each compensates for the limitations of the others.

Before selecting specific instruments, however, it is necessary to understand the theoretical frameworks that underpin gamification design. Without this understanding, evaluation findings are difficult to interpret.

Theoretical frameworks

Self-Determination Theory

The most widely used theoretical basis for evaluating gamification is Self-Determination Theory (SDT), which holds that sustained motivation depends on satisfying three innate psychological needs. Autonomy is the need to feel in control of one's own actions. Competence is the need to feel capable and to experience growth. Relatedness is the need for meaningful connection with others. When a system satisfies all three, users tend to move from extrinsic motivation — acting in order to obtain a reward — towards intrinsic motivation, where the activity is rewarding in itself.

This distinction matters for evaluation because extrinsic motivation is brittle. A system that relies entirely on points, badges, and leaderboards risks triggering the overjustification effect, in which the introduction of external rewards gradually diminishes whatever pre-existing interest the user had. An evaluation should therefore assess not only whether users are engaged, but whether the system's design supports the kind of engagement that is likely to persist.

The Octalysis framework

The Octalysis framework developed by Yu-kai Chou offers a practical tool for analysing the motivational architecture of a gamified system. It identifies eight Core Drives that stimulate human action, arranged on two axes. The first axis distinguishes white hat drives, which are positive and empowering, from black hat drives, which create urgency through scarcity, unpredictability, or fear of loss. The second axis distinguishes drives that are primarily extrinsic (left-brain) from those that are primarily intrinsic (right-brain).

White hat drives include Epic Meaning and Calling (believing one is part of something larger), Development and Accomplishment (progress and skill-building), and Empowerment of Creativity and Feedback (creative expression with visible results). Black hat drives include Scarcity and Impatience, Unpredictability and Curiosity, and Loss and Avoidance. The remaining two drives, Ownership and Possession (attachment to assets or progress) and Social Influence and Relatedness, sit between the two extremes and can be deployed in either direction. A system weighted heavily towards black hat drives may produce high short-term engagement but is unlikely to sustain it, and may generate resentment over time.

Octalysis also identifies four phases of the user journey — Discovery, Onboarding, Scaffolding, and Endgame — and prompts the question of whether the system's gamification features are effective across all four or only during the initial period of novelty.

User archetypes

Not all users respond to the same gamification features in the same way, and this has practical implications for how you recruit and segment participants in an evaluation. The Hexad model identifies six user archetypes based on their primary motivations.

| Archetype | Primary motivation | Preferred game elements |
| --- | --- | --- |
| Philanthropists | Purpose and altruism | Gifting, knowledge sharing, administrative roles |
| Achievers | Mastery and competence | Progress bars, levels, challenges, badges |
| Socialisers | Relatedness and connection | Guilds, team tasks, shared leaderboards |
| Free Spirits | Autonomy and exploration | Exploratory tasks, customisation, Easter eggs |
| Players | Extrinsic rewards | Points, virtual economy, lotteries |
| Disruptors | Change and influence | Voting, modding, anonymous contribution |

If your participant group is dominated by one archetype, results may not generalise. For example, a system that performs well with Achievers may receive poor ratings from Socialisers if competitive features dominate at the expense of collaborative ones. Using the Hexad scale to profile participants before the study allows you to account for this when interpreting results.
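One way to operationalise this check is to profile participants with the Hexad scale and then break engagement results down by archetype before interpreting them. The sketch below is a minimal illustration; the participant records, archetype labels, and engagement values are entirely hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical participant records: a Hexad archetype plus an engagement
# score (e.g. a UES-SF total on a 1-5 scale). Values are illustrative only.
participants = [
    {"archetype": "Achiever",    "engagement": 4.2},
    {"archetype": "Achiever",    "engagement": 3.9},
    {"archetype": "Socialiser",  "engagement": 2.8},
    {"archetype": "Free Spirit", "engagement": 3.5},
]

def engagement_by_archetype(records):
    """Group engagement scores by Hexad archetype, reporting n and mean."""
    groups = defaultdict(list)
    for r in records:
        groups[r["archetype"]].append(r["engagement"])
    return {a: {"n": len(s), "mean": round(mean(s), 2)} for a, s in groups.items()}

profile = engagement_by_archetype(participants)
# A single archetype contributing most of the sample is a warning sign
# that results may not generalise beyond that user type.
dominant = max(profile.items(), key=lambda kv: kv[1]["n"])
```

A breakdown like this does not fix an unbalanced sample, but it makes the imbalance visible in the write-up rather than hiding it inside an aggregate score.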

Heuristic evaluation

Heuristic evaluation is an inspection method in which a small group of evaluators — typically three to five — independently assess a system against a set of validated design principles, then aggregate their findings to identify and prioritise issues. It is particularly valuable in short-term research because it can be carried out before or alongside user testing, and it does not require large numbers of participants.

Tondello et al. (2019) developed the Gameful Design Heuristics (GDH), a set of 28 heuristics organised into intrinsic motivation affordances, extrinsic motivation affordances, and context-dependent affordances. The table below shows a representative selection.

| ID | Heuristic | Guiding question |
| --- | --- | --- |
| I1 | Meaning | Does the system help users understand the benefits of their actions? |
| I3 | Increasing challenge | Do challenges grow in difficulty as the user's skill increases? |
| I4 | Onboarding | Is there a tutorial that is both engaging and appropriate for newcomers? |
| I6 | Progressive goals | Are immediately achievable next goals always visible? |
| I8 | Choice | Does the system offer multiple paths for achieving results? |
| I9 | Self-expression | Can users personalise their presence or create content? |
| I11 | Social interaction | Are social interactions meaningful relative to the application's goals? |
| I13 | Social competition | Can users compare themselves or challenge others? |
| E1 | Ownership | Is there an evolving profile or set of virtual goods to acquire? |
| E2 | Rewards | Are incentives proportional to the effort and time invested? |
| E4 | Scarcity | Are some features or rewards rare or difficult to obtain? |
| C3 | Graspable progress | Is the user's current progression and next step clearly communicated? |
| C5 | Varied rewards | Are some rewards randomised to provide unexpected variability? |

The evaluation process involves each evaluator working through the application independently, recording violations against each heuristic and assigning a severity rating. The results are then combined, duplicate issues are merged, and the aggregate list is prioritised by severity. This structured approach tends to surface motivational design issues that general usability heuristics would overlook.
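The merge-and-prioritise step can be sketched in a few lines. This is a simplified model, assuming findings are keyed by heuristic ID and that duplicates share an ID; in practice merging also requires judging whether two differently worded issues are really the same. The findings, severity values (on a 0-4 scale), and IDs below are illustrative.

```python
from collections import defaultdict

# Hypothetical findings from several evaluators: (heuristic_id, issue, severity).
findings = [
    ("I3", "Difficulty never increases after level 2", 3),
    ("I3", "Challenges plateau early", 3),                 # duplicate of the above
    ("E2", "Badge rewards feel trivial for the effort", 2),
    ("C3", "Next goal is hidden behind two menus", 4),
]

def aggregate(findings):
    """Merge findings reported against the same heuristic, keep the highest
    severity any evaluator assigned, and sort the result by severity."""
    merged = {}
    counts = defaultdict(int)
    for hid, issue, sev in findings:
        counts[hid] += 1
        if hid not in merged or sev > merged[hid][1]:
            merged[hid] = (issue, sev)
    # Most severe first; ties broken by how many evaluators reported the issue.
    return sorted(
        ((hid, issue, sev, counts[hid]) for hid, (issue, sev) in merged.items()),
        key=lambda row: (-row[2], -row[3]),
    )

priorities = aggregate(findings)
```

Counting how many evaluators independently hit the same heuristic is useful evidence in itself: an issue found by all evaluators is rarely a false positive.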

Psychometric measurement

Heuristic evaluation identifies design problems, but it cannot capture what the experience actually feels like to a user. Validated psychometric instruments address this by providing a systematic way to quantify subjective states such as engagement, immersion, and flow.

The User Engagement Scale (UES-SF)

The User Engagement Scale short form (UES-SF) is a 12-item questionnaire that measures engagement across four dimensions: Focused Attention (absorption and loss of time awareness), Perceived Usability (ease of use and absence of frustration), Aesthetic Appeal (visual and sensory attractiveness), and Reward Factor (interest, value, and a sense that the time was well spent). Items are rated on a five-point scale.

To score the instrument, reverse-code the three Perceived Usability items (which are worded negatively) and then calculate the average score for each subscale. A total engagement score can be obtained by averaging all 12 items. The subscale scores are more informative than the total alone, because they indicate which aspect of the experience is driving or suppressing engagement.
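The scoring procedure above can be expressed directly in code. The item keys below are shorthand (the published form labels them FA-S.1 to RW-S.3), and the example responses are invented for illustration; the reverse-coding formula `6 - x` follows from the five-point scale described above.

```python
from statistics import mean

# UES-SF subscale membership: three items per dimension.
SUBSCALES = {
    "FA": ["FA1", "FA2", "FA3"],  # Focused Attention
    "PU": ["PU1", "PU2", "PU3"],  # Perceived Usability (negatively worded)
    "AE": ["AE1", "AE2", "AE3"],  # Aesthetic Appeal
    "RW": ["RW1", "RW2", "RW3"],  # Reward Factor
}

def score_ues_sf(responses):
    """Score one participant's UES-SF responses (1-5 Likert).
    Perceived Usability items are reverse-coded (6 - x) before averaging."""
    adjusted = {
        item: (6 - value if item.startswith("PU") else value)
        for item, value in responses.items()
    }
    scores = {
        name: mean(adjusted[i] for i in items)
        for name, items in SUBSCALES.items()
    }
    scores["total"] = mean(adjusted.values())  # overall engagement
    return scores

# Illustrative responses: low raw PU values become high usability after reversal.
responses = {"FA1": 4, "FA2": 5, "FA3": 4, "PU1": 2, "PU2": 1, "PU3": 2,
             "AE1": 3, "AE2": 4, "AE3": 3, "RW1": 5, "RW2": 4, "RW3": 5}
scores = score_ues_sf(responses)
```

Reporting the four subscale means alongside the total, as this sketch does, makes it possible to see at a glance whether, say, poor usability is dragging down an otherwise rewarding experience.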

The Game Experience Questionnaire (GEQ)

The Game Experience Questionnaire (GEQ) provides a more detailed view of the player's in-session experience. Its core module measures seven components on a five-point Likert scale: Immersion, Flow, Competence, Positive Affect, Negative Affect, Tension, and Challenge. For student projects, the In-game Module — a condensed 14-item version — is a practical alternative that can be administered at multiple points during a session to track how the experience changes over time.
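When the In-game Module is administered at several points in a session, the interesting quantity is how each component changes over time. The sketch below assumes per-component scores have already been computed for each administration; the session labels and values are hypothetical.

```python
# Hypothetical In-game GEQ administrations at three points in one session.
# Each entry holds per-component mean item scores (the GEQ uses a 0-4 scale).
sessions = {
    "t1": {"Flow": 2.0, "Competence": 1.5, "Tension": 0.5},
    "t2": {"Flow": 3.0, "Competence": 2.5, "Tension": 1.0},
    "t3": {"Flow": 2.5, "Competence": 3.0, "Tension": 2.0},
}

def component_trend(sessions, component):
    """Return one component's scores in administration order, plus net change."""
    ordered = [sessions[t][component] for t in sorted(sessions)]
    return ordered, ordered[-1] - ordered[0]

trend, change = component_trend(sessions, "Tension")
# Rising Tension late in a session can indicate fatigue or frustration that a
# single post-session questionnaire would average away.
```

Tracking components this way is the main payoff of repeated administration: a flat overall score can conceal competence rising while flow collapses.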

Both instruments should be administered immediately after the interaction session, before the participant has had time to reflect in ways that might alter their responses.

Experimental design

Because gamification studies typically involve human participants, you will need to decide between a within-subjects design and a between-subjects design. The table below summarises the key trade-offs.

| Design | Description | Advantages | Disadvantages |
| --- | --- | --- | --- |
| Within-subjects | Each participant experiences all conditions | Fewer participants needed; individual differences controlled | Risk of order effects (fatigue, learning, or carryover) |
| Between-subjects | Each participant experiences one condition only | No knowledge transfer between conditions | Requires more participants; higher variance |

For a student project, a within-subjects design is generally preferable because it requires fewer participants to achieve the same statistical power. The main risk is order effects — the possibility that completing one condition first biases performance or ratings in the second. The standard mitigation is counterbalancing: half the participants complete the gamified condition first, while the other half complete the control first. This ensures that any order effects are distributed evenly across conditions rather than being confounded with the treatment.
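Counterbalancing is simple to implement: shuffle the participant list, then alternate which condition comes first. The condition labels and participant IDs below are illustrative; a fixed seed is used only so the assignment is reproducible in the study records.

```python
import random

CONDITIONS = ("gamified", "control")

def counterbalance(participant_ids, seed=0):
    """Assign each participant a condition order, alternating which condition
    comes first so the two orders are as evenly split as the sample allows."""
    ids = list(participant_ids)
    random.Random(seed).shuffle(ids)  # randomise who lands in which order
    orders = {}
    for i, pid in enumerate(ids):
        orders[pid] = (CONDITIONS[i % 2], CONDITIONS[(i + 1) % 2])
    return orders

orders = counterbalance(["p1", "p2", "p3", "p4"])
```

With an even number of participants this yields exactly half in each order, which is why recruiting in multiples of two (or of the number of conditions) is worth the small extra effort.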

For guidance on selecting the appropriate statistical test for your design, see the section on statistics.

Recommended evaluation process

The five phases below provide a structured approach that integrates the methods described in the preceding sections.

Phase 1: Planning and scoping. Define the specific behaviour you are trying to change or support (for example, time spent on a task, or completion rates for a learning activity). Identify the gamification features that are intended to drive that behaviour, and select the measurement instruments you will use — at minimum, one heuristic tool and one psychometric instrument.

Phase 2: Heuristic analysis. Before involving participants, conduct a GDH inspection with a small group of evaluators. Use the results to form hypotheses about which features are likely to succeed and which are likely to cause friction, and record these so that you can test them against the user data in Phase 4.

Phase 3: Controlled user interaction. Recruit participants and, if possible, use the Hexad scale to identify their user type. Have participants complete tasks under both the gamified and non-gamified conditions (within-subjects design), using counterbalancing to manage order effects. Administer the UES-SF and GEQ immediately after each session.

Phase 4: Quantitative triangulation. Analyse any event data or behavioural logs alongside the survey responses. Look for cases where what users report does not match what they did, and investigate whether the discrepancies correspond to features flagged in the heuristic evaluation. Tools such as Mixpanel and Amplitude provide funnel analysis and retention reporting that are well suited to this kind of investigation.
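One crude but useful screen for report-behaviour mismatches is to standardise both measures and flag participants where the two diverge. The sketch below is a minimal illustration under invented data; a real analysis would use a proper correlation or regression across a larger sample, and the behavioural measure (here, tasks completed) would come from the event logs.

```python
from statistics import mean

# Hypothetical per-participant data: self-reported engagement (UES-SF total,
# 1-5) alongside a behavioural measure extracted from the event logs.
data = [
    {"id": "p1", "reported": 4.5, "tasks_completed": 12},
    {"id": "p2", "reported": 4.4, "tasks_completed": 3},   # says engaged, did little
    {"id": "p3", "reported": 2.0, "tasks_completed": 11},  # says bored, did a lot
    {"id": "p4", "reported": 3.1, "tasks_completed": 8},
]

def flag_discrepancies(rows, z_threshold=1.0):
    """Flag participants whose standardised self-report and behaviour diverge
    by more than z_threshold standard deviations."""
    def zscores(values):
        m = mean(values)
        sd = (sum((v - m) ** 2 for v in values) / len(values)) ** 0.5
        return [(v - m) / sd for v in values]

    zr = zscores([r["reported"] for r in rows])
    zb = zscores([r["tasks_completed"] for r in rows])
    return [r["id"] for r, a, b in zip(rows, zr, zb) if abs(a - b) > z_threshold]

flagged = flag_discrepancies(data)
```

Flagged participants are candidates for closer inspection: cross-reference their session logs against the heuristic findings from Phase 2 to see whether a specific feature explains the gap.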

Phase 5: Synthesis and recommendations. Map the findings from all three methods against the theoretical frameworks introduced at the start of this page. For each finding, consider whether it reflects a problem with the motivational architecture (best addressed through the Octalysis or SDT lens), a design execution problem (best addressed through GDH findings), or a user-type mismatch (best addressed through the Hexad profile of the participant group).

Further reading

Gamification design frameworks

The Octalysis framework

Towards meaningful engagement

Gameful Design Heuristics: a gamification inspection tool

Applying Gameful Design Heuristics

User Engagement Scale — short form

Game Experience Questionnaire

Between-subjects and within-subjects designs