How well does the thinking skills survey measure what it claims to?

This is a report on a validation study using simulated responses grounded in a large, publicly available personality dataset.

Purpose

The thinking skills survey is designed to give each respondent a meaningful profile across three dimensions: critical thinking, creative thinking and systems thinking. For that profile to be useful, the survey must demonstrably measure what it claims to — the questions must collectively distinguish between people with genuinely different thinking tendencies, produce scores that span a useful range and yield consistent results across the three population variants (student, professional and executive).

Testing these properties normally requires collecting responses from a large sample of people who have also completed independent, well-validated measures of the same constructs, a process that takes considerable time and resources. A complementary approach, adopted here, is to simulate a large set of synthetic respondents whose thinking tendencies are derived from an established empirical source, and then ask whether the survey's scoring algorithm recovers those tendencies reliably. This is called model-based synthetic validation.

The analysis described below was not conducted to replace empirical validation with real respondents, but to test the instrument's structural properties and identify any design issues before such validation is attempted.

Method
Survey instrument

The survey consists of 15 scenario-based questions, each presenting two or three response options. Each option is associated with one of the three thinking dimensions (critical, creative or systems thinking). The maximum score per dimension is 10 points, and the forced-choice format means that a high score on one dimension necessarily implies lower scores on the others.
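
As a sketch of the scoring logic, the routine below tallies forced-choice selections into dimension totals. The option-to-dimension key and the per-question point values shown are hypothetical placeholders, not the survey's actual answer key.

    # Minimal sketch of forced-choice scoring. The key below is
    # HYPOTHETICAL: the real survey's option-to-dimension mapping and
    # point values are not reproduced here.
    answer_key = {
        1: {"a": ("CT", 1.0), "b": ("CR", 1.0), "c": ("ST", 1.0)},
        2: {"a": ("CR", 0.5), "b": ("ST", 0.5)},
        # ... entries for the remaining questions
    }

    def score_responses(responses):
        """responses: dict of question number -> chosen option letter."""
        totals = {"CT": 0.0, "CR": 0.0, "ST": 0.0}
        for q, choice in responses.items():
            dim, points = answer_key[q][choice]
            totals[dim] += points
        return totals

    score_responses({1: "b", 2: "a"})   # -> {"CT": 0.0, "CR": 1.5, "ST": 0.0}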

The survey is run in three variants: student, professional and executive. These variants present the same underlying questions in scenario contexts appropriate to each population. The scoring structure is identical across all three variants, but the detailed feedback is tailored to the scenarios relevant to each population.

Reference population

Synthetic respondents were generated by sampling from the IPIP Big-Five Factor Markers dataset, a large, publicly available collection of personality item responses collected online by the Open Source Psychometrics Project (Open Source Psychometrics Project, 2018). The dataset was filtered to retain participants with complete responses to all 50 personality items and plausible ages (16–75), yielding a usable pool of approximately 950,000 individuals. The five personality domains (Extraversion, Neuroticism, Agreeableness, Conscientiousness and Openness) were scored from individual items following the instrument's standard reverse-keying procedure (Goldberg, 1992) and converted to z-scores to place all five domains on a common scale.
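
The domain scoring step can be sketched as follows. The column names (E1 through O10), the 1-5 response scale, the file name and the particular reverse-keyed items shown are assumptions about the published data file; the actual keying follows Goldberg (1992).

    import pandas as pd

    # Load the raw item responses; file name and separator are assumed.
    raw = pd.read_csv("data.csv", sep="\t")

    # ILLUSTRATIVE reverse-keyed subset; the real keying follows
    # Goldberg (1992) and covers all 50 items.
    reverse_keyed = {"E2", "E4", "N2", "A1", "C2", "O2"}  # hypothetical

    def score_domain(df, domain):
        """Sum the ten items of one domain on a 1-5 scale,
        reversing negatively keyed items (6 - response)."""
        total = 0
        for i in range(1, 11):
            item = f"{domain}{i}"
            col = df[item]
            total = total + (6 - col if item in reverse_keyed else col)
        return total

    domains = ["E", "N", "A", "C", "O"]
    scores = pd.DataFrame({d: score_domain(raw, d) for d in domains})
    z_big5 = (scores - scores.mean()) / scores.std()  # common z-score scale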

Respondents were stratified into three subgroups by age to approximate the survey's three population variants: students (16–24), professionals (25–44) and executives (45–75). Random samples of near-equal size (666–668 individuals) were drawn from the three strata, giving a total simulated sample of N = 2,000.
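
A sketch of the stratification step, assuming the filtered pool from the previous sketch carries an age column; the per-stratum counts match the subgroup sizes reported in Figure 1, and the seed is assumed for reproducibility.

    import pandas as pd

    # pool: filtered DataFrame (complete items, ages 16-75) with an
    # "age" column, carried over from the scoring step above.
    strata = [("student", 16, 24, 666),
              ("professional", 25, 44, 666),
              ("executive", 45, 75, 668)]

    samples = []
    for label, lo, hi, n in strata:
        stratum = pool[(pool["age"] >= lo) & (pool["age"] <= hi)]
        samples.append(stratum.sample(n=n, random_state=0)
                              .assign(variant=label))
    simulated_pool = pd.concat(samples, ignore_index=True)  # N = 2,000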

Mapping personality to thinking dimensions

Each simulated respondent was assigned a latent score on each of the three thinking dimensions by applying a weighted combination of their Big Five domain scores. The weights were derived from published meta-analytic correlations between personality traits and the thinking constructs in question:

  • Critical thinking was modelled primarily as a function of high Openness (weighted r ≈ .40, reflecting the Intellect facet; Kaufman et al., 2016), high Conscientiousness (r ≈ .30; Noftle & Robins, 2007) and low Neuroticism (r ≈ −.20).

  • Creative thinking was modelled primarily as a function of high Openness (r ≈ .55; Feist, 1998) and high Extraversion (r ≈ .20; Batey & Furnham, 2006).

  • Systems thinking was modelled as a function of high Conscientiousness (r ≈ .25), Agreeableness (r ≈ .15) and moderate Openness (r ≈ .20), consistent with the perspective-taking and long-horizon orientation that characterises systemic reasoning (Dolansky & Moore, 2013).

These weights were applied as a linear combination to produce a latent score for each dimension, which was then standardised. The latent scores represent independent, empirically grounded estimates of each respondent's plausible profile — estimates derived without reference to the survey itself.
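
In code, the mapping reduces to a weight matrix applied to the z-scored domains. The weights are taken directly from the bullet list above; treating them as a simple linear combination with no interaction terms mirrors the description in the text.

    import numpy as np

    # Rows: CT, CR, ST. Columns: E, N, A, C, O (z-scored domains).
    # Weights are the meta-analytic correlations listed above.
    W = np.array([
        [0.00, -0.20, 0.00, 0.30, 0.40],   # critical thinking
        [0.20,  0.00, 0.00, 0.00, 0.55],   # creative thinking
        [0.00,  0.00, 0.15, 0.25, 0.20],   # systems thinking
    ])

    def latent_scores(z_big5):
        """z_big5: (n, 5) array in E, N, A, C, O order.
        Returns an (n, 3) array of standardised latent CT/CR/ST scores."""
        latent = z_big5 @ W.T
        return (latent - latent.mean(axis=0)) / latent.std(axis=0)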

Simulating responses

For each question, the probability of a simulated respondent choosing each available option was computed using a softmax function over the latent scores of the dimensions associated with those options. A noise parameter was included to represent realistic response variability — the degree to which people's moment-to-moment choices do not perfectly reflect their underlying tendencies. The noise level was calibrated so that the implied test-retest correlation across the 15 scenario questions approximated r = .74, the reliability coefficient reported for a comparable forced-choice thinking skills instrument (Dolansky & Moore, 2013).
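
One way to implement this step is a temperature-scaled softmax, where the temperature plays the role of the noise parameter; the exact parameterisation used in the analysis (temperature versus additive noise) is an assumption here.

    import numpy as np

    rng = np.random.default_rng(1)  # seed assumed

    def choose_option(latents, option_dims, temperature=1.0):
        """Sample one option for a single question.

        latents: dict of dimension label -> latent score, e.g.
            {"CT": 0.8, "CR": -0.2, "ST": 0.1}
        option_dims: dimension tied to each option, e.g. ["CT", "CR", "ST"]
        temperature: noise level; higher values flatten the choice
            probabilities (calibrated in the study so the implied
            test-retest correlation is about r = .74)
        """
        u = np.array([latents[d] for d in option_dims]) / temperature
        p = np.exp(u - u.max())   # subtract max for numerical stability
        p /= p.sum()
        return int(rng.choice(len(option_dims), p=p))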

All simulated responses were then scored using the survey's standard algorithm, producing a critical thinking (CT), creative thinking (CR) and systems thinking (ST) score for each of the 2,000 synthetic respondents.

Results
Score distributions

Across the full simulated sample, mean scores were 4.12 (SD = 1.96) for critical thinking, 4.55 (SD = 3.22) for creative thinking and 4.33 (SD = 2.45) for systems thinking, all on a 0–10 scale. Score distributions for all three dimensions were approximately symmetric with slight positive skew (CT skew = .26, CR = .15, ST = .18), suggesting that the instrument is capturing variation across the mid-range of each construct rather than clustering at either extreme (see Figure 1).

Figure 1. Score distributions for each thinking dimension: critical (CT), creative (CR) and systems thinking (ST). These are shown separately by respondent variant: student, professional and executive. Distributions are plotted as density histograms to allow comparison across groups of equal size. Scores range from 0 to 10. The three variants produce closely overlapping distributions for all dimensions, indicating that the survey does not systematically favour one population group over another. (N = 2,000; student n = 666, professional n = 666, executive n = 668.)

Some floor effects were observed for creative thinking (14.7% of respondents scoring at or near zero) and systems thinking (6.0%), compared with 2.6% for critical thinking. Ceiling effects were modest across all three dimensions (CR: 7.9%, ST: 1.6%, CT: 0.4%). The creative thinking floor effect is consistent with the scoring structure: because the three dimensions share a fixed pool of points, a respondent with a strong preference for two dimensions will necessarily score low on the third.

Discriminant validity

For a three-dimensional instrument to be useful, the three scores should be meaningfully distinct from one another: a respondent who scores high on critical thinking should not automatically score high on creative or systems thinking simply because of how the instrument is constructed.

Because this survey uses a forced-choice format, some negative correlation between dimensions is mathematically unavoidable: points allocated to one dimension reduce the points available to the others. This is a well-understood property of forced-choice instruments, sometimes called ipsativity, and is not a design flaw. The key question is whether the inter-dimension relationships reflect genuine construct distinctiveness rather than a mere scoring artefact.

The correlation between critical and creative thinking was r = −.649, and between creative and systems thinking r = −.795. Both reflect the ipsative constraint combined with the substantive tendency for these construct pairs to compete for the same items. Notably, the correlation between critical and systems thinking was near zero (r = .054), suggesting that these two dimensions function as substantially independent constructs within the survey: respondents who score high on one are neither more nor less likely to score high on the other (see Figure 2). This pattern is consistent with theoretical accounts that position critical and systems thinking as complementary rather than opposing orientations (DeYoung et al., 2007).

Figure 2. Pearson correlation coefficients between the three dimension scores (N = 2,000). Negative correlations between dimensions reflect the expected ipsativity effect of the forced-choice format. The near-zero correlation between critical (CT) and systems (ST) thinking (r = .054) is consistent with these two constructs functioning as relatively independent orientations. All off-diagonal values are distinct from one another, supporting the discriminant validity of the three-factor structure.

Convergent validity

Convergent validity refers to the degree to which the survey's scores agree with independent measures of the same constructs. Because no external measure of CT, CR, and ST was directly available for comparison, the latent scores derived from Big Five personality data served as an independent criterion: respondents with a high latent CT score were expected to score higher on the CT dimension of the survey, and so on.

Pearson correlations between each dimension's latent score and its corresponding observed survey score were r = .386 (p < .001) for critical thinking, r = .397 (p < .001) for creative thinking and r = .360 (p < .001) for systems thinking (see Figure 3). These moderate correlations indicate that the survey is capturing construct-relevant variation in a direction consistent with an independent, empirically grounded estimate. Correlations of this magnitude are typical in personality-adjacent measurement, where constructs are broad and where context shapes responses in ways that aggregate instruments do not fully predict.

Off-diagonal correlations — between each dimension's latent score and the other two observed scores — were consistently lower than the diagonal values, providing additional evidence of discriminant validity.
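
A compact way to check both properties at once is the full 3 × 3 latent-by-observed correlation matrix: diagonal entries are the convergent validities reported above, and each should exceed the off-diagonal entries in its row. A minimal sketch:

    import numpy as np

    def latent_observed_matrix(latent, observed):
        """latent, observed: (n, 3) arrays in CT, CR, ST order.
        Returns R where R[i, j] = corr(latent dim i, observed dim j)."""
        R = np.empty((3, 3))
        for i in range(3):
            for j in range(3):
                R[i, j] = np.corrcoef(latent[:, i], observed[:, j])[0, 1]
        return R

    # Convergent validity holds when each diagonal entry dominates its row:
    # R = latent_observed_matrix(latent, observed)
    # assert all(R[i, i] == R[i].max() for i in range(3))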

Figure 3. Relationship between each dimension's latent score (derived from Big Five personality data; horizontal axis) and its observed survey score (vertical axis), for critical thinking (CT), creative thinking (CR) and systems thinking (ST). Black lines show ordinary least-squares regression fits. Pearson correlations: CT r = .386, CR r = .397, ST r = .360; all p < .001, N = 2,000. The positive relationship in all three panels indicates that the survey recovers, at a statistically significant level, the construct-relevant variance represented by the independent personality-based estimates.

Profile composition

Respondents were classified according to their dominant dimension: the dimension on which they scored highest, provided the gap between the highest and lowest scores exceeded two points. Where the range was two points or less, respondents were classified as balanced. Across the full sample, creative thinking was the modal profile (39.6%), followed by systems thinking (29.5%), critical thinking (21.2%) and balanced (9.6%) (see Figure 4). Profile composition was notably consistent across the three variants: differences between student, professional and executive respondents were negligible in magnitude on all three dimensions, and only the CT comparison approached conventional significance (CT: F(2, 1997) = 3.00, p = .050, η² = .003; CR: F(2, 1997) = 0.95, p = .389, η² < .001; ST: F(2, 1997) = 0.03, p = .971, η² < .001).
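
The classification rule translates directly into code. The tie-breaking behaviour when two dimensions share the top score is not specified in the text and is an assumption here (first key wins).

    def classify_profile(scores, threshold=2.0):
        """scores: dict of dimension -> score on the 0-10 scale.
        Returns 'balanced' when the highest and lowest scores differ
        by no more than `threshold`, else the dominant dimension.
        Ties at the top are broken by dict order (an assumption)."""
        values = scores.values()
        if max(values) - min(values) <= threshold:
            return "balanced"
        return max(scores, key=scores.get)

    classify_profile({"CT": 3.0, "CR": 7.0, "ST": 4.0})   # -> "CR"
    classify_profile({"CT": 4.0, "CR": 5.5, "ST": 4.5})   # -> "balanced"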

Figure 4. Profile prevalence by variant, expressed as a percentage of respondents. Profiles were classified based on the dimension with the highest score, provided the gap between the highest and lowest scores exceeded two points; respondents who did not meet this threshold were classified as balanced. Profile composition was highly consistent across all three variants, with no differences reaching statistical significance (all η² ≤ .003). Creative thinking (CR) was the most prevalent profile in all three groups.

Internal consistency

Internal consistency refers to the degree to which the items associated with a given dimension produce correlated scores — that is, whether a respondent who chooses the creative thinking option on one question tends to do so on others. Because this survey uses a forced-choice rather than a Likert-style format, standard internal consistency estimates (such as Cronbach's α) are not directly applicable. Instead, split-half reliability was computed by repeatedly dividing each dimension's items into random halves, correlating the two half-scores, and applying the Spearman-Brown correction to estimate full-scale reliability (Nunnally & Bernstein, 1994). This was repeated 500 times per dimension and the results averaged.
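
The procedure can be sketched as below, assuming a per-item score matrix for one dimension; representing forced-choice responses as item-level contributions is an assumption here, as is the seed.

    import numpy as np

    rng = np.random.default_rng(2)  # seed assumed

    def split_half_reliability(item_scores, n_splits=500):
        """item_scores: (n_respondents, n_items) array of each item's
        contribution to one dimension's total. Returns the mean
        Spearman-Brown corrected split-half correlation over random splits."""
        n_items = item_scores.shape[1]
        estimates = []
        for _ in range(n_splits):
            perm = rng.permutation(n_items)
            half_a = item_scores[:, perm[: n_items // 2]].sum(axis=1)
            half_b = item_scores[:, perm[n_items // 2 :]].sum(axis=1)
            r = np.corrcoef(half_a, half_b)[0, 1]
            estimates.append(2 * r / (1 + r))   # Spearman-Brown correction
        return float(np.mean(estimates))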

Reliability was strong for creative thinking (rSB = .860, SD = .008) and adequate for systems thinking (rSB = .691, SD = .036). For critical thinking, the estimate was below conventional thresholds (rSB = .486, SD = .079). This is worth noting as a design consideration, and suggests that additional CT-focused items with varied foils might strengthen the instrument's internal consistency for this dimension.

Cross-variant equivalence

A useful survey should produce comparable score distributions regardless of which variant a respondent completes. The results support this property. Mean scores across the three variants were similar for all three dimensions (see profile composition above), and the relationship between latent personality scores and observed survey scores was consistent across groups: convergent validity correlations ranged from r = .33 to r = .43 across variants and dimensions, with no systematic differences between student, professional, and executive subgroups.

Limitations

Several limitations of this analysis should be kept in mind when interpreting these results.

First, the analysis is model-based rather than empirical. The simulated respondents were generated from a mathematical model linking Big Five personality traits to thinking dimensions, using weights derived from published meta-analyses. That model is an approximation: the Big Five and the three thinking dimensions of this survey are related but distinct constructs, and the correlations assumed in the model carry uncertainty. The convergent validity estimates in particular depend directly on the accuracy of those weights, and should be interpreted accordingly.

Second, the IPIP dataset from which respondents were sampled was collected via an online self-selection process, and may not be representative of any specific population. Age-based stratification was used as a proxy for the student/professional/executive distinction, but age alone is an imperfect proxy for career stage and role level.

Third, the use of a forced-choice format introduces ipsativity — the mathematical constraint that high scores on one dimension must come at the expense of others. This limits the conclusions that can be drawn from inter-dimension comparisons and makes it inadvisable to interpret any single dimension's score in isolation from the others. All results reported here should be understood in the context of this constraint.

These limitations do not undermine the utility of the analysis as a pre-validation diagnostic. They do, however, reinforce the importance of conducting empirical validation with real respondents — ideally including independent measures of critical, creative, and systems thinking — as the survey accumulates use.

Summary

The model-based validation analysis provides reasonable grounds for confidence in the survey instrument's structural properties. Scores span a useful range, the three dimensions show the expected pattern of inter-correlations, observed scores are associated with independently derived personality-based estimates at a statistically significant level, and profile classifications are consistent across the three population variants.

The main finding warranting further attention is the lower internal consistency of the critical thinking dimension, which suggests that the items contributing to CT scores are more susceptible to response variability than those contributing to CR and ST. This is a design consideration to monitor as real-world response data accumulates.

Participants can be reasonably confident that their results reflect genuine tendencies in how they approach complex problems and decisions, while recognising that no self-report instrument provides a complete or final account of such tendencies.

References
  • Batey, M., & Furnham, A. (2006). Creativity, intelligence, and personality: A critical review of the scattered literature. Genetic, Social, and General Psychology Monographs, 132(4), 355–429. https://doi.org/10.3200/MONO.132.4.355-430

  • DeYoung, C. G., Quilty, L. C., & Peterson, J. B. (2007). Between facets and domains: 10 aspects of the Big Five. Journal of Personality and Social Psychology, 93(5), 880–896. https://doi.org/10.1037/0022-3514.93.5.880

  • Dolansky, M. A., & Moore, S. M. (2013). Quality and safety education for nurses (QSEN): The key is systems thinking. OJIN: The Online Journal of Issues in Nursing, 18(3). https://doi.org/10.3912/OJIN.Vol18No03Man01

  • Feist, G. J. (1998). A meta-analysis of personality in scientific and artistic creativity. Personality and Social Psychology Review, 2(4), 290–309. https://doi.org/10.1207/s15327957pspr0204_5

  • Goldberg, L. R. (1992). The development of markers for the Big-Five factor structure. Psychological Assessment, 4(1), 26–42. https://doi.org/10.1037/1040-3590.4.1.26

  • Kaufman, S. B., Quilty, L. C., Grazioplene, R. G., Hirsh, J. B., Gray, J. R., Peterson, J. B., & DeYoung, C. G. (2016). Openness to experience and intellect differentially predict creative achievement in the arts and sciences. Journal of Personality, 84(2), 248–258. https://doi.org/10.1111/jopy.12156

  • Noftle, E. E., & Robins, R. W. (2007). Personality predictors of academic outcomes: Big five correlates of GPA and SAT scores. Journal of Personality and Social Psychology, 93(1), 116–130. https://doi.org/10.1037/0022-3514.93.1.116

  • Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric theory (3rd ed.). McGraw-Hill.

  • Open Source Psychometrics Project. (2018). IPIP Big-Five Factor Markers data [Data set]. https://openpsychometrics.org/_rawdata/

Acknowledgements

This synthetic validation was conducted using Claude Code, which used Python and Jupyter to complete the analysis. The report presented here is edited from a Claude-generated Jupyter Notebook summary of the results.