
The Personality Test That Doesn’t Know What to Do With You Is About to Get a Lot Smarter

Key Takeaways

  • Traditional DISC assessments oversimplify personality by forcing individuals into one of four categories, leading to misclassifications.
  • Recent machine learning research found that using algorithms can produce more accurate personality insights by capturing blended profiles instead of strict categories.
  • The study found that a carefully selected 10-question subset retained over 91% of the full 40-question test’s accuracy in replicating traditional DISC classifications.
  • Machine learning produces probabilistic outputs, describing individuals as blends of traits rather than forcing them into a single rigid classification.
  • Despite improvements, the DISC framework still faces challenges regarding construct validity compared to more robust personality models like the Big Five.

The categories came first. That’s always how it works in personality psychology: you decide there are four kinds of people, or five, or sixteen, and then you build a questionnaire that sorts everyone into one of them. The category precedes the person. It has to, because without categories the whole exercise produces nothing anyone can act on. This is fine, as far as it goes, except for a problem that personality researchers have quietly acknowledged for decades: most people don’t fit cleanly into any single category. They hover at the borders. They’re a bit of this, heavily that, with traces of something else scattered through. The category says one thing; the person is more complicated.

Now a team of researchers has run a machine learning system across more than a thousand personality assessments, and what they found suggests that the categories themselves may have been doing most of the damage all along.

The test in question is the DISC assessment, probably the most widely used personality instrument in corporate life. You’ve possibly taken it. Most working adults have, at some point, answered forty questions about their behavioral tendencies and been handed a letter telling them they’re a D (Dominant), an I (Influential), an S (Steady), or a C (Conscientious). Human resources departments use DISC for recruitment, team formation, leadership coaching. Its appeal has always been its simplicity. The four categories are legible, actionable, easy to explain in a workshop. The problem, which the model has carried since William Marston conceived it in the 1920s, is that legibility and accuracy are not the same thing.

Why do most people feel like their DISC result doesn’t quite fit them?

Because it probably doesn’t, exactly. The traditional DISC method assigns everyone to whichever of four categories scored highest on their questionnaire, even when their scores are nearly equal across two or more dimensions. A person who is strongly both Dominant and Conscientious gets filed as one or the other, with the distinction erased. The new research found that clustering algorithms, left to group people without predefined categories, naturally identify the same four types but also expose significant overlap at the edges, which is where most people actually live.

Is a 10-question personality test actually reliable enough to use in hiring?

The new study found that a carefully selected 10-item subset of the standard 40-question DISC assessment retained over 91% of the full test’s predictive accuracy. The key word is “carefully selected”: the ten questions were identified by an algorithm trained to find the most informative items, not chosen at random. That said, 91% accuracy here means accuracy in replicating the old scoring method, not accuracy in predicting job performance or team compatibility. Those are harder questions the study doesn’t directly address.

What does machine learning actually add to personality testing that the old method can’t do?

Two main things. First, it can detect hybrid or blended behavioral profiles rather than forcing everyone into a single box. Instead of “you are an S type,” a machine learning system can tell you that someone is predominantly steady, with meaningful conscientious traits and some influence tendencies, which is a richer and arguably more useful description. Second, it can identify which questions in a long assessment are actually doing predictive work and which are redundant, making it possible to build shorter tests without sacrificing much accuracy.

Does this research mean DISC is actually scientifically valid now?

Not quite. The study demonstrates that machine learning can replicate and refine DISC classifications efficiently, but it validates the system against itself rather than against external behavioral measures. DISC has faced longstanding academic criticism for limited construct validity compared to frameworks like the Big Five personality traits. Making DISC faster and more nuanced doesn’t resolve those foundational questions. The researchers are explicit about this: they’re enhancing the tool’s practical utility, not settling the debate about whether its underlying categories reflect genuine psychological reality.

Traditional DISC scoring is about as blunt as classification gets. The system adds up a participant’s responses across each behavioral dimension, finds whichever one scored highest, and assigns that label. If your Conscientiousness score is 32 and your Dominance score is 31, you’re a C. The person scoring 31 in Dominance, who is basically as dominant as a Dominance type gets, disappears into the C category and stays there. The method assumes the four categories are mutually exclusive. They aren’t, really, and everyone involved has always known they aren’t.
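Stripped to its logic, that scoring rule is a single argmax over four totals. A minimal sketch (the scores below are hypothetical, for illustration only):

```python
# Minimal sketch of traditional DISC scoring: sum responses per
# dimension, then report only the highest-scoring letter. Ties and
# near-ties are silently collapsed into a single label.

def traditional_disc_label(scores: dict) -> str:
    """Assign the single DISC letter with the highest total score."""
    return max(scores, key=scores.get)

# A hypothetical person who is almost equally Dominant and Conscientious:
person = {"D": 31, "I": 12, "S": 18, "C": 32}
print(traditional_disc_label(person))  # prints "C"; the one-point gap over D decides everything
```

Everything below the winning letter, including a Dominance score one point behind, is discarded before anyone sees it.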

Mohammad Hossein Amirhosseini, an Associate Professor in Computer Science at the University of East London, decided to see whether machine learning could do better. Working with colleague Fatima Kalabi, a clinician based at Queen’s Hospital in London, he assembled responses from just over 1,000 participants who had completed the standard 40-item DISC questionnaire, then ran six different machine learning models against the data.

The results were, depending on what you expected going in, either reassuring or somewhat unsettling. Logistic Regression and XGBoost both achieved 93.53% accuracy in replicating the traditional DISC classifications, with Logistic Regression pulling slightly ahead on cross-validation stability and interpretability. What this means, concretely, is that a trained algorithm can look at someone’s raw questionnaire responses and predict with high reliability how the old scoring method would have labeled them. The classification task isn’t hard, once you stop using blunt rules and let a system learn the underlying patterns. Random Forest and K-Nearest Neighbours lagged considerably, but four of the six models performed respectably without much coaxing.
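The replication task itself can be sketched on synthetic data. This is not the study’s dataset; the 10-items-per-dimension layout and the 1–5 response scale are assumptions made purely for illustration:

```python
# Sketch of the replication task on synthetic data (not the study's
# dataset). Generate fake 40-item responses, label each respondent with
# the traditional argmax rule, then test how well a classifier learns
# to reproduce those labels from the raw responses.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.integers(1, 6, size=(1000, 40))            # 1,000 respondents, 40 items
dim_of_item = np.repeat(np.arange(4), 10)          # assumed: 10 items per dimension
sums = np.stack([X[:, dim_of_item == d].sum(axis=1) for d in range(4)], axis=1)
y = sums.argmax(axis=1)                            # traditional label: highest sum wins

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=2000).fit(X_tr, y_tr)
print(f"test accuracy replicating the old rule: {clf.score(X_te, y_te):.2f}")
```

Because the traditional label is just an argmax over linear sums of the items, a linear classifier can learn it almost exactly, which is one reason the replication task turned out not to be hard.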

More interesting than the accuracy figures was what happened when Amirhosseini’s team started removing questions. Using a technique called Recursive Feature Elimination, which iteratively strips out the least informative items from a dataset until only the most predictive remain, they identified a subset of just ten questions from the original forty that could replicate the full assessment’s classifications with 91.49% accuracy. The jump from 10 to 20 questions added barely half a percentage point of accuracy. From 20 to 40 questions, another point and a half. The returns diminish steeply, which tells you something important: most of the forty questions aren’t doing much work.
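Recursive Feature Elimination is available off the shelf in standard ML libraries; a sketch on the same kind of synthetic data (again an assumed layout, not the study’s dataset or its chosen items):

```python
# Sketch of Recursive Feature Elimination on synthetic questionnaire
# data: iteratively drop the least informative items until ten remain.
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.integers(1, 6, size=(500, 40))             # synthetic 40-item responses
dim_of_item = np.repeat(np.arange(4), 10)          # assumed item-to-dimension mapping
y = np.stack([X[:, dim_of_item == d].sum(axis=1) for d in range(4)], axis=1).argmax(axis=1)

selector = RFE(LogisticRegression(max_iter=2000), n_features_to_select=10)
selector.fit(X, y)
kept_items = np.flatnonzero(selector.support_)     # indices of the 10 surviving questions
print(f"{selector.n_features_} items kept:", kept_items)
```

Which ten items survive depends entirely on the data; the study’s actual subset was selected from real responses, not a simulation like this.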

“A 10-question assessment tool that still captures the underlying personality structure would make these assessments far more practical in contexts such as recruitment, leadership development and team building,” said Dr Amirhosseini. He argued elsewhere that machine learning could retain DISC’s practical simplicity while adding analytical depth that the old scoring method simply cannot provide. The case for shorter assessments in organizational settings is partly about time. A forty-question form during a job interview is a significant ask; a ten-question version that delivers comparable results might actually get completed honestly rather than rushed through.

The clustering analysis is where the findings get philosophically interesting. The team applied unsupervised K-Means clustering to the responses, asking an algorithm to find natural groupings without telling it the DISC categories existed. The algorithm found four clusters. They aligned closely with the four DISC dimensions. Which is either a validation of Marston’s century-old framework (the categories emerge from the data, not just the scoring rules) or a sign that the questionnaire itself was designed to produce four-category data in the first place. Probably some of both.
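The unsupervised step can be sketched the same way. Here the synthetic respondents are given a hidden latent type that slightly boosts one block of items, an assumption standing in for whatever structure exists in real responses; K-Means is never told the types exist:

```python
# Sketch of unsupervised clustering: K-Means is asked for four groups
# without being told any category labels. Synthetic data with an
# assumed latent-type structure, not the study's responses.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
latent_type = rng.integers(0, 4, size=500)         # hidden type per respondent
dim_of_item = np.repeat(np.arange(4), 10)
boost = 1.5 * (dim_of_item[None, :] == latent_type[:, None])
X = rng.normal(3.0, 1.0, size=(500, 40)) + boost   # own-dimension items score higher

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
print("cluster sizes:", np.bincount(km.labels_))
# Each cluster centre is a 40-item mean profile; comparing centres shows
# which dimension dominates each cluster and how much profiles overlap.
```

If four distinct groupings genuinely exist in the data, centres like these recover them; if the overlap is large, the cluster boundaries blur, which is exactly the edge territory the article describes.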

What the clustering also revealed was something the traditional method was hiding: the edges. Between some clusters, particularly Assertiveness and Steadiness, the behavioral profiles showed notable overlap. People were genuinely occupying the territory between types: not misclassified into the wrong one, but expressing traits that belong to more than one. The machine learning models can surface this. Instead of a single categorical label, they produce probabilistic outputs, so a person might come back as 60% S, 30% C, 10% I rather than simply S. That’s a different kind of information, and arguably more honest.
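Probabilistic output of that kind is what standard classifier interfaces already expose. A sketch, with synthetic data and a hypothetical respondent (the item layout and letter ordering are assumptions):

```python
# Sketch of probabilistic classification: report a probability over all
# four types instead of one letter. Synthetic data; the mapping of
# items to dimensions and of classes to letters is assumed.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
X = rng.integers(1, 6, size=(800, 40))
dim_of_item = np.repeat(np.arange(4), 10)
y = np.stack([X[:, dim_of_item == d].sum(axis=1) for d in range(4)], axis=1).argmax(axis=1)

clf = LogisticRegression(max_iter=2000).fit(X, y)

respondent = X[0:1]                                # one respondent's raw answers
probs = clf.predict_proba(respondent)[0]           # a distribution over D, I, S, C
for letter, p in zip("DISC", probs):
    print(f"{letter}: {p:.0%}")
```

The single-letter verdict is just the argmax of this distribution; reporting the whole distribution is what lets a blended profile survive into the result.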

The reduced 10-item questionnaire, when clustered independently, produced the same four-cluster structure as the full 40-item version. The researchers confirmed this using distance comparisons between the cluster profiles of both models. The dominant behavioral dimensions within each cluster remained consistent. This matters because it suggests the shortened assessment isn’t just replicating labels; it’s preserving the underlying structure the labels were meant to capture.

There are real caveats. The sample, reasonably diverse in age and gender, was recruited through professional networks and online platforms, which introduces selection effects. More importantly, the study validated machine learning models against the traditional DISC scoring method rather than against any external measure of behavioral reality, so the 93.53% accuracy figure tells you the algorithm mirrors the old method well, not necessarily that the old method was right. The DISC model has been criticized by academic psychologists for decades for its limited construct validity compared to trait-based frameworks like the Big Five. Making DISC more efficient doesn’t resolve those foundational critiques.

What the research does suggest is that the part of personality assessment organizations have always found most frustrating, the people who don’t fit, might be exactly what machine learning is built for. A system that produces probabilistic outputs rather than forced categorical labels, that recognizes hybrid profiles rather than rounding them to the nearest type, represents a genuinely different approach to the question of what someone is like. Whether organizations will actually use that nuance, rather than simply demanding the letter back, is a different problem entirely.

DOI / Source: https://doi.org/10.52768/3067-7947/1037



