There’s no shortage of wonky assertions in the field of UX, specious claims based on bad methodology, spuriously supported by voodoo statistics. Jakob Nielsen, one of the biggest names in the field, isn’t immune. Perhaps most famously, the “five-user assumption,” which he originally proposed with Rolf Molich, has since been rather thoroughly debunked by Molich’s subsequent research. In another post I examined his claim that you only need 20 users in quantitative usability tests. Here I’d like to take a closer look at an extremely bizarre claim he made in a 2012 article.
In it he asks the important question of whether users are really more satisfied with designs that score higher on usability metrics. He claims that they are—70% of the time. So how does he know?
To be clear, we’re talking about the relationship between 1) users’ self-reported level of satisfaction with use and 2) objective usability performance metrics (time on task, task success rate, and number of errors). Nielsen looked at 298 designs for which the Nielsen Norman Group had collected both the subjective self-report measure and the objective behavioral measures. He then standardized the scores. So far so good. He then shows us the following unfortunately formatted graph.
Nielsen tells us the correlation between the usability performance metrics used and user satisfaction is…wait for it…r = .53.
You should now be doing a double take.
He just said users are more satisfied with designs that have greater usability 70% of the time. Now he’s saying the actual correlation is .53? So where the hell did the 70% come from? Well, hang on to your hats. It’s a strange bit of chicanery.
After nonchalantly reporting the actual correlation coefficient, Nielsen oddly drops the statistic altogether and bases his discussion on the scatterplot above. He argues that 30% of the dots are in the top-left and bottom-right quadrants, which are, respectively, where there was below-average usability coupled with above-average satisfaction, and above-average usability with below-average satisfaction. The remaining 70% of dots in the scatterplot are in the bottom left and top right. These “70% of dots” are his evidence that users prefer systems with higher usability scores “70% of the time.” (In the image below some better labels are added to make the graph easier to read.)
So…what’s wrong with this? Well, a lot actually.
Not only is Nielsen’s argument not a valid interpretation of the graph, but it also doesn’t align with the claim he is making. For instance, take a look at the dots highlighted below. These are all instances where usability was about average (close to a z-score of 0 on the horizontal) and where satisfaction was nevertheless below average (below 0 on the vertical). They are also all in the bottom-left quadrant, which means Nielsen wrongly considers them instances of users preferring systems with higher usability metrics.
This is a small matter compared to the question of why he is focusing on the quadrants in the first place. What matters is how far the dots fall from the regression line itself; that spread is what a correlation measures. One could move a lot of those dots closer to the regression line without changing the proportion that falls in each quadrant, and doing so would greatly increase the correlation.
All of this is counter to Nielsen’s reasoning. Out of the infinite number of lines you can draw through a pattern of dots in a scatterplot, the regression line, or “line of best fit,” is the single line that minimizes the total squared vertical distance to the dots. The greater the spread of the dots around that line, the weaker the relationship, period. It ultimately does not matter how many dots are in each quadrant.
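To make the quadrant point concrete, here is a minimal Python sketch using made-up data (not Nielsen’s). The function names, the value ranges, and the mirroring trick, which keeps both means at exactly zero so the quadrants are easy to count, are my own assumptions for illustration. Both datasets put 70% of their dots in the bottom-left and top-right quadrants, yet the second one, whose dots hug the line, has a far higher correlation than the first.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def concordant_share(xs, ys):
    """Share of dots in the bottom-left or top-right quadrant (split at the means)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    agree = sum(1 for x, y in zip(xs, ys) if (x - mx) * (y - my) > 0)
    return agree / n


def make_dataset(n_pairs, share_concordant, tighten, seed=0):
    """Hypothetical data: each dot is added with its mirror image so both means are zero."""
    rng = random.Random(seed)
    n_conc = round(n_pairs * share_concordant)
    xs, ys = [], []
    for i in range(n_pairs):
        x = rng.uniform(0.1, 2.0)
        y = rng.uniform(0.1, 2.0)
        if i < n_conc:
            if tighten:
                y = x  # concordant dot moved onto the line y = x
        else:
            if tighten:
                x, y = 0.1 * x, 0.1 * y  # pulled toward the origin, same quadrant
            y = -y  # discordant dot: above average on x, below average on y
        xs += [x, -x]
        ys += [y, -y]
    return xs, ys


loose_x, loose_y = make_dataset(150, 0.7, tighten=False)
tight_x, tight_y = make_dataset(150, 0.7, tighten=True)

print("loose:", concordant_share(loose_x, loose_y), pearson_r(loose_x, loose_y))
print("tight:", concordant_share(tight_x, tight_y), pearson_r(tight_x, tight_y))
```

The quadrant count is identical in both runs; only the spread around the line changes, and the correlation changes with it.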
Remember, Nielsen said the correlation coefficient, r, was .53. The effect size measure for r is the coefficient of determination, r². This is a measure of the proportion of the variance in one variable accounted for by knowledge of another. Thus, what Nielsen actually found is that the usability metrics used accounted for 28% of the variance in user satisfaction (since r = .53, r² = .2809). Obviously, this is a far cry from his claimed 70%.
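For anyone who wants to check the arithmetic, here is a small Python sketch. The first part just squares the reported r. The second part uses simulated data (my own assumption, not Nielsen’s; the variable names and the 0.5 slope are arbitrary) to show that, for a straight-line fit, squaring r gives the same number as the share of variance the regression line accounts for.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def variance_explained(xs, ys):
    """1 - (residual sum of squares / total sum of squares) for a least-squares line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    intercept = my - slope * mx
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# The figure reported in the article.
reported_r = 0.53
reported_r_squared = reported_r ** 2  # 0.2809, i.e. about 28%

# Simulated standardized scores, purely illustrative.
rng = random.Random(1)
usability = [rng.gauss(0, 1) for _ in range(298)]
satisfaction = [0.5 * u + rng.gauss(0, 1) for u in usability]

r = pearson_r(usability, satisfaction)
share = variance_explained(usability, satisfaction)
print(round(reported_r_squared, 4), r ** 2, share)
```

The last two printed values agree to within floating-point error, which is why r² can be read as a proportion of variance accounted for while r itself cannot.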
But wait, it gets worse.
It also seems that Nielsen is ignoring the assumptions of correlation. Look at his scatterplot again. (Last time, I promise.) We don’t need the raw data to see the data are heteroscedastic. Notice the conical nature of the dots in the image below. This means there’s a much weaker correlation at one end than the other. If these data violate the assumption of homoscedasticity, then it’s probably an invalid result that shouldn’t have even been reported.
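Without the raw data, the cone shape can only be judged by eye, but a rough check is easy to sketch once you have the numbers. The Python below uses simulated data in which the noise grows with x (my assumption, chosen to mimic a cone; it is not Nielsen’s data), fits one least-squares line, and compares the spread of the residuals in the lower and upper halves of x. A formal test such as Breusch-Pagan would be the more rigorous route; this is only an eyeball substitute.

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y regressed on x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return slope, my - slope * mx


def residual_sd(points, slope, intercept):
    """Root-mean-square distance of the given dots from the fitted line."""
    residuals = [y - (intercept + slope * x) for x, y in points]
    return math.sqrt(sum(e * e for e in residuals) / len(residuals))


# Simulated cone: the noise standard deviation rises from 0.1 to 1.1 as x goes 0 to 1.
rng = random.Random(2)
xs = [rng.uniform(0.0, 1.0) for _ in range(400)]
ys = [x + rng.gauss(0, 0.1 + x) for x in xs]

slope, intercept = fit_line(xs, ys)
points = sorted(zip(xs, ys))  # order the dots by x
half = len(points) // 2
sd_low = residual_sd(points[:half], slope, intercept)
sd_high = residual_sd(points[half:], slope, intercept)

print("residual spread, lower half of x:", sd_low)
print("residual spread, upper half of x:", sd_high)
```

If the two spreads are clearly different, a single r is summarizing a relationship that is tighter at one end than the other.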
For the sake of discussion, however, let’s pretend it’s a valid result. If you were for some reason expecting a perfect overlap between usability and user satisfaction, as Nielsen suggests everyone would, then an effect size of .28 should be a shocking result. Why not write that up instead? But, again, that’s only if you happen to share Nielsen’s expectation.
After all, a shared variance of 28% is actually pretty darn good for any correlation between an objective, behavioral metric and a subjective self-report measure, which is ultimately what he was looking at.
Personally, I don’t see why anyone would expect usability to be that strongly related to satisfaction, and I would be perfectly happy with an r of .53, but that’s just me.