Demand Accuracy in Your AI Tools: Lessons from Baymard Institute
Most AI-powered tools for UX lack reliability and accountability in their outputs. Demand transparency and proven accuracy, or don't buy it.
Kate Moran
January 30, 2026
AI-powered UX tools promise efficiency but frequently lack transparency about their accuracy or limitations. In this episode of the NN/g UX podcast, Baymard cofounders Christian and Jamie Holst discuss why higher standards are needed for AI in UX and other professional domains, and explain Baymard's approach to building its AI-powered ecommerce-evaluation tool, UX-Ray.
In This Article:
Meet Christian and Jamie Holst, Cofounders of Baymard Institute
AI Tools for UX Lack Accountability
"Pretty Good" Isn't Good Enough for Product Design
Small Details Matter Immensely
Accuracy Comes First: The UX-Ray Case Study
Use GenAI Only Where It's Reliable
Meet Christian and Jamie Holst, Cofounders of Baymard Institute
Christian and Jamie Holst are brothers and cofounders of the Baymard Institute, an independent ecommerce-specialized UX organization that has produced hundreds of practical guidelines for ecommerce design. Like NN/G, Baymard has a strong focus on research-backed design guidance, but its work specifically focuses on ecommerce.
Christian is Baymard's research director, overseeing all UX research. Jamie is CTO, responsible for Baymard's technical development and known for his writings at the intersection of technology and UX.
AI Tools for UX Lack Accountability
Many AI-powered UX tools promise heuristic evaluations, UX audits, or instant insights. However, most fail to disclose how accurate those outputs are, or when they should not be used. This problem was particularly egregious in 2023, and even as AI technology has improved, many tools marketed for UX work still lack reliability.
Christian and Jamie pointed out that most AI tools for UX fail to even mention (let alone actually measure and report) how their accuracy compares to human-produced outputs.
"The vast majority of tools don't even publish an accuracy rate. So, we can't even discuss whether it's high enough."
— Christian Holst
As an example, let's consider an AI tool that promises to "scan" your website, identify possible UX problems, and suggest solutions. On the surface, this sounds enticing — particularly to teams that have limited UX expertise or need to quickly evaluate many pages. What most tools like this hypothetical one fail to provide — and what many decision-makers are not even asking for — is transparency about their limitations.
An AI-powered tool designed to scan pages for identifiable UX problems is typically capable of detecting only some types of issues (e.g. poor visual contrast, inconsistent copy, or insufficient white space), rather than addressing all types of UX issues. GenAI-powered systems may struggle to identify deeper, subtler issues, like a fundamental mismatch between the site's information architecture and the expectations of its target audience. But how often have you seen such AI tools transparently acknowledge these kinds of limitations?
"Pretty Good" Isn't Good Enough for Product Design
In many cases, the limitations of AI UX tools are actually quantifiable. In 2023, when ChatGPT's GPT-4 model was released with the ability to process images, it became possible to use consumer AI tools to "audit" digital interfaces. Baymard's team put the new capability to the test by using GPT-4 to conduct a UX audit of 12 webpages and comparing the tool's output to the issues identified by human experts. GPT-4's performance was abysmal: it had a 20% accuracy rate, meaning 80% of the recommendations it made were false positives, and it discovered only 14% of the UX issues present in the input screens.
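The two figures in the study measure different things: the accuracy rate is the share of the tool's recommendations that point at real issues (precision), while the 14% figure is the share of real issues the tool managed to find (recall). A minimal sketch, using hypothetical raw counts chosen to be consistent with the reported rates (not Baymard's actual data or methodology):

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Share of the tool's recommendations that are real issues."""
    return true_positives / (true_positives + false_positives)

def recall(true_positives: int, issues_present: int) -> float:
    """Share of the real issues that the tool actually found."""
    return true_positives / issues_present

# Hypothetical counts: 10 valid recommendations out of 50 total,
# against 70 real issues present across the audited screens.
tp, fp, total_issues = 10, 40, 70

print(precision(tp, fp))                    # 0.2  -> 20% accuracy, 80% false positives
print(round(recall(tp, total_issues), 2))   # 0.14 -> found 14% of real issues
```

A tool can score well on one metric and terribly on the other, which is why a single vendor-published "accuracy" number, with no definition attached, tells you very little.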
Of course, since then, there have been substantial improvements in model architecture, training data, and explainability. Some newer AI tools now offer more detailed rationales for their outputs, allowing users to see why a particular recommendation was made. Additionally, advances in fine-tuning and prompt engineering allow models to be customized for specific domains, potentially increasing relevance and accuracy for specialized UX tasks.
In March 2025, two Microsoft researchers conducted a similar evaluation of four AI tools. They found that the tools had accuracy rates ranging from 50% to 70%. For the sake of argument, let's say that an AI-powered UX tool may now be capable of roughly a 70% accuracy rate.
Even this substantially higher accuracy rate could still be dangerous. Christian explained:
"At first glance, this might seem like an acceptable rate. I could easily imagine a CEO or CMO being presented with this and saying, 'Well, here's this cheap tool that is right about 70% of the time, we should just use it!'
The problem is that's actually a horrible value proposition. I think people who have been in the UX space a long time will recognize and understand that."
— Christian Holst
For example, suppose a heuristic-evaluation tool presents you with 10 suggestions to improve the experience. Seven of them are good recommendations… but three of them aren't. They may actively make your experience worse and decrease conversion. And you can't tell which recommendations are the good ones.
Listen in on more conversations with industry leaders. Subscribe to the NN/g UX Podcast on Spotify, YouTube, or your favorite podcast app.
Experienced UX professionals know that small design tweaks can have a major user and business impact. Christian and Jamie described how a recent client, a Fortune 500 apparel retailer, increased its conversion rate by 1% simply by replacing the dot indicators in its product-image carousel with thumbnail images, generating millions of dollars in revenue.
[Nike product page showing vertical thumbnail navigation on the left, allowing users to preview all available product images at a glance.]
Nike's product-detail pages use thumbnails to represent available product images. (This example is for illustration purposes only — Nike is not the anonymous client described in the story.)
It's easy to imagine the reverse scenario, where a small recommendation has a huge negative impact. What if the client in this example had asked an AI tool to evaluate its product pages? The tool might say something like, "Replace the thumbnails with carousel dots to reduce visual clutter." If the team followed that recommendation unquestioningly, it might inadvertently decrease the conversion rate without ever realizing why.
Accuracy Comes First: The UX-Ray Case Study
Frustrated by the proliferation of unreliable AI tools on the market, Christian and Jamie built their own, which they call UX-Ray.
[Baymard's UX-Ray interface displaying automated UX audit results with 95% accuracy, showing identified issues on an ecommerce site.]
Baymard's UX-Ray tool can identify where an ecommerce design appears to fail to adhere to some of its guidelines.
The tool specifically checks ecommerce sites against a subset of Baymard's hundreds of guidelines. At the time of writing, UX-Ray checks for only 154 guidelines, roughly 20–25% of the full set of 700+ ecommerce guidelines.
[...]