AI/ML @ AEM

Planning an AI Project is Planning (How) to Fail

Written by Stephan J. Lemmer, Ph.D. | Apr 10, 2025 11:33:52 PM

Public discussions around “broken” AI or “fixing” AI are in line with how computers have behaved throughout recent history: either you get what you need, or you get a result that obviously indicates a failure.

While this is an intuitive perspective, it is in direct contrast with how AI practitioners view ML-based tooling,[1] and we are not subtle about it.

Figure 1: Performance of various OpenAI models on the MMLU benchmark. Screen grab from https://openai.com/index/openai-o1-mini-advancing-cost-efficient-reasoning/.

Figure 2: Performance of various OpenAI models on the GPQA benchmark. Screen grab from https://openai.com/index/openai-o1-mini-advancing-cost-efficient-reasoning/.

Figures 1 and 2 show the performance of various OpenAI models on MMLU and GPQA, two challenging multiple-choice benchmarks spanning math, science, and other academic subjects. The results are astounding: experts pursuing PhDs in the relevant domains achieve 65% accuracy on the GPQA dataset, a full 12 points below that of o1.

However, o1 is still incorrect a full 22.7% of the time.

There are two things I want to point out here (and neither of them is “AI isn’t good”):

1) The problems matter: While o1 is incorrect on GPQA 22.7% of the time, it is only incorrect 7.7% of the time on MMLU.
2) Even though a 22.7% error rate is better than even expert humans, it may still not be acceptable in practical use (see the sketch below).
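To make that second point concrete, here is a minimal sketch in Python. The error rates are the benchmark numbers above; the daily query volumes are hypothetical, chosen only to show how a per-answer error rate becomes an absolute count of wrong answers that someone has to live with.

```python
# A minimal sketch: the same error rate looks very different at scale.
# The error rates are the benchmark numbers from Figures 1 and 2;
# the daily query volumes are hypothetical, purely for illustration.

error_rates = {"GPQA-like tasks": 0.227, "MMLU-like tasks": 0.077}
daily_queries = [100, 10_000, 1_000_000]  # hypothetical usage levels

for task, err in error_rates.items():
    for n in daily_queries:
        expected_errors = err * n
        print(f"{task}: {n:>9,} queries/day -> about {expected_errors:,.0f} incorrect answers/day")
```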

So what do we do about this? We accept, mitigate, and measure our failures.

Accepting the Inevitability of Failure

AI is typically discussed and evaluated through the lens of facts: the AI has solved this math problem incorrectly, or the AI has incorrectly identified the capital of France, or some similar case where there is a clear question and a clear answer. Domain-specific datasets---such as the MMLU and GPQA datasets discussed above---are proposed, developed, used to compare models, then thrown away as they become obsolete. This process is both useful and necessary[2], but it creates an incorrect belief that we will achieve a “perfect” AI in the near future.

In practice, we do not use ML-based AI methods strictly for fact retrieval. We expect these models to handle ambiguities in what they are being asked to do, how they are being asked to do it, and the various tradeoffs that are required in order to be useful.

There are a few different ways in which these tradeoffs occur. One of the most impactful is the use of ML-based AI tools to make judgments that are fundamentally subjective or uncertain. Consider the case of content moderation: some form of AI is necessary, as the scale of modern social media makes it impossible to review all the posts, tweets (exes?), and videos that are uploaded[3]. While content moderation looks like an easy task on the surface (a yes-or-no decision), it turns out that the definition of “acceptable” varies between communities. Even within educated, homogeneous communities, there is often disagreement between moderators. Similarly, ML tools are expected to make predictions about outcomes that are fundamentally uncertain---loan repayment, recidivism, or even the outcome of sporting events. There will always be incorrect answers due to the number of unknowable factors that can influence the outcome (the block was clean).
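One way to see how slippery “correct” is in these settings is to measure how often trusted human reviewers agree with each other. The sketch below uses invented moderator labels purely for illustration; on the items where the reviewers themselves disagree, there is no single answer for a model to match, no matter how capable it is.

```python
# A sketch of measuring moderator disagreement. The labels are invented
# for illustration: 1 = remove the post, 0 = keep it. On items where
# trusted reviewers disagree, there is no single answer a model can match.
from itertools import combinations

moderator_labels = {
    "moderator_a": [1, 0, 1, 1, 0, 0, 1, 0],
    "moderator_b": [1, 0, 0, 1, 0, 1, 1, 0],
    "moderator_c": [1, 1, 0, 1, 0, 0, 1, 0],
}

for a, b in combinations(moderator_labels, 2):
    matches = sum(x == y for x, y in zip(moderator_labels[a], moderator_labels[b]))
    agreement = matches / len(moderator_labels[a])
    print(f"{a} vs {b}: {agreement:.0%} agreement")
```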

Similarly, ambiguities in the problem statement may set the AI up for failure, particularly if the user is able to talk (or type) to the model. Consider Figure 3: when the model is asked to identify the third vase from the left, it is unclear whether the half-vase on the edge counts.[4]

Figure 3: An example of an ambiguous request---do we include the half-vase on the left? From the RefCOCO dataset.

In human-human interactions, we handle these kinds of ambiguities by asking for additional information. This is a promising (and challenging) avenue of research, but it introduces an important tradeoff: how ambiguous does a case need to be before we review it? Tradeoffs like this are everywhere in AI. Sometimes the tradeoffs can be set explicitly with mathematical assurances (some approaches to content moderation), sometimes you know the tradeoff but you can’t control it (any time you work with LLMs), and sometimes the concepts are too abstract to formally measure (common in safety and alignment training).
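As a rough illustration of that review threshold, here is a sketch with synthetic confidence scores standing in for whatever uncertainty signal your model actually provides. Raising the threshold sends more cases to a human and lowers the error rate on the cases the system handles on its own; where to sit on that curve is a decision about your problem, not about the model.

```python
# A sketch of the review-threshold tradeoff: predictions whose confidence
# falls below a cutoff are routed to a human reviewer. The (confidence,
# correct) pairs below are synthetic stand-ins for real model output.
import random

random.seed(0)

predictions = []
for _ in range(10_000):
    confidence = random.random()
    # In this toy model, higher confidence is loosely tied to correctness,
    # which is the basic property a usable confidence signal needs to have.
    correct = random.random() < 0.5 + 0.5 * confidence
    predictions.append((confidence, correct))

for threshold in (0.0, 0.5, 0.7, 0.9):
    automated = [(c, ok) for c, ok in predictions if c >= threshold]
    review_rate = 1 - len(automated) / len(predictions)
    error_rate = sum(not ok for _, ok in automated) / len(automated)
    print(f"threshold {threshold:.1f}: {review_rate:.0%} of cases sent to review, "
          f"{error_rate:.1%} errors among automated answers")
```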

Formalizing Your Problem

I like to say that the best AI is the best AI for your problem, but there is a challenge in knowing what your problem is. The general-purpose nature of instruction-tuned LLMs has dulled this challenge somewhat by making it feasible to experiment with small amounts of data and some creative prompting, but this is a double-edged sword. On the beneficial side: a tool that used to take months to build can now be created in a matter of hours without collecting large amounts of data or worrying about bespoke loss functions and tuning. On the downside: the limitations of non-LLM tools, the process of collecting datasets, and the selection of architectures used to force you to ask the crucial questions of what exactly am I doing and how do I know if I’m doing it right? Even though these questions are no longer strictly necessary to create an AI system that works, they remain critical in creating an AI system that performs well, is safe, and can handle the inevitable failures.

Measuring Failures

Although it is a safe assumption that your AI will eventually produce incorrect output, it is important to know how often this will occur. While there is a general (and not incorrect) belief that AI is getting better and better, performance still varies widely between tasks. Consider Figures 1 and 2 again: How comfortable do you feel saying to your manager, customer, or governance team, “I haven’t checked on our data, but on other data our system is correct between 53% and 89% of the time”?[5] For this reason, it is important to use data and metrics that resemble your real-world use case as closely as possible. Without this, you won’t be able to report how well your AI performs---a key component in governance, from our perspective---nor will you be able to improve it.
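What that reporting might look like, as a sketch: score the system on a sample that actually resembles your use case and report the measured accuracy with an uncertainty interval. The outcomes below are a hypothetical 170-of-200 correct, and the interval is a standard Wilson score interval; substitute your own data and preferred statistics.

```python
# A sketch of reporting measured performance on your own representative
# evaluation set, with an uncertainty interval, instead of quoting borrowed
# benchmark numbers. `outcomes` is a hypothetical 170-of-200 correct.
import math

outcomes = [True] * 170 + [False] * 30

n = len(outcomes)
p = sum(outcomes) / n

# 95% Wilson score interval: better behaved than the naive normal
# approximation when the sample is small or accuracy is near 0 or 1.
z = 1.96
denom = 1 + z**2 / n
center = (p + z**2 / (2 * n)) / denom
half_width = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))

print(f"measured accuracy: {p:.1%} "
      f"(95% CI roughly {center - half_width:.1%} to {center + half_width:.1%}, n={n})")
```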

Mitigating Failures

At this point, you know both what an incorrect answer looks like and how often it will occur. It’s time to ask what are the consequences of an incorrect answer and what do I do about it?

A popular approach is to suggest or require a human to review the results: I consider this a last resort.[6] Instead, an ideal solution will use the ML components in ways and places where the effect of an error is minimized. For example, extractive question answering tools are structured such that the answer comes verbatim from the source data and can be highlighted directly in context. Modern navigation software (such as Google Maps) is also clever in its use of ML techniques in concert with path planning algorithms that have theoretical guarantees. These estimators can improve (or degrade) overall performance by predicting things such as travel time, environmental impact, or an abstract notion of preferred route, but they rely on deterministic algorithms to handle the most important thing: making sure you reach your destination.
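Here is a hedged sketch of that division of labor, using an invented toy road network and a stand-in function where the learned travel-time estimator would go. The ML piece only supplies edge costs; a deterministic shortest-path search (Dijkstra's algorithm) guarantees that whatever route comes back actually connects origin to destination. A bad estimate costs you time, not correctness.

```python
# A sketch of the division of labor described above: a learned estimator
# supplies edge costs (here a stand-in function), while a deterministic
# shortest-path search guarantees the returned route is valid.
# The road graph and "predictor" are invented for illustration.
import heapq

def predicted_travel_time(edge):
    # Stand-in for an ML model: a bad estimate here yields a slower route,
    # but never an invalid one; the search below still returns a real path.
    baseline = {"A-B": 5, "B-C": 7, "A-C": 15, "C-D": 3, "B-D": 12}
    return baseline[edge]

graph = {  # adjacency list of a toy road network
    "A": ["B", "C"],
    "B": ["C", "D"],
    "C": ["D"],
    "D": [],
}

def shortest_path(start, goal):
    """Dijkstra's algorithm: deterministic, with a correctness guarantee."""
    queue = [(0, start, [start])]
    best = {}
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if node in best and best[node] <= cost:
            continue
        best[node] = cost
        for nxt in graph[node]:
            edge_cost = predicted_travel_time(f"{node}-{nxt}")
            heapq.heappush(queue, (cost + edge_cost, nxt, path + [nxt]))
    return None

print(shortest_path("A", "D"))  # -> (15, ['A', 'B', 'C', 'D'])
```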

[1] For clarity, I’m using ML throughout this article to refer to any method that can be thought of as being trained, including LLMs. I avoid using “AI” in this case (title excepted) because many algorithms considered AI in the broad sense have success guarantees. I consider ML to be a subfield of AI, and generative AI to be a subfield of ML.

[2] Standardized benchmarks and challenges have many downsides, but the ability to compare proposed methods in a “fair” way has been a major contributor to the past fifteen years or so of AI progress.

[3] Back of the envelope calculation: One estimate states that YouTube has 500 hours of video uploaded every minute. This means that to review every video without falling behind, YouTube would need 30,000 employees watching videos around the clock (what a job!).

[4] The person who wrote the request did not include it when counting.

[5] The results for GPT-4o.

[6] I have pretty strong feelings about this practice, as I believe it serves the function of shifting blame more than it serves the function of improving outcomes, but I will admit that practical alternatives in chatbot-style interfaces are hard to come by.

AEM's AI team stands out for our expertise in realizing the benefits of human-in-the-loop approaches in deep learning systems, and we offer capabilities across a range of traditional ML areas. Contact us at ai@aemcorp.com to explore challenges your team is facing.