It is commonly proposed that a human (or a group of humans) has the final say on any AI-based decision. While the methods and merits of this mitigation are debatable (a subject for another time), what is not debatable is the importance of appropriate trust: in order for human oversight to function, the individuals providing oversight must be given the appropriate tools and motivation to correct the model when it is wrong and to trust it when it is correct.
Establishing appropriate trust within a human-AI team is complex. The ability of such a team to function at a high level depends on a variety of factors: the task at hand, the human's and the AI's respective capabilities on that task, and the human's underlying beliefs about AI, to name a few.
In this post, we will highlight one factor: the way in which the model communicates its reasoning and confidence to the user.
Confidence and Explainability
By knowing how sure our model is (confidence) and how it came to a conclusion (explainability), the user can decide when to trust the model and when to review its output. The notion of confidence is pretty straightforward: the model should accurately communicate---via words or numbers---how likely it is to be correct. Explainability, on the other hand, is a bit more nuanced, as it consists of a balance between faithfulness---how true an explanation is---and interpretability---how understandable an explanation is to the human operator.
Figure 1: An image showing which pixels, if changed, would most affect the output of the word "man." From Women Also Snowboard: Overcoming Bias in Captioning Models.
At the extremes of these two factors you get useless explanations: a perfectly faithful explanation might be "because I multiplied these 100 billion numbers with these 2,000 numbers in this order," while a perfectly interpretable answer would be "I classified this as a picture of a dog because there is a dog in the picture."
The goal of an explainability method is to strike a useful balance. For example, Figure 1 highlights locations that strongly contribute to the generation of the word "man." The graphic is neither perfectly faithful nor perfectly understandable: negative influences on the word man and positive influences on other words aren't included, and there is a meaningful distinction between "this is the predominant contributor in a partial linearization of the deep network"[1] and "this is why the decision was made."
Despite these shortcomings, it is still useful: for someone reviewing model outputs, it would signal the use of irrelevant data and a need for further scrutiny, while for an ML practitioner it would indicate a spurious correlation that needs to be removed.
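To make the attribution idea concrete, here is a minimal sketch of gradient-based saliency for an off-the-shelf image classifier. It assumes PyTorch and torchvision are installed and uses a placeholder image path; it is not the exact technique behind Figure 1, but it produces the same kind of "which pixels matter" map.

```python
# A minimal gradient-saliency sketch (assumptions: PyTorch + torchvision are
# available, "example.jpg" is a placeholder image path). This is not the
# method from the cited paper, just a generic attribution example.
from torchvision import models
from PIL import Image

weights = models.ResNet50_Weights.DEFAULT
model = models.resnet50(weights=weights).eval()
preprocess = weights.transforms()

image = preprocess(Image.open("example.jpg")).unsqueeze(0)  # (1, 3, H, W)
image.requires_grad_(True)

logits = model(image)
target = logits.argmax(dim=1).item()  # the class whose score we explain

# Gradient of the target score with respect to the input pixels:
# large magnitudes mark pixels whose change most affects that score.
logits[0, target].backward()
saliency = image.grad.abs().max(dim=1).values.squeeze(0)  # (H, W) heat map
```

Visualizing the saliency tensor as a heat map over the original image yields a figure in the spirit of Figure 1.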
Because of the wide variety of applications, there is no single best method for balancing faithfulness and interpretability: while the previous method tells you what is important to the model, it does not definitively tell you whether removing the tennis racquet in Figure 1 would change the answer.
An alternate explainability approach---most commonly used for a small number of inputs---is counterfactual explanation, which explicitly states how to change the input such that the decision would change. An example of this would be: "if your income had been $10,000 higher, your mortgage would have been approved."
Although this approach is highly interpretable and mostly faithful, it has one remaining shortcoming related to faithfulness: for any given decision, there may be an infinite number of valid counterfactuals. For example, in the simple case of mortgage approval based on loan amount and income, valid counterfactuals may be: 1) if your income had been $10,000 higher…; 2) if your loan amount had been $30,000 lower…; 3) if your income had been $5,000 higher and your loan amount $15,000 lower…; 4) if your income had been $2,500 higher and your loan amount $7,500 lower…; and so on. Some criteria must be chosen to decide which counterfactuals to show, potentially hiding information the user would find important.
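As a toy illustration of the mortgage example, the sketch below searches for the smallest income increase that flips a hypothetical approval model's decision. The model, the training data, and the dollar amounts are all illustrative assumptions, not a recipe.

```python
# Counterfactual sketch for a hypothetical approval model: find the smallest
# income increase that flips a rejection into an approval. Toy data, toy model.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Features: [income, loan_amount]; label: approved (1) / rejected (0)
X = np.array([[50_000, 200_000], [90_000, 250_000],
              [60_000, 400_000], [120_000, 300_000]])
y = np.array([0, 1, 0, 1])
model = make_pipeline(StandardScaler(), LogisticRegression()).fit(X, y)

def income_counterfactual(applicant, step=1_000, max_raise=200_000):
    """Smallest income increase (in `step` increments) that changes the
    decision to approved, or None if no increase up to max_raise works."""
    for delta in range(step, max_raise + step, step):
        candidate = applicant.copy()
        candidate[0] += delta
        if model.predict([candidate])[0] == 1:
            return delta
    return None

applicant = np.array([55_000.0, 350_000.0])  # currently rejected
delta = income_counterfactual(applicant)
if delta is not None:
    print(f"If your income had been ${delta:,} higher, "
          "your mortgage would have been approved.")
```

Note that choosing to search only along the income axis (rather than lowering the loan amount, or moving both at once) is exactly the kind of criterion that determines which counterfactual the user gets to see.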
Basically, explanations are hard.
Figure 2: Because LLMs do not have access to their internal state, they rely on mimicking human outputs to produce confidence values. This results in outputs that are plausible, but not faithful---a dangerous combination. Note that this example reflects LLM capabilities at the time of writing; models have since been trained on this specific type of problem, but the core principle holds.
How LLMs Communicate Confidence and Explainability
An intuitive interaction when an LLM gives a confusing answer is to use natural language to ask how sure it is (confidence) or to ask for more information (an explanation), as in Figure 2. These explanations and confidence values will be both interpretable and plausible, typically leaving the user satisfied and reinforcing automation bias (the tendency to over-trust the model).
However, neither the explanation nor the confidence value is faithful: LLMs give an answer that you will understand and believe (because that is what they are, first and foremost, trained to do), but not necessarily one that is true. For example, if you include "provide your answer and your confidence in this answer" in your prompt, GPT-4 would be wrong about half the time it says it is 95% sure (Figure 3).[2]
Figure 3: While LLMs can produce believable confidence measures, these correlate poorly with actual accuracy. From Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs.
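As a rough sketch of how you might check this for your own application, the snippet below bins verbalized confidence values against actual correctness. The `results` list is hypothetical, standing in for (stated confidence, was-the-answer-correct) pairs collected from your model on a labeled test set.

```python
# Group answers by the model's stated confidence and compare each confidence
# level with its measured accuracy. `results` holds hypothetical
# (stated_confidence_percent, was_correct) pairs.
from collections import defaultdict

results = [(95, True), (95, False), (80, True), (60, False), (95, True),
           (100, True), (80, False), (100, False), (60, True), (95, False)]

bins = defaultdict(list)
for confidence, correct in results:
    bins[confidence].append(correct)

for confidence in sorted(bins):
    outcomes = bins[confidence]
    accuracy = sum(outcomes) / len(outcomes)
    print(f"stated {confidence:>3}% -> measured accuracy {accuracy:.0%} "
          f"({len(outcomes)} answers)")
```

For a well-calibrated model, the stated confidence and the measured accuracy in each bin would roughly match.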
One plausible reason for this is simply the nature of LLMs as next-word predictors trained on internet data: people generally express confidence using numbers rounded to the nearest 5%, and often for emphasis more than truth. For example, when you can't find your keys, you might say "I am 100% sure I put the keys back on the hook," even if you didn't. The model doesn't learn to express its confidence: it learns that people like to say 100%.
A similar problem affects explainability: the model produces plausible outputs when asked for its reasoning because it has seen many cases of a person justifying an answer they have given or providing reasoning before giving their answer. However, since LLMs do not have access to their own internal state---the way they actually made the decision---they are saying what they would expect a human to say, not providing a faithful reflection of their thought process. This has been highlighted in some concerning ways: one work found that, when switching sensitive features such as race and gender, the model would change its answer to align with a biased response and generate a new---plausibly unbiased---rationale for the new response (Figure 4).
Figure 4: Although information unrelated to race and gender remains unchanged, both the blue and red inputs result in the answer (C) with a plausible rationale. From Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting.
What Does This Mean for My Software?
Explainability and confidence techniques are meant to help both individual users and AI practitioners diagnose problems and intervene when necessary. Through this lens, the combination of high interpretability, high plausibility, and low faithfulness of explanations in LLMs is a worst-case scenario for both human-AI teaming and responsible AI.
It is reasonable to feel discouraged by this but, like most AI problems, handling these challenges depends on your specific needs, goals, and tolerance for risk. Are you brainstorming ideas for a fantasy novel? Go nuts with an LLM. Are you reviewing mortgage applications? A large body of research exists on explainable and interpretable non-LLM techniques[3]; consider one of those even if it is slightly less accurate.
Are you reviewing fantasy novels? There are worse things in life than reading a bad book, but maybe don’t give it to your kid without reading it first.
[1] Don’t worry about it.
[2] Specific numbers depend on when that paper was written, the dataset used, the specific prompt, etc., so take that into consideration.
[3] As with many things in AI, specifics matter: GPT-4 is an important benchmark because of its widespread adoption, but some combinations of model and dataset have shown good calibration, and some works explicitly train models to output confidence estimates. Broadly, the message is not to take these values at face value.
AEM's AI team stands out for our expertise in realizing the benefits of human-in-the-loop approaches in deep learning systems, and we offer capabilities across a range of traditional ML areas. Contact us at ai@aemcorp.com to explore the challenges your team is facing.