Remarks for Data Foundation: Adding a Measure of Confidence to AI

The following remarks were prepared for the Data Foundation event "Data Policy in the Age of Artificial Intelligence" on August 20, 2025.

Good afternoon. I lead the AI team for AEM Corporation. We work with several Chief Data Officers and Chief AI Officers on AI adoption.

Thank you to the Data Foundation for arranging this event. I am, and our team is, as excited as anyone about the potential of AI.

I have been asked to highlight a challenge today and invite you to reconsider how you use large language models. I hope you will think differently about how you approach building and buying AI systems. If I’m successful, you may also consider policy changes for what we require of our AI systems in the future.

Here’s the challenge. Our federal workforce and our contractors – that’s hundreds of thousands of people – are using LLM-powered products for tasks those products can’t do well. The products still give them answers, and the users assume those answers are correct.

Of course, you don’t want bad data in your solicitations, in your communications to the public, and in your reports to Congress. The products tell you to double-check, but do you? Or can you?

Leading LLMs, including state-of-the-art reasoning models, sound correct, but they DO NOT understand their own confidence. It’s important to stress that they sound confident whether they subtly misstate an answer or reach an entirely incorrect conclusion from correct and complete inputs. This applies to analyzing documents, summarizing comments, asking policy questions, and more. Even in reasoning models, the chain of thought is not necessarily correct. LLMs sound confident even when they are wrong.

There’s another, related issue. Most LLM-powered products make no attempt to communicate their limitations. You have an open text input field and you get a response. You are expected to develop an intuition, through repeated use, for what they can and can’t do. This takes hundreds of hours of manual checking and data tracking, and even if it works well a few times, you may incorrectly assume it will work indefinitely as your data, task, or models change slightly. There’s no advance guidance and no guardrails if you use a product incorrectly or are given bad results.

These systems are being used by people who are working hard on deadlines. Even where the answers may inform major decisions, there is a real risk that the user will simply trust the AI and not do the actual work needed to double-check the LLM’s outputs.

For this reason, education is not a substitute for better-designed AI systems. We think the real fix at scale is at the AI model and product level.

And we make three suggestions:

First, we should generally reexamine what types of LLM outputs we want from the systems we build for government use cases. Currently, evaluating LLMs is not as straightforward as evaluating classical ML models because we expect long-form outputs. With long answers, we lose the ability to do unit testing and conduct large-scale automated testing. Instead, consider constraining outputs so that the system produces short answers that can be reliably evaluated by humans or automated systems. Do you really need long-form, conversational outputs? Are you willing to accept not knowing how accurate the outputs are?
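To make this concrete, here is a minimal sketch, assuming only a generic prompt-in, text-out LLM call and a small hand-labeled test set. The helper names and test cases below are illustrative, not tied to any particular product. Constraining answers to a few words is what makes exact-match, automated evaluation possible.

```python
from typing import Callable

# Illustrative test cases: (context excerpt, question, expected short answer)
TEST_CASES = [
    ("Applications must be filed by June 30.", "What is the filing deadline?", "june 30"),
    ("The program covers adults age 65 and older.", "What is the minimum eligible age?", "65"),
]

def ask_constrained(ask: Callable[[str], str], context: str, question: str) -> str:
    """Request a short, directly checkable answer instead of a long narrative."""
    prompt = (
        f"Context: {context}\n"
        f"Question: {question}\n"
        "Answer in five words or fewer, using only text from the context."
    )
    return ask(prompt).strip().lower()

def evaluate(ask: Callable[[str], str], cases=TEST_CASES) -> float:
    """Exact-match scoring is only practical because the outputs are constrained."""
    correct = sum(
        ask_constrained(ask, ctx, question) == expected
        for ctx, question, expected in cases
    )
    return correct / len(cases)
```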

Second, we should expect confidence measures from our LLM-based systems. When you constrain your outputs, you can better inform the internal workflows of AI systems and share confidence insights with your users. If answer confidence is 60% or even 85%, that invites the user to understand that the output may not be usable in the way they expect.

There are many different approaches to calculating confidence.
1) Some are based on monitoring the internal state of the system. You need to host the large language model, but you don’t need to do secondary inferences or additional calculations.
2) Some are black box. You can sample repeated answers and manipulate the outputs to get calibrated probabilities, or measure the variance across samples and map it to a probability (a minimal sketch follows this list).
3) There’s also the possibility of training LLMs to provide accurate confidence scores.
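As an illustration of the black-box approach in item 2, here is a minimal sketch, assuming only a generic prompt-in, text-out LLM call: re-ask the same constrained question several times and treat the agreement rate of the most common answer as a rough confidence signal. Properly calibrated probabilities take more care, but the shape of the workflow is the same.

```python
from collections import Counter
from typing import Callable

def sampled_confidence(
    ask: Callable[[str], str],  # any black-box LLM call: prompt in, answer text out
    prompt: str,
    n_samples: int = 10,
) -> tuple[str, float]:
    """Black-box sketch: sample the same constrained question several times and
    treat agreement among the samples as a rough confidence score."""
    answers = [ask(prompt).strip().lower() for _ in range(n_samples)]
    top_answer, count = Counter(answers).most_common(1)[0]
    return top_answer, count / n_samples

# Example: a confidence of 0.6 on a 10-sample run means only 6 of 10 samples
# agreed, which is a signal to the user to double-check before relying on it.
```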

This work is not simple, but it’s important. To do it correctly, you need to understand complex probability math and typically need to host an entailment model. But these are the skills and investments that make the difference between a chatbot that confidently shares an incorrect answer with a SNAP applicant and a chatbot that tells you it has medium confidence in its answer and invites you to double-check it against traceable sources.
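For the entailment piece, here is a minimal sketch, assuming you have wrapped an entailment (NLI) model as a simple pairwise check; the entails function below is a stand-in, not a specific library call. It groups sampled answers that say the same thing in different words and reports the share of samples in the largest group as a rough confidence score.

```python
from typing import Callable

def semantic_confidence(
    answers: list[str],
    entails: Callable[[str, str], bool],  # stand-in for a hosted NLI/entailment model
) -> tuple[str, float]:
    """Entailment-based sketch: cluster sampled answers that mutually entail each
    other (i.e., say the same thing in different words), then report the share of
    samples in the largest cluster as a rough confidence score."""
    clusters: list[list[str]] = []
    for answer in answers:
        for cluster in clusters:
            representative = cluster[0]
            if entails(answer, representative) and entails(representative, answer):
                cluster.append(answer)
                break
        else:
            clusters.append([answer])
    largest = max(clusters, key=len)
    return largest[0], len(largest) / len(answers)
```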

Our third recommendation is that we should favor buying products and services that communicate what they can do. We’re just about past the honeymoon phase of LLMs. Decide what your users should be able to do and tell them what they can’t. Communicate in advance what you can and will do with their inputs. Tell your users what outputs they need to double-check and give them the specific tools to do that work.

We would be happy to talk further with interested agencies and policymakers about these recommendations and to demonstrate the commercial tools we’ve built that show answer confidence for chatbot questions and for large-scale textual data analysis, like surveys. Feel free to reach out to ai@aemcorp.com.

Thank you again for your time.

AEM's AI team stands out for our expertise in realizing the benefits of human-in-the-loop approaches in deep learning systems, and we offer capabilities across a range of traditional ML areas. Contact us at ai@aemcorp.com to explore challenges your team is facing.
