Our federal workforce and our contractors – that’s hundreds of thousands of people – are using LLM-powered products for tasks those products can’t do well. The products still give them answers, and users assume those answers are correct.
You don’t want bad data in your solicitations, in your communications to the public, and in your reports to Congress. The products tell you to double-check, but do you? Or can you?
Leading LLMs, including state-of-the-art reasoning models, sound correct, but they DO NOT understand their own confidence.
It’s important to stress that they sound confident whether they subtly misstate an answer or reach an entirely incorrect conclusion from correct and complete inputs. This applies to analyzing documents, summarizing comments, asking policy questions, and more. Even in reasoning models, the chain of thought is not necessarily correct. LLMs sound confident even when they are wrong.
There’s another, related issue. Most LLM-powered products make no attempt to communicate their limitations.
You have an open text input field and you get a response. You are expected to develop, through repeated use, an intuition for what these products can and can’t do. That takes hundreds of hours of manual checking and data tracking, and even if it works well a few times, you may incorrectly assume it will keep working as your data, task, or models change slightly. There’s no advance guidance and no guardrails if you use the product incorrectly or are given bad results.
These systems are being used by people who are working hard on deadlines. Even where the answers may inform major decisions, there is a real risk that the user will simply trust the AI and not do the actual work needed to double-check the LLM’s outputs.
For this reason, education is not a substitute for better-designed AI systems. We think the real fix at scale is at the AI model and product level.
Read the full remarks at our AI blog: https://www.aemcorp.com/ai/blog/remarks-for-data-foundation-adding-a-measure-of-confidence-to-ai