A good AI demo is easy to fall in love with. You type a question, the system answers in clean prose, everyone in the room nods, and the project gets a green light. Then it goes quiet for three months, and the thing never ships.
This happens a lot. The demo looks like the work, but it is maybe a fifth of it. The other four fifths are the unglamorous foundations that decide whether real users can rely on the system every day: whether the answers are correct, whether you are allowed to handle the data the way you do, whether the system stays up, and whether it keeps the secrets it is supposed to keep.
This piece is about that gap. It is written for the person who has seen a promising prototype and now has to decide whether to put it in front of customers, a regulator, or their own staff. We will go through four foundations, use one of our own products as a running example, and end with where each piece fits.
We build and run dopomo.pl, a live multilingual AI assistant that helps migrants in Poland work through immigration procedures in their own language. It is our own product, so we feel every one of these problems directly. We come back to it throughout, because it is the clearest example we have of each foundation doing its job.
The demo lies a little
A demo runs on a happy path. You ask the questions it was built to answer, the data is clean, and nobody is trying to break it. Production is the opposite. Real users ask things you never anticipated, in bad grammar, in four languages, at 2 a.m., sometimes on purpose to see what happens.
That gap between the two worlds is where most AI projects stall. Not because the model is bad, but because nobody built the parts that turn a clever response into a system you can stand behind.
Four of those parts matter more than the rest.
Foundation 1: accuracy and grounding
A language model will answer almost anything, including questions it has no business answering. It does not know what it does not know. Ask it for a specific immigration deadline and it may hand you a confident, fluent, wrong number. In a demo that looks fine. In production it is a person missing a legal deadline because your software made something up.
The fix is not a better prompt. It is grounding: the system answers from a known, trusted set of sources instead of from the model's memory. That is what retrieval-augmented generation (RAG) does. You keep a knowledge base of documents you trust, the system finds the relevant passages for each question, and the answer is built from those passages. If the sources do not cover the question, the right behaviour is to say so rather than improvise.
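As a rough illustration of that behaviour, here is a minimal Python sketch, not our production code. It assumes a tiny in-memory knowledge base and uses plain word overlap as a stand-in for real semantic search. The names Passage, retrieve and answer, the min_overlap threshold, and the placeholder passages are all invented for the example.

```python
import re
from dataclasses import dataclass


@dataclass
class Passage:
    source: str  # hypothetical identifier of a trusted document
    text: str


def tokens(text: str) -> set[str]:
    # Lowercased words longer than three characters, a crude relevance signal.
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if len(w) > 3}


def retrieve(question: str, kb: list[Passage], min_overlap: int = 2) -> list[Passage]:
    # Word overlap stands in for a real retriever (assumption for this sketch).
    q = tokens(question)
    scored = [(len(q & tokens(p.text)), p) for p in kb]
    hits = [(score, p) for score, p in scored if score >= min_overlap]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [p for _, p in hits]


def answer(question: str, kb: list[Passage]) -> dict:
    passages = retrieve(question, kb)
    if not passages:
        # No trusted source covers the question: say so instead of improvising.
        return {"status": "no_source", "text": "I do not have a trusted source for that.", "sources": []}
    # A real system would pass these passages to the model; here we return the best one.
    return {"status": "grounded", "text": passages[0].text, "sources": [p.source for p in passages]}


# Placeholder content, not real legal information.
KB = [
    Passage("example-source-1", "Placeholder text: the sample permit application is submitted at the regional office."),
    Passage("example-source-2", "Placeholder text: the sample permit fee is paid by bank transfer."),
]

covered = answer("Where is the sample permit application submitted?", KB)
uncovered = answer("What is the weather in Lisbon tomorrow?", KB)
```

The design point is the explicit no_source branch: refusing is a normal, first-class outcome of the function, not an error path.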
Grounding only helps if you check that it actually happened. So we add a separate grounding check: after the model writes an answer, a second pass audits it against the retrieved sources and flags any claim that none of them support. An answer that wanders off its sources gets caught before a user sees it.
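To make the idea concrete, here is a deliberately crude Python sketch of such a second pass. It is an assumption-laden stand-in: a sentence is treated as supported when most of its content words appear in at least one retrieved source, whereas a real check would more likely use a second model or an entailment step. The function names and the 0.6 threshold are hypothetical.

```python
import re


def content_words(text: str) -> set[str]:
    # Lowercased words longer than three characters.
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if len(w) > 3}


def unsupported_claims(answer: str, sources: list[str], threshold: float = 0.6) -> list[str]:
    """Return the sentences of `answer` that no source sufficiently covers."""
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        words = content_words(sentence)
        if not words:
            continue  # nothing checkable in this sentence
        # Best coverage of this sentence by any single source (0.0 if there are none).
        best = max((len(words & content_words(s)) / len(words) for s in sources), default=0.0)
        if best < threshold:
            flagged.append(sentence)
    return flagged


# Placeholder texts, invented for the example.
sources = ["The sample form is submitted at the regional office."]
draft = "The sample form is submitted at the regional office. The deadline is fourteen days."
flagged = unsupported_claims(draft, sources)
```

Here the second sentence is flagged because nothing in the source mentions a deadline, which is exactly the kind of confident extra detail the check exists to catch.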
And you have to measure all of this, because "it seemed good in the demo" is not a quality bar. We run an evaluation harness: a fixed set of real questions with known good answers, scored automatically every time the system changes, as part of CI. When someone tweaks a prompt or swaps a model, we see whether retrieval quality went up or down before it ships, not after a user complains. We talk about accuracy in terms of what that harness reports, never as a guarantee, because no honest person guarantees a model's output.
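A harness like this can start very small. The Python sketch below assumes each test case names the source that should be retrieved for a question, computes a hit rate over the fixed set, and compares it with the previous baseline so CI can fail on a regression. The case format, the fake retriever, and the tolerance value are illustrative assumptions, not a description of our actual pipeline.

```python
from typing import Callable

# A fixed set of questions with the source each one should retrieve (invented data).
CASES = [
    {"question": "q1", "expected_source": "doc-a"},
    {"question": "q2", "expected_source": "doc-b"},
    {"question": "q3", "expected_source": "doc-c"},
    {"question": "q4", "expected_source": "doc-d"},
]


def hit_rate(cases: list[dict], retrieve: Callable[[str], list[str]], k: int = 3) -> float:
    """Share of cases whose expected source appears in the top-k retrieved sources."""
    hits = sum(1 for c in cases if c["expected_source"] in retrieve(c["question"])[:k])
    return hits / len(cases)


def passes_gate(score: float, baseline: float, tolerance: float = 0.02) -> bool:
    """True if the score has not dropped more than `tolerance` below the baseline."""
    return score >= baseline - tolerance


def fake_retrieve(question: str) -> list[str]:
    # Stand-in for the real retriever; it gets q4 wrong on purpose.
    table = {"q1": ["doc-a"], "q2": ["doc-x", "doc-b"], "q3": ["doc-c"], "q4": ["doc-z"]}
    return table.get(question, [])


score = hit_rate(CASES, fake_retrieve)
ok_to_ship = passes_gate(score, baseline=0.80)
```

In CI the last line would become a failing exit code, so a prompt tweak that quietly hurts retrieval blocks the release instead of reaching users.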
On dopomo.pl this is the whole game. Every answer is grounded in official Polish government sources. The grounding check audits responses for anything it cannot cite. The product gives people information, not legal decisions, and it refuses questions outside its scope (asylum claims, for example), referring the user to the right body instead of guessing. Legal content gets a paralegal sign-off before it ships. The evaluation harness runs in CI, so we know retrieval quality before each release.
Foundation 2: data residency and compliance
The moment your AI touches real people's data, a different set of questions arrives, and they are legal, not technical. Where does the data physically live? Who processes it? What happens when someone asks you to delete everything you hold about them?
For anyone serving EU users, the baseline is EU data residency: the data and the AI processing stay inside the EU instead of being shipped to a server in another jurisdiction. That is a real architectural decision, and it limits which providers and regions you can use. It is far easier to design in from the start than to retrofit later.
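One cheap way to keep that decision from eroding is to check it in code. The Python sketch below assumes every external service in the configuration declares the region it runs in, and reports any that fall outside an allow-list. The service names and region labels are hypothetical, not real provider identifiers.

```python
# Hypothetical region labels; substitute whatever your providers actually call them.
ALLOWED_REGIONS = frozenset({"eu-central", "eu-west"})


def residency_violations(services: dict[str, str]) -> list[str]:
    """Return the names of services whose declared region is not on the allow-list."""
    return sorted(name for name, region in services.items() if region not in ALLOWED_REGIONS)


# Invented configuration for illustration.
config = {
    "llm-provider-a": "eu-central",
    "llm-provider-b": "eu-west",
    "analytics": "us-east",
}

violations = residency_violations(config)
```

Run at startup or in CI, a non-empty result would stop the deployment, so a new integration cannot quietly move processing out of the EU.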
GDPR sets out the rest. You need to know what you collect and why, keep records of processing, run a data protection impact assessment where the processing is sensitive, and actually honour data-subject rights, including erasure. "We'll add that later" is how projects end up unable to answer a regulator.
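Erasure is a good example of why this has to be designed in. A request only counts as honoured if every store that holds the person's data is covered. The Python sketch below assumes each store exposes an erase method and that a single function fans the request out and returns a record of what was removed. The InMemoryStore class, store names, and record format are invented for illustration.

```python
from datetime import datetime, timezone


class InMemoryStore:
    """Stand-in for a database, log store, or index that holds personal data."""

    def __init__(self, name: str, rows: list[dict]):
        self.name = name
        self.rows = rows

    def erase(self, subject_id: str) -> int:
        # Drop every row belonging to the subject and report how many were removed.
        before = len(self.rows)
        self.rows = [r for r in self.rows if r["subject_id"] != subject_id]
        return before - len(self.rows)


def erase_subject(subject_id: str, stores: list[InMemoryStore]) -> dict:
    """Erase one person's data from every registered store and return an audit record."""
    removed = {store.name: store.erase(subject_id) for store in stores}
    return {
        "subject_id": subject_id,
        "removed": removed,
        "completed_at": datetime.now(timezone.utc).isoformat(),
    }


# Invented data for illustration.
chats = InMemoryStore("chat_history", [{"subject_id": "u1"}, {"subject_id": "u2"}, {"subject_id": "u1"}])
profiles = InMemoryStore("profiles", [{"subject_id": "u1"}, {"subject_id": "u2"}])

record = erase_subject("u1", [chats, profiles])
```

The useful property is the registry: adding a new store without adding it to the list passed to erase_subject is the mistake to guard against, and a single entry point makes that easier to review.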
The EU AI Act adds a layer on top. Most business AI is not high-risk, but you are expected to assess that and document the conclusion, not assume it. There is also a transparency duty: when a person is talking to an AI, they have to be told. That is a small thing to build and an expensive thing to forget.
On dopomo.pl, the data is processed in the EU across two EU-hosted AI providers. We completed a DPIA, keep records of processing, and support data-subject rights including erasure. We ran a formal EU AI Act assessment, which came out as not high-risk, and the chat tells users plainly that they are talking to an AI. Sensitive identifiers are encrypted at the field level. None of that shows up in a demo. All of it is the difference between a tool you can legally run and one you cannot.