Strategy · 11 min read

Why Most AI Demos Never Reach Production, and the Foundations That Get Them Live

An impressive AI demo is the easy part. Here is the real gap between a demo and a system you can put in front of users: accuracy, compliance, reliability, and security.

June 27, 2026

A good AI demo is easy to fall in love with. You type a question, the system answers in clean prose, everyone in the room nods, and the project gets a green light. Then it goes quiet for three months, and the thing never ships.

This happens a lot. The demo looks like the work, but it is maybe a fifth of it. The other four fifths are the unglamorous foundations that decide whether real users can rely on the system every day: whether the answers are correct, whether you are allowed to handle the data the way you do, whether the system stays up, and whether it keeps the secrets it is supposed to keep.

This piece is about that gap. It is written for the person who has seen a promising prototype and now has to decide whether to put it in front of customers, a regulator, or their own staff. We will go through four foundations, use one of our own products as a running example, and end with where each piece fits.

We build and run dopomo.pl, a live multilingual AI assistant that helps migrants in Poland work through immigration procedures in their own language. It is our own product, so we feel every one of these problems directly. We come back to it throughout, because it is the clearest example we have of each foundation doing its job.

The demo lies a little

A demo runs on a happy path. You ask the questions it was built to answer, the data is clean, and nobody is trying to break it. Production is the opposite. Real users ask things you never anticipated, in bad grammar, in four languages, at 2 a.m., and sometimes just to see what breaks.

That gap between the two worlds is where most AI projects stall. Not because the model is bad, but because nobody built the parts that turn a clever response into a system you can stand behind.

Four of those parts matter more than the rest.

Foundation 1: accuracy and grounding

A language model will answer almost anything, including questions it has no business answering. It does not know what it does not know. Ask it for a specific immigration deadline and it may hand you a confident, fluent, wrong number. In a demo that looks fine. In production it is a person missing a legal deadline because your software made something up.

The fix is not a better prompt. It is grounding: the system answers from a known, trusted set of sources instead of from the model's memory. That is what retrieval-augmented generation (RAG) does. You keep a knowledge base of documents you trust, the system finds the relevant passages for each question, and the answer is built from those passages. If the sources do not cover the question, the right behaviour is to say so rather than improvise.
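A minimal sketch of that behaviour in Python, under loud assumptions: the knowledge base is two placeholder passages, plain word overlap stands in for a real search index, and every name here (Passage, retrieve, answer, min_overlap) is hypothetical. It is not how any particular product is built. The part worth noticing is the refusal branch.

```python
from dataclasses import dataclass


@dataclass
class Passage:
    source: str  # identifier of the trusted document (hypothetical)
    text: str


def tokenize(text: str) -> set[str]:
    # Crude tokenizer: lowercase, strip punctuation, drop very short words.
    words = (w.strip(".,?!:;()").lower() for w in text.split())
    return {w for w in words if len(w) > 3}


def retrieve(question: str, knowledge_base: list[Passage], min_overlap: int = 2) -> list[Passage]:
    # Word overlap is a stand-in for a real retriever (vector or keyword search).
    q = tokenize(question)
    scored = [(len(q & tokenize(p.text)), p) for p in knowledge_base]
    hits = [(score, p) for score, p in scored if score >= min_overlap]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [p for _, p in hits]


def answer(question: str, knowledge_base: list[Passage]) -> dict:
    passages = retrieve(question, knowledge_base)
    if not passages:
        # Nothing in the trusted sources covers this: say so, do not improvise.
        return {"answered": False, "text": "I can't find this in my sources.", "sources": []}
    # A real system would hand the passages to a model as context here.
    # This sketch just returns the best passage with its source.
    top = passages[0]
    return {"answered": True, "text": top.text, "sources": [top.source]}


# Placeholder knowledge base, invented purely for illustration.
KB = [
    Passage("example-source-1", "Placeholder text: the example permit application must include a completed form and a photo."),
    Passage("example-source-2", "Placeholder text: office opening hours are listed on the example office website."),
]

covered = answer("What must the permit application include?", KB)
uncovered = answer("How do I appeal a parking fine?", KB)
```

Here `covered` carries a source the reader can check, while `uncovered` comes back with `answered` set to False and no sources, which is the outcome you want when the knowledge base is silent.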

Grounding only helps if you check that it actually happened. So we add a separate grounding check: after the model writes an answer, a second pass audits it against the retrieved sources and flags claims nothing supports. An answer that wanders off its sources gets caught before a user sees it.
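To make the idea of a second pass concrete, here is a deliberately crude sketch. It assumes the audit can be approximated by word overlap between each sentence of the answer and the retrieved passages, which is far weaker than a real check (those typically use a model or an entailment step). The function names and the 0.6 threshold are hypothetical.

```python
import re


def content_words(text: str) -> set[str]:
    # Lowercased alphanumeric words longer than three characters.
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if len(w) > 3}


def unsupported_claims(answer: str, passages: list[str], threshold: float = 0.6) -> list[str]:
    """Return the sentences of `answer` that no passage sufficiently supports."""
    flagged = []
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    for sentence in sentences:
        words = content_words(sentence)
        if not words:
            continue  # nothing checkable in this sentence
        # Best share of the sentence's words found in any single passage.
        best = max(
            (len(words & content_words(p)) / len(words) for p in passages),
            default=0.0,
        )
        if best < threshold:
            flagged.append(sentence)
    return flagged


# Invented example data.
sources = ["The example form must be signed and submitted with one photo."]
draft = "The example form must be signed. The deadline is exactly fourteen days."
flagged = unsupported_claims(draft, sources)
```

In this example the second sentence is flagged, because nothing in the sources mentions a deadline. A flagged answer can then be blocked, rewritten, or sent for review before a user sees it.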

And you have to measure all of this, because "it seemed good in the demo" is not a quality bar. We run an evaluation harness: a fixed set of real questions with known good answers, scored automatically every time the system changes, as part of CI. When someone tweaks a prompt or swaps a model, we see whether retrieval quality went up or down before it ships, not after a user complains. We talk about accuracy in terms of what that harness reports, never as a guarantee, because no honest person guarantees a model's output.
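A harness like that can start very small. The sketch below assumes the metric is a retrieval hit rate (did the expected source appear in the top k results?) and that a release is blocked when the score falls below a stored baseline. The evaluation set, the fake retriever, and the tolerance value are all invented for illustration.

```python
from typing import Callable

# Fixed evaluation set: placeholder questions with the source we expect to see.
EVAL_SET = [
    {"question": "placeholder question 1", "expected_source": "doc-a"},
    {"question": "placeholder question 2", "expected_source": "doc-b"},
    {"question": "placeholder question 3", "expected_source": "doc-c"},
]


def hit_rate(retrieve: Callable[[str], list[str]], cases: list[dict], k: int = 3) -> float:
    """Share of cases whose expected source appears in the top-k retrieved ids."""
    hits = sum(1 for c in cases if c["expected_source"] in retrieve(c["question"])[:k])
    return hits / len(cases)


def passes_baseline(score: float, baseline: float, tolerance: float = 0.02) -> bool:
    """True if the score has not dropped more than `tolerance` below the baseline."""
    return score >= baseline - tolerance


def fake_retrieve(question: str) -> list[str]:
    # Stand-in for the real retriever so the sketch runs on its own.
    index = {
        "placeholder question 1": ["doc-a", "doc-c"],
        "placeholder question 2": ["doc-c", "doc-b"],
        "placeholder question 3": ["doc-a"],
    }
    return index.get(question, [])


score = hit_rate(fake_retrieve, EVAL_SET)
ok_to_ship = passes_baseline(score, baseline=0.9)
```

In CI, a job would run this against the real retriever and fail the build when `passes_baseline` returns False, so a prompt tweak or model swap that hurts retrieval is caught before release.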

On dopomo.pl this is the whole game. Every answer is grounded in official Polish government sources. The grounding check audits responses for anything it cannot cite. The product gives people information, not legal decisions, and it refuses questions outside its scope (asylum claims, for example), referring the user to the right body instead of guessing. Legal content gets a paralegal sign-off before it ships. The evaluation harness runs in CI, so we know retrieval quality before each release.

Foundation 2: data residency and compliance

The moment your AI touches real people's data, a different set of questions arrives, and they are legal, not technical. Where does the data physically live? Who processes it? What happens when someone asks you to delete everything you hold about them?

For anyone serving EU users, the baseline is EU data residency: the data and the AI processing stay inside the EU instead of being shipped to a server in another jurisdiction. That is a real architectural decision, and it limits which providers and regions you can use. It is far easier to design in from the start than to retrofit later.

GDPR sets out the rest. You need to know what you collect and why, keep records of processing, run a data protection impact assessment where the processing is sensitive, and actually honour data-subject rights, including erasure. "We'll add that later" is how projects end up unable to answer a regulator.

The EU AI Act adds a layer on top. Most business AI is not high-risk, but you are expected to assess that and document the conclusion, not assume it. There is also a transparency duty: when a person is talking to an AI, they have to be told. That is a small thing to build and an expensive thing to forget.

On dopomo.pl, the data is processed in the EU across two EU-hosted AI providers. We completed a DPIA, keep records of processing, and support data-subject rights including erasure. We ran a formal EU AI Act assessment, which came out as not high-risk, and the chat tells users plainly that they are talking to an AI. Sensitive identifiers are encrypted at the field level. None of that shows up in a demo. All of it is the difference between a tool you can legally run and one you cannot.

Foundation 3: reliability and monitoring

A demo only has to work once, while you are watching. A production system has to work when you are asleep, and you have to find out the moment it stops.

That is what "operating" an AI system actually means, and it is the part people underestimate most. Models drift. A provider has an outage. Latency creeps up until the experience feels broken. A retrieval index goes stale. None of this announces itself. You either have instrumentation that tells you, or you hear about it from an angry user.

So you need observability: traces and logs that let you reconstruct what happened on a specific request, with personal data scrubbed out. You need alerting, so a spike in errors or latency wakes a human instead of sitting in a dashboard nobody is watching. And you need performance budgets, explicit limits on how slow a page or a response is allowed to get, enforced automatically so quality does not erode one small regression at a time.
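As one small piece of that, here is a sketch of scrubbing personal data before a log line is written, using Python's standard logging module. The two patterns (email addresses and digit runs of nine or more) are assumptions chosen for illustration. A real deployment needs a list of patterns that matches the data it actually handles, and the names scrub and ScrubFilter are hypothetical.

```python
import logging
import re

# Illustrative patterns only; real systems need a reviewed, broader list.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
LONG_NUMBER = re.compile(r"\b\d{9,}\b")  # e.g. phone or ID-like digit runs


def scrub(text: str) -> str:
    """Replace email addresses and long digit runs with placeholders."""
    text = EMAIL.sub("[email]", text)
    return LONG_NUMBER.sub("[number]", text)


class ScrubFilter(logging.Filter):
    """Logging filter that scrubs the fully formatted message of each record."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Format first, so data passed as arguments is scrubbed too.
        record.msg = scrub(record.getMessage())
        record.args = ()
        return True  # keep the record, now scrubbed


logger = logging.getLogger("assistant")
logger.addFilter(ScrubFilter())

example = scrub("user anna@example.com id 12345678901 failed")
```

With the filter attached, `logger.warning("lookup failed for %s", email)` records the placeholder instead of the address, so traces stay useful for debugging without carrying the personal data.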

On dopomo.pl we run error tracking with personal data scrubbed, distributed tracing through OpenTelemetry, and CloudWatch alarms and dashboards for the infrastructure. Performance budgets are enforced with Lighthouse checks so the experience does not quietly degrade. Operating the system is not a phase after launch. It is the job.

Foundation 4: security

Everything above assumes the data you are protecting stays protected. Security is the foundation the other three sit on, and AI systems hold exactly the data attackers want: personal details, documents, identifiers, sometimes the most sensitive facts about someone's life.

The principles here are not new, but AI raises the stakes because these systems ingest so much. Encrypt sensitive data at rest and in transit, and encrypt the most sensitive fields individually so a single leak does not expose everything. Apply least privilege: every component and person gets the minimum access they need and nothing more. Manage secrets properly, with rotation, instead of leaving API keys in a config file. Keep the infrastructure itself defined in code, so its security posture can be reviewed rather than assembled by hand and forgotten.
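Least privilege is easiest to see as a default-deny allow-list. The sketch below is an illustration in application code with invented component and action names. On a cloud platform the same idea normally lives in access policies defined alongside the infrastructure, not in a Python dictionary.

```python
# Hypothetical components and actions, invented for illustration.
POLICY: dict[str, frozenset[str]] = {
    "chat-api": frozenset({"knowledge_base:read", "conversations:write"}),
    "indexer": frozenset({"knowledge_base:read", "knowledge_base:write"}),
}


def is_allowed(component: str, action: str, policy: dict[str, frozenset[str]] = POLICY) -> bool:
    """Default deny: an action is permitted only if explicitly listed."""
    return action in policy.get(component, frozenset())


chat_can_read = is_allowed("chat-api", "knowledge_base:read")
chat_can_write_kb = is_allowed("chat-api", "knowledge_base:write")
unknown_can_read = is_allowed("batch-job", "knowledge_base:read")
```

The chat component can read the knowledge base but cannot modify it, and a component missing from the policy gets nothing. Because the policy is plain data, it can be reviewed in a pull request like any other change.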

On dopomo.pl the infrastructure runs on AWS, defined with Terraform, with managed Postgres and a vector database, a CDN, secrets management with rotation, and field-level encryption of sensitive identifiers. A buyer never sees this layer. It is the reason the rest is safe to run.

Why a demo is cheap and a system is not

Put those four together and you can see why the jump from demo to production is so much larger than it looks. The demo proves the idea is possible. The foundations prove it is safe, legal, reliable, and correct enough to put in front of someone whose deadline, money, or immigration status depends on the answer.

That is also the honest reason AI projects stall. The demo gets built, everyone is excited, and then the work that does not demo well (the grounding checks, the DPIA, the alerting, the encryption) runs out of budget or patience. The system never crosses the line.

How this maps to the way we work

We organise our work around six areas, and the path from demo to production touches all of them.

AI products and solutions is the demo made real: the retrieval, the grounding, the evaluation harness. Automation handles the workflows around it. Infrastructure is the cloud, the containers, and the code that defines them. Reliability and monitoring is the observability and alerting that keep it running. Security and compliance is the encryption, the access control, GDPR, and the AI Act. Content and SEO is how people find it once it works.

dopomo.pl is where all six meet in one product we own and run, and the same foundations go under the work we do for clients. There is more on our How We Build page, and the dopomo case study walks through one system end to end. For the shorter version of what we do, the Solutions page lays it out.

If you are sitting on a demo

If you have a prototype that impressed everyone and then stalled, the demo is probably not the problem. The four foundations are where the work and the risk live, and they are worth mapping before you commit a budget.

A good place to start is our AI Opportunity Map, which helps you see where AI is worth building in your operation and what it would take to run it for real. If you would rather talk it through, book a call and we will look at your specific case.
