Can You Trust AI-Generated SQL?

A vintage balance scale weighing a confident answer against a magnifying glass and a checkmark, over blueprint overlays.

You can trust AI-generated SQL when you can check it, not because the SQL looks right. Modern models write queries that run cleanly and return a confident number almost every time. The failure mode is not the query that breaks. It is the query that succeeds and measures the wrong thing. AI-generated SQL is safe to trust when it is grounded in your defined metrics, shows the query it ran, signals how sure it is, runs read-only, and leaves an audit trail.

Key takeaways

  • AI-generated SQL fails at metric meaning, not SQL syntax. It runs, returns a number, and answers a different question than the one you asked.
  • The failures are not bugs in the SQL. They are missing business context: which table is the source of truth, how "active user" is defined, whether "revenue" means calendar month or a rolling 28 days.
  • A query you can trust comes with five things: it is grounded in your defined metrics, it shows the query, it carries a confidence signal, it runs read-only, and it leaves an audit trail.
  • The number that does the most damage is the plausible one. A broken query gets caught. A wrong-but-believable number gets pasted into a board deck.

Is AI-generated SQL accurate?

A clean run tells you the SQL is valid, not that the answer is correct. Ask a model "how many active users did we have last month" and it will write a query that joins the right-looking tables, filters a date range, counts distinct user IDs, and returns 4.2 million. The query is flawless. It is also wrong if your company defines "active" as a meaningful engagement event and the model counted any event, including sign-ins, or if it pulled from a staging table nobody trusts.

This is the gap between syntax and meaning. Syntax is whether the SQL parses and runs. Meaning is whether it counts the thing your business actually counts. A model is excellent at the first and blind to the second, because the second lives in your company's head, not in the schema.

In Sundial's own evals, a strong model answering real questions against a clean warehouse that already had a semantic layer (a stored map of what each metric means and how tables relate) got the answer right about 80% of the time. The 20% that failed were almost all context failures, not SQL failures. Three common ones:

What the user askedWhat the model assumedWhat the business meant
New signupsCounted sign-ins, because the event data used both terms looselyFirst-time account creation only
Month X revenueA rolling 28-day windowA calendar month
RegistrationsThe wrong source tableThe source that excludes deleted users

Each query ran. Each returned a number. Each measured the wrong thing.

Why is a wrong answer worse than a broken query?

A wrong number that looks reasonable is more dangerous than a query that errors out. When SQL throws an error, you stop. When it returns 4.2 million active users and the real figure is 3.6 million, nobody stops. The number gets screenshotted into a Slack thread, copied into a presentation, and argued about in a meeting three weeks later, by which point no one remembers it came from a query nobody checked.

A wrong answer delivered with confidence is more dangerous than a slow one, because someone makes a decision on it before anyone catches the mistake.

This is why "the AI writes SQL fast" is not the same as "you can trust the AI's SQL." Speed without a way to verify just lets you be wrong faster. For a data analyst who reads SQL fluently, an 80% hit rate is a real gain: they catch the fishy 20% themselves. For a business user typing a question into a chat box, 80% is a trap, because they have no way to tell which one in five answers is the believable lie.

How do you verify AI-generated SQL?

Trust is not a property of the model. It is a property of the system around the query. The same model, given the right system around it, goes from a tool an analyst babysits to one a non-technical person can rely on. Five things have to be true.

Grounded in defined metrics. The query has to reference your official definition of each metric, not the model's guess at what the column names imply. This is what a semantic layer does: it stores what "active user" means, how your tables relate, and which source is the truth, so the SQL counts the thing your company actually counts. Your data team owns those definitions, one canonical version per metric, so the agent is not free to improvise them. Applying this context takes the same model on the same warehouse from about 80% to 98% accuracy. Same model, same data, the difference is business knowledge.

Shows the query. A trustworthy answer hands you the SQL it ran, not just the number. A data person can read it, see which tables it touched and which filters it applied, and catch a wrong join before it reaches a decision. An answer you cannot inspect is an answer you cannot check.

Carries a confidence signal. The system should tell you the difference between a solid answer and a rough estimate. A decision-maker treats "we are confident this is the canonical revenue figure" differently from "this is a directional read from a table we are less sure about." Hiding that uncertainty is how a guess gets mistaken for a fact.

Runs read-only for the people asking questions. What "read-only" means depends on who is using the agent, and the two audiences are different. A business user, someone who consumes data, gets the agent read-only by default: they ask a question, they get an answer, and they cannot change anything. A generated query can return the wrong number, but it cannot drop a table, overwrite data, or corrupt the source. The blast radius of a mistake is contained to the answer, not the database. A data practitioner uses the same agent to do more. It is the one tool that can both help model the tables and run data-quality checks, so the people who build your data can use it to maintain the context layer that everyone else relies on. Read-only is the consumer default. Practitioners get the modeling and quality tooling on top.

Leaves an audit trail. Every question, query, and answer is logged, so when a number looks off, someone can trace exactly how it was produced and where the logic went wrong. Without this, debugging a bad answer is guesswork. With it, a data team can see what the agent is doing across the company and fix the root cause once.

What is the difference between text-to-SQL and agentic analytics?

Most decisions need an investigation, not a single SELECT statement. "What was revenue last week" is one query. "Why did revenue drop last week" is a dozen queries plus the judgment to know which ones matter: segment by plan, by region, by cohort, rule out a billing glitch, check whether a definition changed. That judgment is what an analytics playbook encodes, the recurring-question method written down so the agent runs it the same way every time. A tool that turns one sentence into one query can only ever answer the shallow version.

This is the line between a text-to-SQL chatbot and agentic analytics. Text-to-SQL turns one sentence into one query and hands back the result. Agentic analytics is an AI agent that investigates a question end to end: it plans the steps, runs the queries, checks its own results, revises, and comes back with an answer plus the reasoning. Sundial is an agentic analytics system that, for the people asking questions, runs read-only against your warehouse and grounds every query in your defined metrics. At Sundial the work splits across four agents, each doing a job a human analyst does without thinking. Quality checks whether the underlying data is fresh and complete enough to answer at all, and is the same data-quality tooling a practitioner uses to keep the warehouse trustworthy. Modeling holds what the metrics and entities actually mean, and is how a practitioner models the tables and maintains the context layer. Analysis runs the chain of queries to get to "why," following an analytics playbook rather than improvising the investigation. Storytelling turns the result into something a decision-maker can act on. All four sit on a context layer that holds your business knowledge, so the SQL is grounded rather than improvised.

When should you not trust AI-generated SQL?

Match the trust requirement to the stakes. Generated SQL is ready for the open-ended questions that today become a ticket and a two-day wait: why did churn spike in this segment, what changed in this funnel. It is genuinely good for letting non-technical people get real answers without learning SQL or joining a queue, as long as the answer shows its work and its confidence.

Be more careful where every number has to tie out exactly. For a financial close or audited reporting, you still want governed numbers and a human sign-off, because the cost of a believable-but-wrong figure is too high. Humans stay in the loop on judgment and the highest-stakes calls.

This does not remove the data analyst, but it does transform the job, and a team that runs this well likely needs fewer analysts. The agent can do much of what analysts used to spend their days on: the repetitive pulls and the first pass at any investigation. So the role shifts from reactively answering query requests to architecting the context. The analyst defines the metrics, maps how the tables relate, and sets the source of truth, so that everyone's questions get high-quality answers from a layer they built. The point is not that AI SQL is untrustworthy. It is that trust has to be earned per answer, by a system that lets you check it.

In Sundial's published case study with OpenAI, root-cause investigations that took the data team two to three days of manual SQL across scattered systems dropped to minutes, and new hires contributed real insights on day one instead of week three. That speed only mattered because the answers were checkable.

If you want SQL you can actually trust, generated against your real definitions and shown with its work, that is what we build at Sundial.