Why generating raw SQL by agents is hard

Motley Team

•

March 27, 2026

Semantic layer for agents, part I

A common task for AI agents is generating database queries. This could be in response to a user query, or as part of an ongoing workflow the agent orchestrated itself.

‍

For the more complex of these queries, there are two schools of thought: one, that agents should just generate the raw SQL for the query (if they can generate hundreds of lines of code, why not SQL?); another, that it’s better to have the agent generate queries to an intermediate layer usually called the semantic layer.

‍

Motley comes down on the second side, with caveats; this series covers the reasoning behind this and the tradeoffs associated with either option. This post will cover the challenges of generating raw SQL.

‍

So why is generating SQL hard?

SQL is a hugely complex, flexible language. The space of all valid SQL queries is much larger than the space of queries likely to be requested in the process of answering a business-related question.

‍

The first challenge is to make sure that what the agent produced is actually valid SQL, and that it makes sense against the database you’re querying. That is dealt with easily enough by building a validation loop, that is feeding back to the agent any errors produced by trying to run what it generated.

‍

But now comes the fun part. A query that answers a business question correctly must get several distinct things right, and will not be good enough if even one of these is done wrong. Here are the dimensions of complexity:

‍

Business logic

The business logic can be of two kinds, objective logic and conventions

‍

Objective logic refers to things that can be clearly verified to be true or false, for example how to convert a monetary total to EUR given that each aggregated row has an entry in a different currency, combined with the quoting conventions in the exchange rates table. Mildly fiddly, but in the end either it’s done correctly or not.

‍

Conventions are another, much harder challenge. For example, how is each funnel stage defined in a customer journey? What are the filters “everyone knows” you must apply to the raw data for your results to make sense (eg exclude columns with the deleted_at field being not null)? If you want data “year to date”, which of the timestamp columns in a table do we apply that filter to?

‍

A related challenge is relationships between tables. If these are expressed as foreign key relationships in the database, the agent can at least gain that data by introspecting it, but sometimes data is just used for enrichment, doesn’t fulfill the foreign key constraint, and is not documented in the database.

‍

Transformations

Only sometimes can the question you care about be answered by a straightforward groupby query. Often, you have to do transformations before, during or after the query:

‍

Before doing the main query, you might need to first calculate the values that that query will use. For example, if you want to display revenue grouped by month-on-month change in customer activity, you’ll first need to calculate the latter by customer (two steps, actually, first activity by customer by month, then its change), then join the result to the customer ID so you can group by it.

‍

During the main query, you might have transformations such as time shift (here is this complex metric, now give me the same thing but shifted by a year).

‍

After the main query is done, you might want to further transform its results, for example looking at the latest change along the timestamp dimension of that query.

‍

The before/during/after distinction is more conceptual than technical here — it is absolutely possible to combine all of the above into one monster query (and many solutions, with or without agents, do).

‍

Consistency

Even if all of the above have been solved, consistency remains a challenge. Often, there are several reasonable ways of translating a natural language query into SQL. Each of these may be individually defensible, but if the agent randomly chooses a different one every time it’s asked the same question, the results will be very hard to interpret and not perceived as reliable.

‍

So how do you solve that?

How do you deal with these challenges? The naive way is to inject instructions for all of the above into the agent’s context, and hope.

‍

However, the problem with that approach is not even that any one of these challenges is that hard, but that the agent must often solve for all of them at once — and as usual, as soon as you try to squeeze too many things into a prompt, the chances grow that the LLM will ignore some of them.

‍

A better approach is formal validation; but it is quite difficult to write for a general case (for example, just imagine having to check “does this query contain a time shift, and if so is it done correctly”, and so for every other component of the query. It can maybe be done, but the effort is far from trivial.

‍

Introducing semantic layers

Now imagine that instead of writing raw SQL, the agent could express a query using all of the above in an intermediate formal language that mostly only dealt with intent rather than implementation (which agents are great at) and can be validated easily; and then a deterministic procedure was applied that translated that formal language into a SQL query that is guaranteed to be valid, to apply all the conventions such as required filters, and to correspond perfectly to the formal language statement (we can give these guarantees because the translation is deterministic). For example, the agent could essentially say something like “Series 2 = series 1 time shifted by 12 months”, and the deterministic procedure would translate that into the correct SQL applied to the correct time column.

‍

That is what a semantic layer does. The next post will cover how these work, the tradeoffs involved in using one, and how these can be navigated.

‍