Natural language to SQL is one of those technologies that looks magical in demos and fails spectacularly in production. I've seen it work and I've seen it fail, and the difference almost always comes down to one thing: whether the AI knows what your data actually means.
The problem isn't the AI's SQL-writing ability. Claude, GPT-4, and similar models can write syntactically correct SQL all day. The problem is that "syntactically correct" and "semantically correct" are very different things.
Why "Just Send the Schema" Doesn't Work
The naive approach to NL-to-SQL is: dump the table schema into the prompt, append the user's question, and ask the model to write SQL. This works beautifully on demo databases with 5 tables named "customers", "orders", and "products."
It fails immediately on real-world databases because:
Column names are meaningless. Your data warehouse has columns named "flg_active", "dt_eff", "cd_status", and "amt_rev_net_adj_ytd." The AI doesn't know that "flg_active" means "this customer has made a purchase in the last 12 months" or that "amt_rev_net_adj_ytd" means "year-to-date revenue after returns and promotional adjustments."
Business logic is invisible. When a user asks "how much revenue did we do last quarter?" they mean something very specific. They mean net revenue, after returns, after promotional adjustments, using the fiscal calendar (which might not align with the standard calendar), excluding internal test orders, and counting the order date (not the ship date).
None of that is in the schema. It's in tribal knowledge, scattered across old SQL queries, and in the heads of two people who've been at the company for 10 years.
Relationships aren't obvious. Your schema has 78 tables. The AI needs to know that "dim_customer" joins to "fact_orders" on "customer_sk", not "customer_id" (which is the natural key from the source system and doesn't match the surrogate key in the fact table).
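One way to keep the AI from guessing at join keys is to write them down explicitly so a query generator looks them up. A minimal sketch, using the table and column names from above; the structure itself is an assumption, not a prescribed format:

```python
# Illustrative sketch: record the non-obvious join keys explicitly
# so a query generator looks them up instead of guessing.
# Table and column names follow the examples in this article.

JOIN_MAP = {
    ("fact_orders", "dim_customer"): "customer_sk",  # surrogate key, NOT customer_id
    ("fact_contracts", "dim_agency"): "agency_sk",
}

def join_clause(fact: str, dim: str) -> str:
    """Build the correct JOIN ... ON clause from the documented keys."""
    key = JOIN_MAP[(fact, dim)]
    return f"JOIN {dim} ON {fact}.{key} = {dim}.{key}"

print(join_clause("fact_orders", "dim_customer"))
# JOIN dim_customer ON fact_orders.customer_sk = dim_customer.customer_sk
```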
What a Schema Context Document Contains
A Schema Context Document (SCD) is a plain-English knowledge base that tells the AI everything it needs to know about your data. For every table and every important column, it includes:
- What this table represents in business terms
- What each column means — not the data type, the business meaning
- How this table relates to other tables — join keys, cardinality, common join patterns
- Business rules and edge cases — fiscal calendar logic, exclusion rules, known data quality issues
- Example queries — for common questions, what the correct SQL looks like and why
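To make that concrete, here's what a single table's entry might look like when expressed as structured data. The field names, rules, and example query are illustrative, not a fixed format:

```python
# One illustrative SCD entry for a single table, as structured data.
# Column names echo the examples above; everything else is a sketch.
scd_entry = {
    "table": "dim_customer",
    "description": "One row per customer, deduplicated across source systems.",
    "columns": {
        "customer_sk": "Surrogate key; use this (not customer_id) to join fact tables.",
        "flg_active": "1 if the customer has made a purchase in the last 12 months.",
        "dt_eff": "Date this version of the customer record became effective.",
    },
    "joins": {"fact_orders": "customer_sk"},
    "rules": ["Exclude internal test customers: cd_status = 'TEST'."],
    "examples": [
        {
            "question": "How many active customers do we have?",
            "sql": "SELECT COUNT(*) FROM dim_customer WHERE flg_active = 1",
        }
    ],
}
```

An entry like this serves double duty: rendered as prose it's the human documentation, and serialized it's the retrieval unit the AI sees.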
For LucidAgent's 78-table database, the SCD runs about 15,000 words. That sounds like a lot, but it's the difference between "the AI writes random SQL" and "the AI writes the query your analyst would have written."
How It Becomes the AI's Knowledge Base
The SCD goes into the system prompt (or a retrieval context layer for larger schemas). When a user asks "What's the contract award trend for Department of Defense this quarter?", the AI doesn't just see "fact_contracts" and "dim_agency" — it knows:
- "Department of Defense" maps to agency_code = '9700' in dim_agency
- "contract award trend" means sum of obligated_amount grouped by award_date month
- The current fiscal quarter starts in January (federal fiscal year)
(The federal fiscal year begins October 1, so fiscal quarters don't line up with calendar quarters.)
- fact_contracts joins to dim_agency on agency_sk
- Awards with status_code = 'X' are de-obligations and should be excluded
That's the difference between a demo and a product.
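Given that context, the SQL a grounded generator should produce looks something like the following. Table and column names come from the bullets above; the date literal and function names are illustrative (PostgreSQL-style) assumptions:

```python
# Illustrative SQL a context-aware generator should produce for
# "contract award trend for Department of Defense this quarter".
# Table/column names come from the SCD bullets; the quarter-start
# date is an assumed placeholder.
DOD_TREND_SQL = """
SELECT DATE_TRUNC('month', f.award_date) AS award_month,
       SUM(f.obligated_amount)           AS total_obligated
FROM fact_contracts f
JOIN dim_agency a ON f.agency_sk = a.agency_sk
WHERE a.agency_code = '9700'               -- Department of Defense
  AND f.status_code <> 'X'                 -- exclude de-obligations
  AND f.award_date >= DATE '2025-01-01'    -- start of current fiscal quarter
GROUP BY 1
ORDER BY 1
"""
```

Every non-obvious clause here (the agency code, the surrogate-key join, the status exclusion) came from the SCD, not from the schema.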
The Validation Loop
Even with a good SCD, AI-generated SQL needs validation. Our pipeline works like this:
1. User asks a question in natural language
2. AI generates SQL using the SCD context
3. SQL is executed against the database
4. Results are validated (row count sanity check, null ratio check, range check)
5. AI writes a narrative explanation of the results
6. If validation fails, the AI gets the error and retries with corrected SQL
This loop catches most errors. For LucidAgent, we see about 92% first-attempt accuracy on queries within the documented schema, and 98% after the retry loop.
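A minimal sketch of that loop, where `generate_sql` and `run_query` are hypothetical stand-ins for the model call (with SCD context baked in) and the database client, and the checks mirror the ones described above:

```python
# Minimal sketch of the generate -> execute -> validate -> retry loop.
# generate_sql and run_query are stand-ins for the model call and the
# database client; names and thresholds are illustrative assumptions.

def validate(rows):
    """Return an error string if results look wrong, else None."""
    if not rows:
        return "query returned zero rows"
    null_ratio = sum(v is None for r in rows for v in r) / sum(len(r) for r in rows)
    if null_ratio > 0.5:
        return f"too many nulls ({null_ratio:.0%})"
    return None

def answer(question, generate_sql, run_query, max_retries=2):
    error = None
    for _ in range(max_retries + 1):
        sql = generate_sql(question, error=error)  # retry gets the prior error
        try:
            rows = run_query(sql)
        except Exception as e:                     # execution errors also trigger retry
            error = str(e)
            continue
        error = validate(rows)
        if error is None:
            return rows
    raise RuntimeError(f"query failed validation: {error}")
```

The key design point is that the error message feeds back into the next generation attempt, so the model corrects the specific failure rather than regenerating blind.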
The Investment
Creating an SCD for an existing database is typically a 1-2 week engagement. We work with your data team to document every table, validate business rules, and build the example query library. The output is both a human-readable document and the AI context configuration.
Once it exists, it's also the best documentation your database has ever had — which has value far beyond the AI use case.
Adding NL Query to Your Warehouse?
If you have a data warehouse and want to make it queryable in plain English, the SCD is where we start. We can scope a Schema Context Document and NL Query API for your warehouse in about a week. luciddatamind.com/contact
