February 20, 2025 · 7 min read

Text-to-SQL on Schemas That Don't Look Like Textbooks

Python · LLM · SQL · SGLang

The Problem With Demo-Grade Text-to-SQL

Every Text-to-SQL benchmark uses clean, normalized schemas: users, orders, products. Tables have sensible names, foreign keys are explicit, columns follow conventions.

Real databases look like this:

SELECT * FROM acct_mstr_v2_PROD
JOIN txn_log_20190312_bkp ON acct_mstr_v2_PROD.acct_id = txn_log_20190312_bkp.a_id
WHERE txn_log_20190312_bkp.flg_del != 1

acct_mstr_v2_PROD. txn_log_20190312_bkp. flg_del. a_id. These names are meaningless to an LLM with no context. Most Text-to-SQL systems fail silently here — they generate plausible-looking SQL that returns wrong results.

Mercer is built to handle this.

The Core Architecture

Mercer has three stages before it ever touches an LLM:

1. Schema enrichment — parse the schema, infer column semantics from sample data, and build a rich context document that includes column names, types, example values, and inferred purpose.

2. Query decomposition — break the user's question into sub-goals. "Which accounts had the most transactions last quarter?" becomes: find the time boundary, identify the transaction table, group by account, order by count.

3. Constrained generation — use SGLang's structured output to enforce that the generated SQL only references tables and columns that actually exist.
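The decomposition target from step 2 can be sketched as a small data structure. A minimal sketch; `SubGoal` and its fields are illustrative assumptions, not Mercer's actual data model:

```python
from dataclasses import dataclass, field

# Hypothetical decomposition plan structure -- names are illustrative,
# not Mercer's actual API.
@dataclass
class SubGoal:
    description: str
    depends_on: list = field(default_factory=list)  # indices of prerequisite sub-goals

# "Which accounts had the most transactions last quarter?" decomposed:
plan = [
    SubGoal("resolve 'last quarter' to a concrete date range"),
    SubGoal("identify the transaction table and its account key"),
    SubGoal("filter by the date range, group by account", depends_on=[0, 1]),
    SubGoal("order by transaction count, descending", depends_on=[2]),
]
```

Making dependencies explicit lets the generator tackle sub-goals in topological order instead of juggling the whole question at once.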

import sglang as sgl

@sgl.function
def generate_sql(s, schema_context: str, question: str):
    s += sgl.system("You are a SQL expert. Only use tables and columns from the schema.")
    s += sgl.user(f"Schema:\n{schema_context}\n\nQuestion: {question}")
    s += sgl.assistant(sgl.gen("sql", max_tokens=512, stop=["```"]))
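The same "only real tables and columns" guarantee can also be checked after generation. Here's a minimal sketch of that check; `validate_identifiers` is a hypothetical helper, and a real system would use a SQL parser rather than a regex:

```python
import re

# Hypothetical post-hoc check: extract candidate identifiers from generated
# SQL and flag any not present in the schema. Illustrative only -- a regex
# tokenizer is not a substitute for a real SQL parser.
def validate_identifiers(sql: str, schema: dict) -> list:
    """schema: {table_name: {column_name, ...}}. Returns unknown identifiers."""
    known = set(schema) | {c for cols in schema.values() for c in cols}
    keywords = {
        "select", "from", "join", "on", "where", "and", "or", "not", "as",
        "group", "by", "order", "limit", "desc", "asc", "distinct",
        "count", "sum", "avg", "min", "max", "null", "is", "in", "like",
    }
    tokens = set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", sql))
    return sorted(t for t in tokens if t.lower() not in keywords and t not in known)
```

Anything the validator flags is grounds for rejecting the generation and retrying, which turns silent wrong-table failures into loud ones.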

Schema Enrichment in Practice

The hardest part is building useful context from a bad schema. Here's how Mercer approaches flg_del:

def infer_column_purpose(col_name: str, samples: list) -> str:
    # Heuristic: short names with 0/1 values are likely boolean flags
    if len(col_name) <= 6 and set(samples).issubset({0, 1, None}):
        return f"{col_name}: boolean flag (0=false, 1=true)"
 
    # Names with 'del', 'rm', 'arch' suggest soft-delete patterns
    if any(kw in col_name.lower() for kw in ["del", "rm", "arch", "dlt"]):
        return f"{col_name}: likely soft-delete marker"
 
    return f"{col_name}: unknown purpose, sample values: {samples[:5]}"

The enriched schema document gives the LLM enough signal to reason about ambiguous columns.
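Assembling those per-column inferences into the context document might look like this. A sketch under stated assumptions: `build_schema_context` is a hypothetical name, the heuristics are condensed from the function above, and the output format is illustrative:

```python
# Hypothetical assembly of the enriched schema document. The heuristics
# mirror infer_column_purpose above; the layout is illustrative only.
def build_schema_context(tables: dict) -> str:
    """tables: {table_name: [(col_name, col_type, sample_values), ...]}"""
    lines = []
    for table, cols in tables.items():
        lines.append(f"TABLE {table}")
        for name, ctype, samples in cols:
            if len(name) <= 6 and set(samples).issubset({0, 1, None}):
                purpose = "boolean flag (0=false, 1=true)"
            elif any(kw in name.lower() for kw in ("del", "rm", "arch", "dlt")):
                purpose = "likely soft-delete marker"
            else:
                purpose = f"sample values: {samples[:3]}"
            lines.append(f"  {name} {ctype} -- {purpose}")
    return "\n".join(lines)
```

One document per query keeps the prompt small: only the tables retrieved as relevant get enriched and included.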

Results

On internal benchmarks against a real legacy ERP schema (87 tables, average column name length: 8 chars), Mercer achieves:

  • 73% exact match on single-table queries
  • 61% exact match on multi-table joins
  • 91% syntactic validity (the SQL runs without errors)

Compare that to a naive GPT-4 baseline: 44% exact match and 78% syntactic validity. The gap is entirely in schema comprehension.

What's Still Hard

Multi-hop joins across 4+ tables remain a weak point. The model loses track of the join path. The fix is probably a planning step — generate the join graph first, then fill in the SQL. That's next on the roadmap.
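That planning step could start as simply as a breadth-first search over the foreign-key graph. A sketch, assuming a flat edge map; `fk_edges` and `join_path` are illustrative names, not Mercer's actual representation:

```python
from collections import deque

# Hypothetical join-path planner: treat foreign keys as an undirected graph
# and BFS for the shortest table path before writing any SQL.
def join_path(fk_edges: dict, start: str, goal: str):
    """fk_edges: {(table_a, table_b): (col_a, col_b)}. Returns a table path or None."""
    adj = {}
    for a, b in fk_edges:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in adj.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None
```

With the path fixed up front, the LLM only has to fill in join conditions edge by edge instead of discovering the topology mid-generation.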

Mercer is open-source. If you're dealing with a legacy schema that makes you want to cry, try it.