arXiv: 2605.21027 | Published: May 2026
Researchers: Gundeep Singh, Parsa Kavehzadeh, Jing Xia, Xue-Yong Fu, Julien Bouvier Tremblay, Md Tahmid Rahman Laskar, Vincent Lum, Shashi Bhushan TN
Task Success on Real Enterprise Analytics Queries
Tested at two Fortune 500 companies. The difference is not a better model — it is a fundamentally different architecture.
Why This Paper Matters to Every Company That Has a Data Warehouse
For roughly a decade, a simple promise has driven enterprise AI investment: let non-technical teams ask questions in plain English and get answers from company data. Type “show me sales by region this quarter” and the AI writes the SQL, hits the database, and returns a chart.
It has not worked. Not in any serious enterprise deployment. And the reason is not that the AI cannot write SQL. The reason is that enterprises do not give AI systems direct database access.
A new paper from Dialpad — the company behind the contact-center AI platform — documents the failure and the fix at two Fortune 500 companies. For every Chief Data Officer, Chief Information Officer, and Head of Business Intelligence who has watched a Text-to-SQL pilot fail, this paper is the post-mortem and the blueprint for what comes next.
The Text-to-SQL gap has nothing to do with SQL accuracy. Academic benchmarks show 80%+ accuracy. The gap between 80% on a benchmark and 18% in a real enterprise comes down to three structural problems:
Fragmented, Moving Data
Enterprise data lives across warehouses, lakes, and engines. Schema changes constantly. Business logic lives in views and application code — not raw tables. Benchmark queries break immediately against a moving production system.
Governed Permissions
Different teams see different data. Some metrics are confidential. A Text-to-SQL system with direct database access bypasses every access control. One bad prompt becomes a compliance incident.
Business Logic Lives in APIs, Not Tables
The correct way to get “average daily call volume for the support team” is not raw SQL — it is calling the analytics API endpoint that already computes that metric with the correct business logic applied.
Methodology: The Six-Stage Pipeline, Explained Simply
Think of a corporate building with a guard at every door. Traditional Text-to-SQL tries to give every visitor a master key — direct database access. Dialpad’s Analytic Agent takes a different approach: no master key. Instead, the agent understands which rooms exist, which doors lead where, and how to ask the guard for permission.
Five specialized LLM agents work together across six stages:
1. Parses user intent (“what is average daily call volume for support this week?”) and routes it to the appropriate specialist agent.
2. Identifies the entity (“support team”), matches it against organizational metadata using fuzzy matching, and checks permissions. If access is denied, the pipeline stops here.
3. Selects the correct analytics API endpoint (not SQL) and constructs the request. Date ranges are handled deterministically, not by the LLM, to prevent hallucinated time ranges.
4. Executes the API call. On failure, a two-tier retry loop kicks in: programmatic correction for predictable errors, LLM reasoning for ambiguous ones. This is what makes it production-grade.
5. Generates chart configurations using deterministic rules: line charts for time series, bar charts for comparisons, pie charts for proportions. Chart logic is not LLM-generated.
6. The Orchestrator wraps the result and chart into a plain-English response. In production, the system runs on Google Cloud Run using Gemini 2.5 Flash.
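The deterministic chart rules in stage 5 can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names, the `is_share` flag, and the shape of the returned config are all assumptions made for the example.

```python
from datetime import date


def choose_chart_type(x_values, is_share=False):
    """Pick a chart type from the shape of the result, with no LLM involved.

    Hypothetical rule set mirroring the article: line for time series,
    pie for proportions, bar for everything else (comparisons).
    """
    if x_values and all(isinstance(x, date) for x in x_values):
        return "line"  # x axis is made of dates, so treat it as a time series
    if is_share:
        return "pie"  # values are parts of a whole
    return "bar"  # default: comparison across categories


def build_chart_config(title, x_values, y_values, is_share=False):
    """Return a plain dict that a charting front end could consume (assumed shape)."""
    return {
        "type": choose_chart_type(x_values, is_share),
        "title": title,
        "x": [x.isoformat() if isinstance(x, date) else x for x in x_values],
        "y": list(y_values),
    }
```

Because the rule is a pure function of the result, the same query always yields the same chart type, which is the property the article attributes to keeping chart logic out of the LLM.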
What the Numbers Actually Mean for Your BI Strategy
| Model Tier | Execution Success | Relative Cost |
|---|---|---|
| Gemini 2.5 Pro (most powerful) | 96.67% | $$$ |
| Gemini 2.5 Flash (production choice) | 94.44% | $ ✓ |
| Gemini 2.5 Flash-Lite (smallest) | 44.44% | ¢ |
The gap between Pro and Flash is about 2.2 percentage points. The gap between Flash and Flash-Lite is 50 percentage points. The model-capability threshold is real, but it is not high. A sufficiently capable model with the right architecture beats a more powerful model with the wrong one. Enterprise analytics AI is an orchestration problem, not a benchmark problem.
Core Architectural Insight
The LLM Should Be a Planner Over Stable, Governed Interfaces — Not the Repository of Business Logic
The business logic lives in the APIs. The permissions live in the governance layer. The LLM navigates them. Governance is not a constraint the AI must work around — it is the reason the AI works at all.
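One way to picture "planner over governed interfaces" is a registry of approved endpoints that every LLM-proposed plan must pass through before anything executes. The sketch below is an assumption-laden illustration: the endpoint names, paths, parameter names, and the `PlanError` type are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


class PlanError(ValueError):
    """Raised when an LLM-proposed plan does not fit a registered endpoint."""


@dataclass(frozen=True)
class Endpoint:
    path: str
    required: frozenset
    optional: frozenset = frozenset()


# Hypothetical registry: the business logic lives behind these endpoints.
REGISTRY = {
    "avg_daily_call_volume": Endpoint(
        path="/analytics/calls/daily-average",
        required=frozenset({"team_id", "start", "end"}),
    ),
    "call_volume_by_team": Endpoint(
        path="/analytics/calls/by-team",
        required=frozenset({"start", "end"}),
        optional=frozenset({"limit"}),
    ),
}


def validate_plan(plan: dict) -> dict:
    """Accept a plan only if it names a known endpoint with valid parameters."""
    name = plan.get("endpoint")
    if name not in REGISTRY:
        raise PlanError(f"unknown endpoint: {name!r}")
    endpoint = REGISTRY[name]
    params = plan.get("params", {})
    missing = endpoint.required - set(params)
    unknown = set(params) - endpoint.required - endpoint.optional
    if missing:
        raise PlanError(f"missing parameters: {sorted(missing)}")
    if unknown:
        raise PlanError(f"unexpected parameters: {sorted(unknown)}")
    return {"path": endpoint.path, "params": dict(params)}
```

In this sketch the model can only choose among interfaces the data team has published; a plan that asks for anything else, such as raw SQL, is rejected before it reaches the data.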
Key Takeaways for Chief Data Officers and Heads of Business Intelligence
Stop investing in Text-to-SQL. Start investing in API orchestration.
The paper makes the empirical case. Direct database access through LLMs fails in enterprise settings. If your data platform team is building Text-to-SQL features, this is the evidence you need to redirect investment.
Build the governance layer first.
The paper’s architecture puts permissions, entity resolution, and validation between the LLM and the data. The governance layer is not an add-on — it is the foundation.
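A minimal sketch of the entity-resolution and permission step follows, using the standard library's `difflib` for fuzzy matching. The team names, IDs, users, and exception types are invented for illustration, and the paper may use a different matching technique entirely.

```python
import difflib

# Hypothetical organizational metadata and permission grants.
ORG_ENTITIES = {
    "Support": "team-001",
    "Sales": "team-002",
    "Customer Success": "team-003",
}
USER_PERMISSIONS = {
    "alice": {"team-001", "team-003"},
    "bob": {"team-002"},
}


class UnknownEntity(Exception):
    """No organizational entity is close enough to the user's wording."""


class AccessDenied(Exception):
    """The user is not allowed to see data for the resolved entity."""


def resolve_entity(mention: str, cutoff: float = 0.6) -> str:
    """Map a free-text mention to an entity ID via fuzzy string matching."""
    by_lower = {name.lower(): name for name in ORG_ENTITIES}
    matches = difflib.get_close_matches(
        mention.lower(), list(by_lower), n=1, cutoff=cutoff
    )
    if not matches:
        raise UnknownEntity(mention)
    return ORG_ENTITIES[by_lower[matches[0]]]


def authorize(user: str, mention: str) -> str:
    """Resolve the entity, then stop the pipeline if access is not granted."""
    entity_id = resolve_entity(mention)
    if entity_id not in USER_PERMISSIONS.get(user, set()):
        raise AccessDenied(f"{user} may not query {entity_id}")
    return entity_id
```

The point of the ordering is that the permission check runs on the resolved ID before any request is built, so a denied user never causes an API call.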
Do not optimize for the most powerful model.
Flash delivered 94.44% execution success at a fraction of Pro’s cost. The architecture — governed APIs, retry logic, deterministic date handling, caching — compensates for model capability differences. Optimize for architecture and cost, not benchmark scores.
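Deterministic date handling can be as simple as a lookup from a recognized phrase to a computed range. The sketch below assumes a small, hypothetical phrase vocabulary and Monday-to-Sunday weeks; the paper does not specify either.

```python
from datetime import date, timedelta


def resolve_date_range(phrase: str, today: date) -> tuple[date, date]:
    """Turn a relative date phrase into an inclusive (start, end) range.

    The LLM only has to pick the phrase; the arithmetic is done in code,
    so the range cannot be hallucinated. Weeks are assumed to start on Monday.
    """
    phrase = phrase.strip().lower()
    if phrase == "today":
        return today, today
    if phrase == "yesterday":
        day = today - timedelta(days=1)
        return day, day
    if phrase == "this week":
        start = today - timedelta(days=today.weekday())  # back to Monday
        return start, today
    if phrase == "last week":
        end = today - timedelta(days=today.weekday() + 1)  # previous Sunday
        return end - timedelta(days=6), end
    if phrase == "this month":
        return today.replace(day=1), today
    raise ValueError(f"unsupported date phrase: {phrase!r}")
```

Passing `today` in explicitly keeps the function pure and easy to test; an unrecognized phrase raises an error instead of producing a guessed range.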
The retry loop is the difference between a demo and a deployment.
Any analytics AI can handle the happy path. Production systems handle the unhappy path gracefully. The paper’s two-tier retry — programmatic for predictable errors, LLM-powered for ambiguous ones — is the production pattern.
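The two-tier pattern can be sketched as a loop that tries a rule-based fix first and falls back to an LLM-supplied repair only when no rule applies. Everything named here is hypothetical: the `ApiError` type, the error codes, and the `call_api` and `llm_repair` callables are stand-ins, not the paper's interfaces.

```python
from typing import Callable, Optional


class ApiError(Exception):
    """Hypothetical API failure carrying a machine-readable error code."""

    def __init__(self, code: str, message: str = ""):
        super().__init__(f"{code}: {message}")
        self.code = code


def programmatic_fix(request: dict, error: ApiError) -> Optional[dict]:
    """Tier 1: deterministic repairs for predictable errors (assumed codes)."""
    if error.code == "INVALID_DATE_ORDER":
        return {**request, "start": request["end"], "end": request["start"]}
    if error.code == "LIMIT_TOO_LARGE":
        return {**request, "limit": 100}
    return None  # no rule applies


def execute_with_retry(
    request: dict,
    call_api: Callable[[dict], dict],
    llm_repair: Callable[[dict, ApiError], Optional[dict]],
    max_attempts: int = 3,
) -> dict:
    """Call the API, repairing the request between attempts."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return call_api(request)
        except ApiError as error:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            fixed = programmatic_fix(request, error)
            if fixed is None:
                fixed = llm_repair(request, error)  # Tier 2: ambiguous errors
            if fixed is None:
                raise  # neither tier could repair the request
            request = fixed
```

Trying the cheap rule-based tier first means the LLM is only consulted for errors that code cannot classify, and the attempt cap keeps a stubborn failure from looping indefinitely.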
Caching is the unsung economics story.
Caching cut cost by 64% and latency by 22%. In enterprise analytics, many queries are repeated: same metric, same team, same time range. Caching captures these repeats and eliminates their re-execution cost entirely.
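A result cache keyed on the endpoint and its normalized parameters is enough to illustrate why repeated queries become nearly free. This is a sketch under assumptions: the class name, the time-to-live policy, and the key format are invented here, and a real deployment would likely also include the caller's permission scope in the key.

```python
import json
import time


class QueryCache:
    """Small in-memory cache for analytics API results (illustrative only)."""

    def __init__(self, ttl_seconds: float = 300.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._store = {}  # key -> (stored_at, value)
        self.hits = 0
        self.misses = 0

    @staticmethod
    def make_key(endpoint: str, params: dict) -> str:
        # Sorting keys makes the cache key independent of parameter order.
        return endpoint + "?" + json.dumps(params, sort_keys=True)

    def get_or_fetch(self, endpoint: str, params: dict, fetch):
        """Return a fresh cached value, or call fetch(endpoint, params) once."""
        key = self.make_key(endpoint, params)
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            self.hits += 1
            return entry[1]
        self.misses += 1
        value = fetch(endpoint, params)
        self._store[key] = (now, value)
        return value
```

The time-to-live bounds how stale an answer can be, which matters because the underlying metrics keep changing; the right value depends on how fresh each metric needs to be.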
The role of the Chief Data Officer is shifting.
Instead of writing SQL for every business user request, the data team builds analytics APIs with clear interfaces, documented metrics, and governance rules. The AI navigates them. The team maintains them. The architecture scales.
Thanks to All Authors
Gundeep Singh — Dialpad Inc., San Francisco, CA
Parsa Kavehzadeh — Dialpad Inc., San Francisco, CA
Jing Xia — Dialpad Inc., San Francisco, CA
Xue-Yong Fu — Dialpad Inc., San Francisco, CA
Julien Bouvier Tremblay — Dialpad Inc., San Francisco, CA
Md Tahmid Rahman Laskar — Dialpad Inc., San Francisco, CA
Vincent Lum — Dialpad Inc., San Francisco, CA
Shashi Bhushan TN — Dialpad Inc., San Francisco, CA
Frequently Asked Questions
What does this mean for a Chief AI Officer?
Your Text-to-SQL pilots are failing not because AI models are weak, but because enterprise governance prevents direct database access—and that’s actually correct security policy. The 92% success rate from governed API orchestration shows you can solve this by building a different architecture layer that sits between AI systems and your data, rather than trying to make AI safer at writing SQL. This shifts your investment from model improvement to orchestration infrastructure.
Why do academic benchmarks show 80%+ SQL accuracy while real deployments fail 82% of the time?
Benchmarks test whether AI can write correct SQL syntax in isolation, but enterprise data environments involve access controls, schema complexity, business logic rules, and governance policies that academic datasets don’t replicate. When an AI system hits a permission denial or schema constraint, it fails the entire query—not because the SQL was wrong, but because it attempted something the system architecture forbids. The Dialpad research shows that wrapping multiple smaller API calls through a governance layer succeeds where single-query SQL approaches break.
How does this research relate to an AI Assessment for companies?
Companies evaluating their AI analytics capabilities through an AI Assessment framework should distinguish between model capability (what the AI can theoretically do) and system capability (what the architecture actually allows). Silicon Valley Certification Hub recommends using this paper’s findings to audit whether your organization is testing Text-to-SQL in a realistic enterprise context with actual security policies, not in a sandbox—because the gap between those two scenarios is where billions in failed AI investments have landed. A proper assessment should measure orchestration readiness, not just model performance.
What should we do differently in our next AI analytics initiative?
Start by mapping your governance requirements and existing API ecosystem before selecting a query approach—the architecture choice must follow your security model, not precede it. Then pilot governed API orchestration with a small set of common analytics queries rather than attempting unrestricted natural language SQL, because the research shows this delivers immediate results while remaining compliant with your data access policies. This approach shifts your analytics AI from a long-term research bet to a near-term operational win.