When I was in high school, my family owned a car stereo shop. Along with manning the register and sweeping the floors, one of my duties was to help with the car stereo installation process. I spent many of my formative years learning the mystic arts of car stereo installation, and after years of apprenticeship, I was allowed to do the simple installations, primarily door speakers. One of the most important things I learned during my apprenticeship was tool selection. There was always a tool for the job, no matter how hairy or complex that job was. Sometimes it was tin snips to widen a hole; other times it was a battery with wires attached, used to test a speaker via its contacts. The most frequent tool used in this process was made from a straightened wire coat hanger and electrical tape. We'd attach speaker wire to the hanger and use it to lace the wire through the nooks and crannies of the customer's car. During my apprenticeship, I learned how to discern which tool was best for each job.
Most of the teams I work on nowadays spend large amounts of time and resources investigating the litany of AI tools and technologies available. The primary goal of this research is to leverage these tools to improve our users' and customers' experiences. One area of research that seems promising is using natural language to build information-retrieval applications capable of gleaning information from company databases, documents, and other sources of information. Basically, internal search engines.
Over the last three-plus decades, I've worked on tons of database systems. Without fail, every system ended up with some form of hand-cobbled query generator in which end users could write their own queries. This hand-cobbled system was usually some type of screen resembling the command console of the Starship Enterprise. It takes a PhD just to generate a mailing list from these beasts.
What if we could create systems where users use natural language to ask questions like: “Who were my most valued customers last month?” or “Break down sales by product and by month over the last three years?”
My guess is that the current state of AI tech should be able to handle this scenario.
I began my research looking at available open-source tools and code. Overall, I felt that to create something really useful, the cost-to-value ratio would be prohibitive. I started looking for something that was more self-contained. Luckily for me, Snowflake, my cloud database provider, has added a huge set of AI features, and one of them might work for my needs. After a bit of research, I found Cortex, a natural language-to-SQL tool built into the Snowflake architecture. Cortex works by taking a natural language query and sending it to Snowflake's LLM engine along with metadata (defined in a YAML-based semantic model); the engine returns a SQL query. You then take that query and pass it to Snowflake, which returns your query data. Figure 1 demonstrates this.
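To give a flavor of the metadata involved, here's a minimal sketch of a semantic model file. The database, table, and column names are hypothetical, and the exact field names should be checked against Snowflake's semantic model specification; treat this as illustrative structure, not an authoritative schema.

```yaml
# Illustrative semantic model fragment (hypothetical names).
# It maps business terms like "customer" and "total sales" onto
# physical columns so the LLM can generate sensible SQL.
name: sales_model
tables:
  - name: orders
    base_table:
      database: ANALYTICS
      schema: PUBLIC
      table: ORDERS
    dimensions:
      - name: customer_name
        expr: CUSTOMER_NAME
        data_type: varchar
    time_dimensions:
      - name: order_date
        expr: ORDER_DATE
        data_type: date
    measures:
      - name: total_sales
        expr: SALES_AMOUNT
        default_aggregation: sum
        data_type: number
```

The point of the file is to give the model the vocabulary of your schema, so a question like "break down sales by month" can be resolved to the right columns and aggregations.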

After building these different pieces and parts, I was able to start firing queries at the engine and, to be honest, its success rate was impressive. I demonstrated this a few months back at a live meeting, and it generated successful queries around 95% of the time. This raises the question: Yes, it generated queries that ran without error 95% of the time, but did those queries reflect the actual intent of my question? This was something I considered almost immediately. If I want to put this in users' hands, how do I make sure the system provides accurate answers?
My team is starting to handle this using two techniques. One is taking advantage of the Verified Query feature provided by Snowflake's semantic model specification. Verified queries are ones that developers have deemed "good" and that the model can use as trusted examples. The more verified queries, the better the model's results.
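A verified query entry lives alongside the rest of the semantic model. The sketch below shows the general shape; the query, names, and exact field set are hypothetical, so check them against Snowflake's semantic model specification before relying on them.

```yaml
# Hypothetical verified query entry. Once a developer has confirmed
# that this SQL correctly answers this question, it becomes a trusted
# example the model can draw on for similar questions.
verified_queries:
  - name: top_customers_last_month
    question: Who were my most valued customers last month?
    sql: |
      SELECT customer_name, SUM(sales_amount) AS total_sales
      FROM analytics.public.orders
      WHERE order_date >= DATEADD('month', -1, DATE_TRUNC('month', CURRENT_DATE))
        AND order_date < DATE_TRUNC('month', CURRENT_DATE)
      GROUP BY customer_name
      ORDER BY total_sales DESC
    verified_by: developer_name
```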
This is augmented by using unit testing to send variants of these queries to the engine and then verifying the results against known good queries. These two techniques were very loosely inspired by an article I read about the Mayo Clinic. The article, "Mayo Clinic's secret weapon against AI hallucinations: Reverse RAG in action," discusses using multiple passes of RAG to help prevent hallucinations. The goal is to enable users to ask discerning questions and get deterministic data in return. Although this RAG style of validation isn't perfect, it's moving us closer to putting these systems into users' hands.
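The unit-testing side can be sketched as follows. This is a simplified stand-in, not our production harness: `generate_sql` and `run_query` are hypothetical stubs where the real code would call Cortex and execute SQL against Snowflake, and the sample data is invented. The shape of the check is the real point: several phrasings of the same question should produce queries whose results match the verified query's results.

```python
# Sketch: verify that question variants produce the same result set
# as a known-good (verified) query. The stubs below stand in for
# real Cortex and Snowflake calls, which are out of scope here.
import unittest

# Rows the verified query is known to return (invented sample data),
# keyed by different phrasings of the same underlying question.
KNOWN_GOOD_RESULTS = {
    "Who were my most valued customers last month?": [("Acme", 1200), ("Globex", 950)],
    "List my top customers for the previous month": [("Acme", 1200), ("Globex", 950)],
}

def generate_sql(question: str) -> str:
    """Stand-in for the Cortex call that turns a question into SQL."""
    return "SELECT customer_name, SUM(sales_amount) ..."

def run_query(sql: str) -> list:
    """Stand-in for executing SQL against Snowflake."""
    return [("Acme", 1200), ("Globex", 950)]

class TestGeneratedQueries(unittest.TestCase):
    def test_variants_match_verified_results(self):
        for question, expected in KNOWN_GOOD_RESULTS.items():
            generated = run_query(generate_sql(question))
            # Compare as sets so row ordering doesn't cause false failures.
            self.assertEqual(set(generated), set(expected), question)
```

In practice you'd run this with `python -m unittest`, and a failure flags a question phrasing for which the generated SQL no longer matches the verified answer.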
As researchers, we need to clarify with our users the true nature of these LLM-based AI systems. LLMs are probabilistic systems: they make best guesses from your queries and return answers that may or may not be deterministic. Users are accustomed to getting deterministic answers from technology: We put in two plus two and we expect to get four. This is why LLM-based systems are truly buyer-beware.
That returns this discussion to that apprentice stereo installer, now the AI researcher apprentice. What tools will the researcher decide are best for the job? Time will tell, but semantic models, RAG-inspired validation, and reinforced model training with verified queries seem like some long-term tools to be added to the toolbox.