Turning Text into SQL: A Python and LLM-Powered Approach

Introduction

Transforming natural language into SQL queries simplifies database access, relying on a clear grasp of table roles and relationships. This project blends Python, PL/SQL, and the Azure OpenAI API to convert text requests into precise SQL statements. By parsing PL/SQL routines, generating metadata, and leveraging embeddings, it creates a robust text-to-SQL system. This article guides developers through the steps, offering a framework to adapt for their own databases.

Project Overview

The system takes a text query and produces a SQL statement, using Python for preprocessing, a database for metadata storage, and Azure OpenAI for analysis. The AzureOpenAIClient class (azure_openai_client.py) handles API interactions, utilizing gpt-4o-mini for descriptions, text-embedding-ada-002 for embeddings, and gpt-4o for query generation, with costs tracked and capped at $5 per script. Tailored for developers, it automates table analysis and delivers a complete text-to-SQL solution.

Technical Breakdown

Preparing the Groundwork with PL/SQL Routine Extraction

split_save_pkg.py launches the pipeline by parsing PL/SQL package body files, splitting procedures and functions into individual .SQL files (stored in an output directory). Using regex and token counting, it isolates routine boundaries to support usage analysis.

Storing Routine Metadata in the Database

populate_rma.py loads these routines into a metadata table, extracting key details by traversing the output directory. This centralizes PL/SQL logic for further querying.

Linking Tables to Routines with Usage Frequency

populate_link_table_rte.py scans routine bodies for table references (like FROM or JOIN clauses), counting occurrences and storing them in a usage table. This quantifies table importance, aiding later interpretation by the language model.

Setting Up the Database Schema

A SQL script establishes core tables to hold metadata, evolving them with fields like table structure (DDL), descriptions, and embeddings. Populated from database sources, these tables connect raw data to LLM-generated outputs.

Generating Table Descriptions with Azure OpenAI

pop_db_routine_table_usage_description.py uses AzureOpenAIClient with gpt-4o-minicreate concise descriptions (2-4 sentences) for each table-routine pair, based on structure and routine logic. These are stored for later use.

Adding SQL Query Examples with Azure OpenAI

pop_db_routine_table_usage_sql_example.py generates SELECT-only queries and matching descriptions via AzureOpenAIClient, saving them to illustrate routine-specific table usage.

Synthesizing General Table Descriptions

gen_table_description.py combines routine-specific descriptions, query examples, and table structure into a unified description per table using AzureOpenAIClient. Processed in batches of 50, these are stored in the metadata.

Crafting General SQL Query Examples

gen_table_query_example.py produces a representative SELECT query and description per table with AzureOpenAIClient, incorporating aliases and typical joins. These are saved to reflect common usage patterns.

Enhancing Table Selection with Embeddings

gen_table_embeddings.py adds vector embeddings to the metadata using AzureOpenAIClient with text-embedding-ada-002. Combining table name, alias, and description, embeddings enable similarity-based table matching.

Converting Text to SQL with Iterative Refinement

text_to_sql.py transforms a text request into SQL by refactoring the input with gpt-4o, matching tables via embedding similarity, selecting a primary table, and generating an initial query. It refines this iteratively (up to three levels) with foreign key joins, finalizing the SQL with gpt-4o. Outputs use implicit syntax and are logged to sql_logs/.

Challenges and Solutions

Parsing PL/SQL required strong regex patterns, while LLM costs were managed with a $5 cap per script via AzureOpenAIClient. Output lengths were constrained to fit database fields, batching addressed large datasets, and embeddings paired with foreign key refinement handled multi-table queries.

Results and Future Potential

The system generates accurate SQL queries, supported by rich metadata and embeddings. Future improvements might add subquery support, real-time execution, or multi-language capabilities.

Conclusion

This pipeline shows how Python and LLMs can connect text to SQL, using PL/SQL routines as a foundation for natural language queries. With AzureOpenAIClient streamlining API interactions and text_to_sql.py driving the conversion, developers can adapt this approach to their databases or enhance it with custom tweaks.

The full project repository is now available on GitHub at https://github.com/NukitaOkamu/text-lm-sql-v3/. Check it out to explore the code, contribute, or adapt it for your own text-to-SQL needs!