The Enthusiasm
These days you can’t open social media without being bombarded by conversations about AI/ML, LLM, GenAI, ChatGPT, and Bard. It’s exciting but there’s only so much AI stimulation I can take before I start to go numb to the terms. Every product under the sun suddenly has AI baked right into it.
Don’t get me wrong! This is an important time in history and an opportunity I want to be a part of. How do I get involved?
This set of blogs will try to document some of my personal GenAI exploration. I hope it helps you, or inspires you in some way to start your own journey into your unknown.
My Spark
As a kid, I was enamored with magic. Who am I kidding, I’m still enamored with it. 😅 There was this magic shop, Earnie Num Yukers, in the mall that I loved to frequent. I could have stayed in that store watching the magicians do tricks all day if my mom had let me. I’d save up birthday or chore money and, from time to time, buy a new trick. I still have that box of magic tricks somewhere in the garage. There’s something exciting about understanding a topic that appears magical.
Software was also initially magical, and that’s probably what draws me to the technology field. More recently, Data Science beckons me with a similar mystique.
My Journey
About 5 years ago I took an online college Algebra class with the intention of transitioning into the field of Data Science. The thought was that a strong foundation in math was a precursor to making the move. I completed the Algebra class and started Calculus, but I also started to question how much of that knowledge would actually be useful. Would I even like the work once I was “prepared enough”?
Life steered my attention in other directions at the time, but recent advancements with tools like ChatGPT and Bard have re-sparked that old fascination with magic. Watching their magic, I find myself trying to figure out how they do the trick. This time around I plan to take a different approach and let my natural curiosity steer the exploration instead of making a huge investment in developing broad upfront knowledge.
When faced with something new, I prefer to accelerate my learning by soliciting advice from those with experience. With suggested starting points and some homework material, I head out to develop the terminology, get oriented, and establish a base understanding of the high level processes. Once I have enough understanding to clumsily navigate, it’s more effective for me to step away from researching theory and into hands-on time with real products, configuration files, and code. Real, working tools enable me to experiment and feel grounded in a base understanding that I can always return to should I get lost during the exploration process.
Starting the Learning Journey
I’m new to GenAI, so this post won’t cover deep topics. At the same time, I won’t attempt to address all of the fundamentals. Instead, it will capture an overview of my hands-on experience with LLMs.
The Task at Hand
I’ve managed to get myself enrolled in a couple of the GenAI learning and certification paths at work. It’s a pretty simple process: watch a bunch of pre-recorded videos on the fundamentals, take a test, and then complete a Capstone project. I’ve completed everything except for the Capstone, which stands between me and my goal.
The Capstone:
- Create a hands-on demo using GenAI products to solve a compelling use case
- For me, the harder goal: come up with a compelling use case 🤔
I can experience paralysis when it comes to finding an idea. After spinning my wheels on lackluster ideas, I posted a solicitation for others to team up as a small group. I love working in a small group to openly bounce thoughts around. Lucky for me, two others also wanted to collaborate. They had multiple ideas, so we were able to make a quick decision and get to work.
Our Use Case
The customer wants the ability for non-SQL users to use “natural language” to query a database. They have a few requirements.
- Business users should be able to query BigQuery using natural language
- A metadata table will be used to specify queryable tables
- The solution should support multiple tables (JOIN statements)
Hypothetical ‘Natural Language’ Examples
- Summarize the financials for Q3
- What are the 10 best-selling products and SKUs this month?
- List names of departments that exceeded their budgets last month.
Approach
Due to my limited experience with LLMs, I had assumed that ChatGPT was the LLM. In my mind, that meant I chat with it and it returns answers, much like this image portrays.
Bard and ChatGPT make it look easy, so I figured we would be experimenting with different prompts to guide the magic crystal ball to deliver what we want.
Roll up them Sleeves
It doesn’t go as smoothly as the image above suggests.
Getting Started
For this experiment, we used Google’s Vertex AI, a commercial-grade ML platform that businesses can use to accelerate development of their AI use cases. We started dabbling with prompting in the Generative AI Studio to get familiar with the terminology, concepts, and basics of using an LLM. With regard to generating SQL, we learned that you have to get very explicit with table names, column names, and query logic to have the LLM generate a useful SQL query.
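To give a feel for just how explicit, here’s a minimal sketch of the kind of prompt and SDK call we were playing with. The project, location, model name, and schema details are placeholders from my own experiments, not a recommendation.
# A minimal sketch of prompting Vertex AI's text model for SQL from Python.
# Assumes the google-cloud-aiplatform SDK; the project, location, and the
# table/column names in the prompt are illustrative placeholders.
import vertexai
from vertexai.language_models import TextGenerationModel

vertexai.init(project="my-project", location="us-central1")
model = TextGenerationModel.from_pretrained("text-bison")

prompt = """Write a BigQuery SQL query.
Table: `my-project.my_dataset.mascots` with columns (market, mascot_name).
Question: list the mascot names that contain the word 'bird'.
Return only the SQL."""

response = model.predict(prompt, temperature=0.0, max_output_tokens=256)
print(response.text)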
After realizing that the LLM on its own is pretty limited with SQL, we did some searching and learned that others have used LangChain to layer in additional SQL capabilities.
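In its simplest form, per my reading of the LangChain docs at the time (an older, pre-1.0 release), that layering looks roughly like the sketch below; the connection string and the question are placeholders.
# Rough sketch of the basic LangChain SQL layering (older, pre-1.0 LangChain API).
# The BigQuery connection string and the question are illustrative placeholders.
from langchain.llms import VertexAI
from langchain.sql_database import SQLDatabase
from langchain.chains import SQLDatabaseChain

llm = VertexAI(model_name="text-bison", temperature=0)         # Vertex AI text model
db = SQLDatabase.from_uri("bigquery://my-project/my_dataset")  # needs sqlalchemy-bigquery
chain = SQLDatabaseChain.from_llm(llm, db, verbose=True)

print(chain.run("How many mascots are in the mascots table?"))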
I used to work around Data Scientists, so I’ve heard of ‘notebooks’. I remember they often contained R code, and I had assumed they were mostly for super nerdy scientific things. This capstone is a perfect opportunity to create my first notebook, and I was delighted to learn that they’re not mysterious or intimidating. A notebook is essentially a simple IDE that lets you group blocks of code and notes to experiment and test ideas.
Notebook Setup
I’m fluent enough with a hammer, screwdriver, wrench, and Python to pump out an acceptable solution most of the time. The notebook started out pretty standard as Python goes; a rough sketch of these cells follows the list.
- Install the Python packages
- Set up the ENV variables and configuration
- Write the import statements
- Set global Python vars
- Copy/paste the wrapper functions we found into the notebook
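For the curious, those setup cells looked roughly like this. The package list, variable names, and defaults are my own choices and assumptions, so adjust for your environment.
# Rough sketch of the setup cells; package and variable names are my own choices.

# (cell 1) install the Python packages
# !pip install google-cloud-aiplatform google-cloud-bigquery sqlalchemy sqlalchemy-bigquery langchain

# (cell 2) imports, environment variables, and configuration
import os
from google.cloud import bigquery
from sqlalchemy import create_engine, MetaData

PROJECT_ID = os.environ.get("PROJECT_ID", "my-project")   # hypothetical defaults
DATASET_ID = os.environ.get("DATASET_ID", "nl2sql_demo")
REGION = "us-central1"

# (cell 3) global Python vars used throughout the notebook
table_names_list = ["mbb_teams", "mascots"]  # the sample tables we load later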
Read About The Wrapper Code
When I encounter new code that I have no understanding of, I look up the libraries in the import statements. I start with the imports of the wrapper code to get a high-level understanding of each library’s purpose. Looking at this code, here’s where I go.
- What is pydantic? A quick Google search and a bit of reading: it looks very data-validation oriented. Important, but not where I need to go deep. Not yet anyway.
- LangChain LLM. Okay, this one looks important. I’m asking myself “what is it?”
  - Luckily, their docs are pretty solid
  - The getting started guide looks intuitive
  - It looks straightforward enough that I wonder, “why do I need LangChain?” It seems too simple
  - Backing it up a bit: the LangChain Introduction
  - This book is super interesting!
  - I find it so interesting that I read all the chapters even though I know I don’t need to worry about some topics right away. Check it out, it’s a fun read.
- Back to the code. Now that I’m familiar with some library basics, on to the functions (a minimal sketch of the kind of wrapper involved follows this list).
  - Read through each function. I don’t understand all the details, but I have the general idea.
  - Start trying to use the functions: llm._call("A tiger has") → Response: "stripes"
- Find some sample data to play with
  - Browse through the BigQuery sample datasets.
  - Some datasets are much more complex than I need. Others don’t seem to have multiple tables that I can test a JOIN against.
  - Here’s an NCAA basketball dataset that has table dependencies and is easy to understand. The tables for NCAA teams and team mascots look simple to test against.
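For context, the copy/pasted wrapper functions amounted to something like the sketch below: a thin custom LangChain LLM wrapped around Vertex AI’s text model (which is also why pydantic shows up in the imports). This is my own minimal reconstruction under those assumptions, not the exact code we used; LangChain’s built-in VertexAI class from the earlier sketch can serve the same purpose.
# Minimal reconstruction of a custom LangChain LLM wrapper around Vertex AI's
# text model. This mirrors the general shape of the wrapper we copied, not the
# exact code; LangChain's built-in VertexAI LLM can replace it.
from typing import Any, List, Optional

from langchain.llms.base import LLM
from vertexai.language_models import TextGenerationModel


class VertexLLM(LLM):
    # pydantic-style fields (the LLM base class is a pydantic model)
    model_name: str = "text-bison"
    temperature: float = 0.0
    max_output_tokens: int = 256

    @property
    def _llm_type(self) -> str:
        return "vertex_ai"

    def _call(self, prompt: str, stop: Optional[List[str]] = None, **kwargs: Any) -> str:
        model = TextGenerationModel.from_pretrained(self.model_name)
        response = model.predict(
            prompt,
            temperature=self.temperature,
            max_output_tokens=self.max_output_tokens,
        )
        return response.text


llm = VertexLLM()
print(llm._call("A tiger has"))  # e.g. "stripes"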
Writing Some Code
The first code we write stands up the needed database infrastructure; a rough sketch follows the list.
- Initialize the BigQuery client
- Create schema, tables, and load the datasets
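Here is roughly what that looked like, assuming we copy the two NCAA tables out of BigQuery’s public dataset; the dataset ID and the exact table names (mbb_teams, mascots) are my assumptions.
# Rough sketch of standing up the demo dataset. The public NCAA table names
# (mbb_teams, mascots) and the dataset/engine naming are assumptions on my part.
from google.cloud import bigquery
from sqlalchemy import create_engine

bq_client = bigquery.Client(project=PROJECT_ID)
bq_client.create_dataset(f"{PROJECT_ID}.{DATASET_ID}", exists_ok=True)

# Copy each sample table from the public dataset into our own dataset.
for t in table_names_list:  # e.g. ["mbb_teams", "mascots"]
    bq_client.query(
        f"""CREATE OR REPLACE TABLE `{PROJECT_ID}.{DATASET_ID}.{t}` AS
            SELECT * FROM `bigquery-public-data.ncaa_basketball.{t}`"""
    ).result()

# SQLAlchemy engine (via sqlalchemy-bigquery) used by the connectivity tests below.
engine = create_engine(f"bigquery://{PROJECT_ID}/{DATASET_ID}")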
Simple SQL Queries to Validate Connectivity
# @title Test query on each table
for t in table_names_list:
    query = """SELECT * FROM `{PROJECT_ID}.{DATASET_ID}.{t}` LIMIT 10""".format(
        PROJECT_ID=PROJECT_ID, DATASET_ID=DATASET_ID, t=t
    )
    print(engine.execute(query).first())
SUCCESS! We’re able to use our BigQuery client to query the database.
Results:
'fe21a988-3e96-4f3c-8f9f-b449ccda43b0', 'Texas A&M-CC', 'Islanders', <OMITTED>
'UCLA', 'UCLA', 'Bruins', 'ec0d6b67-4b16-4b50-92b2-1a651dae6b0f', <OMITTED>
Write a Function for the Natural Language Query
def bq_qna(question):
    # create SQLDatabase instance from BQ engine
    db = SQLDatabase(engine=engine, metadata=MetaData(bind=engine), include_tables=table_names_list)

    # create SQL DB Chain with the initialized LLM and above SQLDB instance
    db_chain = SQLDatabaseChain.from_llm(llm, db, verbose=True, return_intermediate_steps=True)

    # define prompt for BigQuery SQL
    _googlesql_prompt = """You are a BigQuery SQL expert. Given an input question, first create a syntactically correct BigQuery query to run, then look at the results of the query and return the answer to the input question.
Unless the user specifies in the question a specific number of examples to obtain, query for at most {top_k} results using the LIMIT clause as per BigQuery SQL. You can order the results to return the most informative data in the database.
Never query for all columns from a table. You must query only the columns that are needed to answer the question. Wrap each column name in backticks (`) to denote them as delimited identifiers.
Pay attention to use only the column names you can see in the tables below. Be careful to not query for columns that do not exist. Also, pay attention to which column is in which table.
Use the following format:
Question: "Question here"
SQLQuery: "SQL Query to run"
SQLResult: "Result of the SQLQuery"
Answer: "Final answer here"
Only use the following tables:
{table_info}
Question: {input}"""

    BigQuerySQL_PROMPT = PromptTemplate(
        input_variables=["input", "table_info", "top_k"],
        template=_googlesql_prompt,
    )

    # pass the question to the prompt template
    final_prompt = BigQuerySQL_PROMPT.format(input=question, table_info=table_names_list, top_k=100)

    # pass the final prompt to the SQL chain
    output = db_chain(final_prompt)

    return output['result'], output['intermediate_steps'][0]
Execute the First Natural Language Test
# @title Test `which mascots are birds?` (FAIL)
bq_qna('which mascot names are birds')
SQLQuery:SELECT DISTINCT mascot_name FROM mascots WHERE mascot_name LIKE '%bird%'
Results:
SQLResult: [('Hokiebird',), ('Blackbirds',)]
Answer:Hokiebird, Blackbirds
> Finished chain.
('Hokiebird, Blackbirds',
"SELECT DISTINCT mascot_name FROM mascots WHERE mascot_name LIKE '%bird%'")
It works! Wow. That’s incredible!
Wait a minute. 🤔 I remember another bird name in that dataset that I don’t see in this list.
Looking More Closely at the SQL Query
"SELECT DISTINCT mascot_name FROM mascots WHERE mascot_name LIKE '%bird%'")
It’s a LIKE query! So it only matches mascot names that contain the literal string ‘bird’. It’s not reasoning about each mascot to decide whether it is a bird.
It has me thinking at this point that my ‘natural language’ question may not be interpreted the way I assumed it would be.
Run Another Test
Let’s try something more conceptual but along the same lines.
# @title Test `which mascots can fly?` (FAIL)
bq_qna('which mascot names can fly?')
Results:
SQLQuery:SELECT DISTINCT mascot_name FROM mascots WHERE mascot_name LIKE '%fly%'
SQLResult: []
Answer:[]
> Finished chain.
('[]',
"SELECT DISTINCT mascot_name FROM mascots WHERE mascot_name LIKE '%fly%'")
This query also falls back to a LIKE match, this time on ‘%fly%’. I’m left wondering exactly what most people expect from the term ‘natural language’.
For someone who’s familiar with SQL, there’s not a huge advantage to using natural language in examples like these:
Natural:
Show me all the mascot names from the mascots table
SQL:
SELECT mascot_name FROM mascots;
Natural:
Show me all mascot names with bird in the name
SQL:
SELECT mascot_name FROM mascots WHERE mascot_name LIKE '%bird%';
Just what do most people think they’re getting when they hear ‘query a database with natural language’? For me, the examples above really lack ‘magic’, so I think we need to prove that a truly conceptual query, like ‘which mascots can fly?’, is possible.
Next time we pivot and forge ahead!