AI, or artificial intelligence: are you bored of hearing about it yet? Between the stock market and CEO keynotes, we can't seem to get away from it. It promises to revolutionize everything around us. We'll have robots mowing our lawns, artificial intelligence teachers teaching us on our tablet computers, etc. We've all seen those incredible demos, yet when I want to build something useful with it, it comes down to a hard choice: rote theoretical concepts that seem far removed from usable code, or calling APIs with a key and giving up all my data to some random company out there. Where's that robot mowing my lawn?

The issue is that I don't want to offer up the corpus of my data to some random service out there. Maybe it's a data security or privacy issue, or maybe I need quicker response times. Maybe it's a matter of ongoing cost, or maybe I just don't have internet connectivity. For instance, wouldn't it be nice if I could just ask questions of “my” email over a simple chat-based UI, without having to share my life history with Apple or Microsoft? What if I were taking up a new job, and the new job shared 50 pages of fine-print details with me? I must accept the offer within 24 hours, and I have so many questions. I wish I could just ask them. Or what if I had lots and lots of old financial data, and I wanted to ask a simple question, like “What is this $58.76 expense about?” and my computer had the intelligence to OCR all my receipts and answer in simple English, like “This $58.76 receipt was for tolls on your trip to customer xyz in city def”? Or maybe my lawn-mowing robot ran out of Wi-Fi range and needs to decide quickly if it's okay to mow over the rabbit? For the record, it is never okay to do that.

I've seen all these demos, yet when I sit down to solve these basic problems, all these promises by all these magnificent companies fall short.

Why can't I ask my computer such simple questions?

I'm a developer, so I set out to solve this problem. By the end of this article, I'll show you how you can build an application that works completely offline, sends no data to the cloud, and is able to answer simple questions from any source of knowledge. Sort of your own private ChatGPT.

For fun, I'll use my previous CODE Magazine article about VSCode tips (https://www.codemag.com/Article/2409031/More-VS-Code-Tips) and ask some simple questions about VSCode. But you can point this application at really any source of data—your company's documentation, your HR department's health benefits booklet, the State of the Union address, a company's quarterly report, or some deep research on quantum physics—and ask a simple question, like “What's an infinite well?” or “What's the copay if I visit a specialist?” or “Do I need a referral for a specialist in my medical plan?” or “What's the CHIPS and Science Act?” and get legit answers.

Tell me: Would you find it useful if you had a PDF manual for your car, and could ask a simple question like, “Does the B2 service for my car involve a coolant flush?”

Let's build an application that takes in a corpus of any offline information and allows you to ask questions of it.

What You're Going to Need

To follow this article, you'll need a beefy machine. You're not going to rely on the cloud to build the model for you; you'll need powerful local compute. This means either a higher-end Windows/Linux laptop or one of the newer Macs. And yes, you'll need a GPU. AI involves a lot of calculations, and to speed things up, many of them are offloaded to the GPU. The difference between doing everything on the CPU vs. the GPU is astronomical. For my purposes, I'll use my rusty-but-trusty M1 Max MacBook Pro. It's a few years old, but it has enough oomph to work on thousands of pages of text, which is good enough for my needs. Hopefully, you have a similarly equipped machine; if not, you can follow along with a smaller input dataset.

Also, you'll download and use standard libraries, packages, and large language models that other companies and people have built. But when you're done, no data will be sent to the cloud, and your application will be able to run completely offline. To get started, though, you'll need an internet connection.

Also, I'll use Python, so ensure that you have Python 3.x installed.

Ollama

Ollama helps you run large language models locally. There are many ways to run a large language model on your own machine, and Ollama is one of the easiest to get started with; it also has a nice library of models you can pick from. Ollama is to large language models a bit like what Docker is to your code. It isn't your only option, though. Other choices include Hugging Face Transformers, LLaMA, Mistral, LocalLLaMA, Cerebras, etc.

The thing is, my puny little Mac isn't going to learn to understand English or Spanish on its own. It's going to need help. And large companies like Meta, Microsoft, and Google have spent billions of dollars of compute and terabytes of data to create large language models to get us started. Although I do want them to understand my corpus of data, it certainly helps that they already understand so much else. The model I downloaded is just a gigantic formula they've pre-built for me. All I have to do is call the formula with my inputs.

Picking the Right Model

You can see that Ollama already supports nearly any large language model you may be interested in here: https://ollama.com/library. There are many large language models that are general purpose, such as Gemma2 from Google, Llama3 from Meta, or Phi3 from Microsoft. Each of these models is a snapshot of the world around the time the model was created and published. Each has its own strengths and weaknesses: some are designed for accuracy, some for speed. You can also mix and match these models. For instance, maybe my target application is very code-centric. Why should I pay the overhead for the knowledge of what kind of oil a Hyundai engine needs? Instead of Llama3, I'd use Code Llama, a model designed for generating and discussing code. It's built on top of Llama2 and can generate both code and natural language about code.

When picking a model, there are a few things to consider. Naturally, every large tech company has a model to offer, and everyone will tell you their model is the best. The reality is that they're all imperfect. And depending upon how you choose to run a model, you may or may not be able to pick the model you like.

When picking a model, you'll typically find it available in several parameter sizes. For example, Llama3.1 is available in 8B, 70B, and 405B parameter sizes. That's B for billion.
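In Ollama, the parameter size is usually just a tag on the model name. Assuming the tags in the Ollama library haven't changed, pulling a specific size looks like this:

ollama pull llama3.1:8b
ollama pull llama3.1:70b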

When we speak of large language models, model size refers to the number of parameters in the model. Parameters are the internal variables that the model uses to make predictions or generate text.

Think of parameters as the possible number of inputs to a formula. The more parameters a model has, the more complex and nuanced its behavior can be. However, increasing the number of parameters also increases the risk of overfitting. Overfitting is a common problem in machine learning, including large language models. It occurs when a model is too complex and learns the noise or random fluctuations in the training data rather than the underlying patterns and relationships. Think of overfitting like trying to draw a curve through a set of points on a graph. If the curve is too complex, it fits the noise in the data rather than the underlying trend.
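If you want to see overfitting in action, here's a tiny sketch. It uses numpy (not otherwise needed for this article) to fit both a straight line and a very flexible polynomial to the same ten noisy points; the flexible one memorizes the noise and falls apart on a point it hasn't seen.

import numpy as np

# Ten noisy samples of a simple underlying trend: y = 2x + noise
rng = np.random.default_rng(42)
x = np.linspace(0, 1, 10)
y = 2 * x + rng.normal(0, 0.1, size=10)

# A simple model (degree 1) captures the trend; an overly flexible
# model (degree 9) fits every wiggle of the noise as well.
simple = np.polynomial.Polynomial.fit(x, y, deg=1)
flexible = np.polynomial.Polynomial.fit(x, y, deg=9)

# Evaluate both on a point the models never saw
x_new = 1.2
print("underlying trend:", 2 * x_new)
print("simple model:", simple(x_new))
print("flexible model:", flexible(x_new))  # typically lands far from the trend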

Overfitting is when a model is too complex and learns the noise or random fluctuations in the training data rather than the underlying patterns and relationships.

Larger models with more parameters tend to perform better because they can capture more subtle patterns and relationships in the data, leading to better performance on a wide range of tasks. On the other hand, the larger the model, the more computational resources you're going to need: more memory, more processing power, and more training data to learn effectively, which can be a challenge for smaller organizations or those with limited resources.

Parameter size isn't the only factor you should consider, though. There are a few other things you should look for when picking a model.

You should consider the architecture of the model, including the type of layers, activation functions, and attention mechanisms.

You should consider the training data used to build the model. The old adage applies: garbage in, garbage out.

Certain models are just better at certain things. For example, some models are great at code analysis, whereas others are great at translations, etc.

So, let's say we have Code Llama. Code Llama is a model for generating and discussing code. It's built on top of Llama 2 and it's designed to make workflows faster and more efficient for developers, and make it easier for people to learn how to code. It can generate both code and natural language about code. And it supports many popular programming languages, such as Python, C++, Java, PHP, TypeScript (JavaScript), C#, Bash, and more.

Tell me, would you find it useful if you had a locally running model that you could point your code to and it explained in English what the code does? Would that be helpful for generating documentation or comments? Or perhaps reviewing pull requests? What about when you're working with a team that insists on writing in Java, but you're a C# person, and you need to convert a complex routine your buddy on the other team wrote from Java into C#? Code Llama can help.

Or what if you're dealing with something trivial yet frustrating, like writing the jQuery syntax to select every alternate row in a table? You know what? I downloaded Code Llama and ran this exact prompt, and here's what it gave me:

// Select every even row
$("table tr").filter(":even");
// Select every odd row
$("table tr").filter(":odd");

I don't know about you, but I'm impressed. How about a more complex task, this time posed to Llama3.1?

Write an Azure CLI script that provisions three Windows Server 2022 virtual machines behind a load balancer and that exposes port 443.

Honestly, I didn't have high hopes this would work. But it did. I wanted to paste the whole script here for you to see and review, but my editor insists I break every code listing into lines of 70 characters or less, so I just said:

Please break down this script so it is of column width 70.

And it did. I've pasted the script in Listing 1; you be the judge! Frankly, my socks were blown off. I'm never writing another script by hand again, and unfortunately this probably also means I'm going to fail all LeetCode interviews going forward. Why are those interviews even relevant in today's world?

Listing 1: Azure CLI script for a relatively complex task

#!/bin/bash

# Set variables
RESOURCE_GROUP="myResourceGroup"
LOCATION="eastus"
VM_NAME_PREFIX="myVm"
VM_COUNT=3
LOAD_BALANCER_NAME="myLoadBalancer"
VIRTUAL_NETWORK_NAME="myVirtualNetwork"
SUBNET_NAME="mySubnet"
PUBLIC_IP_ADDRESS_NAME="myPublicIpAddress"
SECURITY_GROUP_NAME="mySecurityGroup"

# Create resource group
az group create \
    --name $RESOURCE_GROUP \
    --location $LOCATION

# Create virtual network and subnet
az network vnet create \
    --resource-group $RESOURCE_GROUP \
    --name $VIRTUAL_NETWORK_NAME \
    --address-prefix 10.0.0.0/16 \
    --subnet-name $SUBNET_NAME \
    --subnet-prefix 10.0.1.0/24

# Create public IP address
az network public-ip create \
    --resource-group $RESOURCE_GROUP \
    --name $PUBLIC_IP_ADDRESS_NAME \
    --allocation-method Dynamic

# Create load balancer
az network lb create \
    --resource-group $RESOURCE_GROUP \
    --name $LOAD_BALANCER_NAME \
    --sku Basic \
    --frontend-ip-name $LOAD_BALANCER_NAME \
    --public-ip-address $PUBLIC_IP_ADDRESS_NAME

# Create backend pool
az network lb address-pool create \
    --resource-group $RESOURCE_GROUP \
    --lb-name $LOAD_BALANCER_NAME \
    --name myBackendPool

# Create health probe
az network lb probe create \
    --resource-group $RESOURCE_GROUP \
    --lb-name $LOAD_BALANCER_NAME \
    --name myHealthProbe \
    --protocol Tcp \
    --port 443

# Create load balancer rule
az network lb rule create \
    --resource-group $RESOURCE_GROUP \
    --lb-name $LOAD_BALANCER_NAME \
    --name myRule \
    --protocol Tcp \
    --frontend-port 443 \
    --backend-port 443 \
    --frontend-ip-name $LOAD_BALANCER_NAME \
    --backend-pool-name myBackendPool \
    --probe-name myHealthProbe

# Create security group
az network nsg create \
    --resource-group $RESOURCE_GROUP \
    --name $SECURITY_GROUP_NAME

# Create security rule for port 443
az network nsg rule create \
    --resource-group $RESOURCE_GROUP \
    --nsg-name $SECURITY_GROUP_NAME \
    --name myRule \
    --priority 100 \
    --source-address-prefix '*' \
    --destination-address-prefix '*' \
    --destination-port-range 443 \
    --access Allow \
    --protocol Tcp

# Create virtual machines
for i in $(seq 1 $VM_COUNT); do
    az vm create \
        --resource-group $RESOURCE_GROUP \
        --name ${VM_NAME_PREFIX}${i} \
        --image MicrosoftWindowsServer:WindowsServer:2022-Datacenter:latest \
        --size Standard_DS2_v2 \
        --vnet-name $VIRTUAL_NETWORK_NAME \
        --subnet $SUBNET_NAME \
        --nsg $SECURITY_GROUP_NAME \
        --admin-username azureuser \
        --admin-password P@ssw0rd1234!

    az network nic create \
        --resource-group $RESOURCE_GROUP \
        --name ${VM_NAME_PREFIX}${i}Nic \
        --vnet-name $VIRTUAL_NETWORK_NAME \
        --subnet $SUBNET_NAME \
        --lb-name $LOAD_BALANCER_NAME \
        --lb-address-pool myBackendPool
done

Anyway, back to models.

Playing with Llama3

Remember, models aren't just about text, and they aren't just GPTs. There are many interesting and fun models around computer vision, audio, and so much more. If you're interested in checking out some of the models available to you, you should check out this website: https://huggingface.co/models.

Let's leave that for another day. For now, let's focus on asking the latest issue of CODE Magazine some basic questions.

To get started, go ahead and install Ollama using the big “Download” button here: https://ollama.com. Once Ollama is downloaded, go ahead and drag-and-drop it to your applications folder and launch it. Once you launch it, you should see a Llama icon in your menu bar. You can also visit http://localhost:11434 and it should show you a message saying “Ollama is running.”
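Incidentally, that port is also a local REST API. Once you've pulled a model (coming up next), you can sanity-check everything from the terminal with a quick curl call; I'm assuming the default port and the llama3 model here:

curl http://localhost:11434/api/generate \
  -d '{"model": "llama3", "prompt": "Why is the sky blue?", "stream": false}'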

What next? Ollama is just the engine. The real fun begins when you start downloading large language models and using them. There are many models you can try. I'll use Llama3. To run Llama3, launch terminal, and issue the following command:

ollama run llama3

Once you issue the command, you should see the prompt like that shown in Figure 1.

Figure 1: Ollama running Llama3

If this is the first time you've run Llama3 and you don't have the model locally, Ollama downloads it first, as can be seen in Figure 2. The model is a few gigabytes, so the download might take a few minutes. Alternatively, you can pull a model ahead of time by issuing a command such as:

ollama pull gemma2
Figure 2: Ollama pulling down a model for first time use

Now that the model is running, let's ask it some questions. For instance, “What kind of oil does a 2023 Toyota Prius need?” You can see the results in Figure 3. Different models will handle this question differently. But I must say, Llama3 handled this pretty well.

Figure 3: Llama3 answers a totally random question.

Now for fun, switch off your internet connection and ask this question again. Yay! It still works. This means that you're running this model 100% locally. In fact, launch Activity Monitor and view the GPU history, and you'll notice that the model is making good use of your local GPU, as can be seen in Figure 4.

Figure 4: Llama3 making use of my GPU

For fun, I asked another question, “Write me a poem about CODE Magazine”, and here are the first four lines of a 16-line poem it wrote for me.

CODE Magazine, a tale so fine,

A haven for coders, where knowledge is divine.

Within its pages, wisdom abounds,

As developers share their secrets, and coding bounds.

Okay, this is getting to be real fun. “Write me a joke about Rod Paddock.”

Why did Rod Paddock bring a ladder to the SQL Server conference?

Because he wanted to take his queries to the next level!

Okay, I can't let this go without taking a dig at Markus, so here goes.

Why does Markus have big muscles? Give me a funny answer

Markus has big muscles because he spent years lifting weights made of sauerkraut cans and doing squats on top of giant pretzels. Now, whenever he gets pumped up, his biceps start to glow with an otherworldly aura, making him the envy (and terror) of all the veggie platters in the land!

Ho ho ho! I should just say, “Write me a CODE Magazine article about AI.” Just kidding. I wrote this article by hand—or did I? Hmm.

You'll note that Llama3 isn't up to date on the latest facts and figures. For instance, ask it “What is the current stock price of MSFT?” and it will either show you old information or just point you to online resources. Well, that's not great! What if I wanted to feed it real-time information, such as my company's HR policy around cracking jokes, so I could ask questions of it? You know, something useful, something a developer would need to do.

Putting On My Developer Hat

Okay, you can continue to have fun with Ollama. As long as you're happy with querying a world frozen at the time the model was created, it'll work great for you. Some facts and figures may never change, like how tall Abraham Lincoln was. Alas, he'll never be any taller. But new knowledge keeps getting created. And knowledge in my context is important.

What I want to do is access this model programmatically. I want to create a simple RAG application using LangChain, so I can feed in a PDF and ask questions relevant to it.

RAG

I hate acronyms, so let's define this new acronym. Retrieval-Augmented Generation (RAG) is the process of optimizing the output of a large language model, so it references an authoritative knowledge base outside of its training data sources before generating a response.

Large language models like Llama3 have been trained on vast volumes of data and use billions of parameters to generate original output for tasks. They can do many things, like answering questions, translating languages, and completing sentences. But they can't understand your domain and your data out of the box.

RAG extends the already powerful capabilities of LLMs to specific domains or an organization's internal knowledge base. Best of all, you can do so without having to pay for retraining the model. This makes it a very cost-effective way of extending an LLM so it becomes useful for your context.

To start, ensure that Ollama is running and create a new Python project. I won't delve into the specifics of Python here, so at a high level, I created a venv (virtual environment) with Python 3.x, and I installed the following requirements:

langchain
langchain_community
pypdf
docarray

This is a matter of creating a requirements.txt with the above text and running the following command in terminal in your virtual environment:

pip install -r requirements.txt

With these requirements installed, I created an index.py file where I can start writing some code.

Let's start with something simple. Can I even call the local Llama3 model and ask it simple questions programmatically?

To begin, add the following imports to your code:

from langchain_community.llms import Ollama
from langchain_community.embeddings import OllamaEmbeddings

The first import you see is to a Python package called langchain_community. This package establishes a commonality between several third-party integrations via a common set of base interfaces. So, for instance, I'm importing llms.Ollama. I could just as easily import llms.OpenAI or llms.openai.AzureOpenAI, and many other such examples. This way, I can write code that's similar no matter what model I import. I could then write applications that go to OpenAI when online and to local Ollama when offline.
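Here's a minimal sketch of that idea. It assumes the openai package is installed and that an OPENAI_API_KEY environment variable signals you're online; otherwise it falls back to the local Ollama model used throughout this article.

import os
from langchain_community.llms import Ollama

if os.environ.get("OPENAI_API_KEY"):
    # Online: use a hosted model (assumes the openai package is
    # installed and a valid OPENAI_API_KEY is set)
    from langchain_community.llms import OpenAI
    model = OpenAI()
else:
    # Offline: use the local Llama3 model served by Ollama
    model = Ollama(model="llama3")

print(model.invoke("Say hello in one short sentence."))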

The second package being imported is langchain_community.embeddings. Embeddings in AI are a way of representing high-dimensional data as a set of vectors in a lower-dimensional space. These vectors, or embeddings, capture the relationships between different data points in the original high-dimensional space. Think of it this way: When you're navigating through a city, a high-dimensional space is quite complex. It's a detailed 3D model of the city you're trying to find your way through, and in this model, every street, every building, every restaurant, every place of interest is a point. This representation, although complete, can be very difficult to work with. Embeddings are a simplified map that captures the essential relationships between these points. They make the data easier to work with by reducing its dimensionality while preserving the information you care about. This means you get faster computation and better performance.

In terms of language models, you can think of words as vectors in a way that captures their semantic meaning and relationships. Computer algorithms capture this information by analyzing lots and lots of text in a given natural language and working out common occurrences of words and their relative interrelationships. So the word “deck” can have multiple meanings, but when you say “PowerPoint deck” or “the deck behind a house,” the meaning becomes clearer.
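Here's a small standalone sketch of that idea using the same Ollama embeddings you'll see wired up below. It embeds three phrases and compares them with cosine similarity; I'm assuming Ollama is running locally with the llama3 model already pulled, and you'd expect the two presentation-related phrases to score higher than the porch.

from langchain_community.embeddings import OllamaEmbeddings

embeddings = OllamaEmbeddings(model="llama3")

# Turn each phrase into a vector of floats
slides = embeddings.embed_query("the slide deck for my PowerPoint presentation")
porch = embeddings.embed_query("the wooden deck behind my house")
talk = embeddings.embed_query("presentation slides for a conference talk")

# Plain-Python cosine similarity: closer to 1.0 means more similar
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(x * x for x in b) ** 0.5
    return dot / (norm_a * norm_b)

print("deck (slides) vs. deck (porch):", cosine(slides, porch))
print("deck (slides) vs. conference slides:", cosine(slides, talk))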

Using this model and embeddings in code is quite simple, as can be seen below:

MODEL="llama3"
model = Ollama(model=MODEL)
embeddings = OllamaEmbeddings(model=MODEL)

Once you've instantiated the object, using it is quite simple. I chose to invoke it with a simple question and gave it some additional hints to give me a short but funny answer (below). Really, I'm not interested in the biological aspects of eggs, even though it would be quite egg-straordinary. Just give me the funny bits and keep it short please. This can be done by invoking the model with an input text, as can be seen below.

out = model.invoke("What came first, chicken or egg? Give me a funny and short answer.")
print(out)

Running this gives me a simple output as below.

The age-old question!

Well, let's get to the bottom of this fowl play (get it?). According to expert sources (okay, I made them up), it was actually the egg that came first.

Why, you ask? Well, because dinosaurs used to lay eggs, and then they evolved into chickens. So, in a nutshell (or an eggshell?), the egg came before the chicken!

Well, that was egg-cellent. You can find the full code for this initial very simple example in Listing 2.

Listing 2: Calling an LLM using code

from langchain_community.llms import Ollama
from langchain_community.embeddings import OllamaEmbeddings

MODEL = "llama3"
model = Ollama(model=MODEL)
embeddings = OllamaEmbeddings(model=MODEL)

out = model.invoke("What came first, chicken or egg? Give me a funny and short answer.")
print(out)

I am good at puns with the word “eggs,” aren't I? Well, here is an egg-cellent way of finding great puns using the word “egg.” Just use the language model to answer “Tell me some pun words using the word egg.” Honestly, this saved me a lot of mental egg-cercise.

Okay fun stuff! Let's take this a bit further now.

As you can see, I'm calling model.invoke. But LangChain has the word “chain” in it. The idea is that I can build a chain to process my output. For instance, the output could come back in any format, but I always want to see it as a string so I can print it out.

To achieve this, I'm going to import a parser, as shown below.

from langchain_core.output_parsers import StrOutputParser

Just like the rest of the LangChain ecosystem, these parsers are written against a common base class so they can be used across many LLMs. With the parser imported, I can now build my chain as follows and slightly modify my code:

parser = StrOutputParser()
chain = model | parser
out = chain.invoke("Give me some puns using the word egg")
print(out)

By doing so, I'm effectively still calling model.invoke, passing the output of that to the parser, and then writing out the results. Go ahead and run the application; you should see some egg-straordinary results. You can find the full code for invoking the LLM using a chain in Listing 3.

Listing 3: Invoking the LLM using a chain

from langchain_community.llms import Ollama
from langchain_community.embeddings import OllamaEmbeddings
from langchain_core.output_parsers import StrOutputParser

MODEL = "llama3"
model = Ollama(model=MODEL)
embeddings = OllamaEmbeddings(model=MODEL)
parser = StrOutputParser()

chain = model | parser
out = chain.invoke("Give me some puns using the word egg")
print(out)

Now that you have a good foundation to build upon, you can take the application further and create increasingly complex chains. The next thing to add is prompts. With the help of a prompt, you can get the model to answer based not on its existing knowledge, but on a context you provide. If the model has no idea how to answer a given question based on that context, it should simply say “I don't know!” The code for this can be seen in Listing 4. Let's walk through it.

Listing 4: Answering questions based on an input context

template = """
Answer the question based on the context below. 
If you can't answer the question, say "I don't know".
Context: {context}
Question: {question}
"""

prompt = PromptTemplate.from_template(template)
prompt.format(context="Here is some context", question="Here is a question")

parser = StrOutputParser()
chain = prompt | model | parser

out = chain.invoke({
    "context": "Glyphosates can often cause deadly cancers which can kill people.",
    "question": "Who is Abraham Linclon?"
})

print(out)

The first thing you see in Listing 4 is a prompt template. A prompt template is simply a set of parameters that the user can specify, which can then be used to generate a prompt for the language model. The idea is that you want to give some guidelines to the model, so you can hopefully get a more intelligent answer. Give it a pre-defined structure or format for inputting text.

Let's examine the template a bit more closely.

Answer the question based on the context below. If you can't answer the question, say "I don't know".
Context: {context}
Question: {question}

With a well-designed prompt template, you clarify what you wish to achieve. Here, you're instructing the model to provide an answer from the given context and not to make up things it doesn't know with any degree of confidence.

With the prompt template, you can also provide context. That's the {context} placeholder you see. This is a way of providing relevant information that helps the model provide a better answer. This can be historical information, previous chats, or, as you'll see shortly, input from a CODE Magazine article, based on which I'd like to have questions answered.

With a prompt template, you can also tune the tone and style. For example, in the above template, I can simply add the text “Assume the tone of a comedian,” and the replies will be funny.

With a prompt template, you can also provide guardrails for the answers. For example, if I don't want wordy answers and I want key details listed out, I can add the following instructions to my prompt template (a combined sketch follows):

Instructions:
Keep your answers to less than 100 words
Summarize answers as bullet points
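Putting those pieces together, here's one possible version of the template from Listing 4 with the tone hint and the instructions folded in; the exact wording is just a sketch.

# A combined template: context, instructions, and a tone hint
# (PromptTemplate imported as in Listing 4)
template = """
Answer the question based on the context below.
If you can't answer the question, say "I don't know".
Assume the tone of a comedian.

Instructions:
Keep your answers to less than 100 words.
Summarize answers as bullet points.

Context: {context}
Question: {question}
"""

prompt = PromptTemplate.from_template(template)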

Then you chain the prompt into the LangChain pipeline as usual, and finally, when you invoke the chain, you provide some input and ask a reasonable question.

Now, what's odd here is that the answer to “Who is Abraham Lincoln?” is fairly common knowledge. In fact, if I asked this question of Llama3 directly without context, it would confidently answer that “Abraham Lincoln (1809-1865) was the 16th President of the United States.”

But, given the input context as shown in Listing 4, when I ask the same question, the model simply replies:

I don't know

Now let's ask some questions that the model can answer.

What are Glyphosates?

Glyphosates are a type of herbicide that can be toxic to humans and have been linked to an increased risk of cancer.

How do people die?

Based on the context, people die from the deadly cancers caused by glyphosate exposure.

What causes cancer?

Glyphosate

I have to say that I'm incredibly impressed. Notably, if you ask the same questions of plain Llama3 running under Ollama, you'll get a much different, more detailed answer. That's because plain-vanilla Llama3 lacks the context and the prompt we provided.

Now change the context to something clearly inaccurate. For instance, I changed my context to, “Glyphosates are delicious and can be added to make a cake sweet,” and then I asked a question, “What makes cakes sweet?”

Before I tell you what the model would answer, I want you to look at the first question I asked of the model, where I simply asked “What are Glyphosates?” Note that my context never mentioned that glyphosates are an herbicide. The model knew it was an herbicide. This is based on the massive amount of training the model has had, so it isn't giving you clearly inaccurate answers. Of course, given enough inaccurate context, the model will start spewing out nonsense. Similarly, for the second question, “How do people die?”, the answer clearly says, “Based on the context…”

Take a step back and think for a second. What if I gave you, a human being, lots of inaccurate context? Would you say the earth is flat? At one point, a large percentage of the population did believe the earth was flat. Heck, Galileo was put on trial for arguing that the earth moves around the sun, and the poor chap was right. It really makes you wonder what we're being lied to about today.

Anyway, so what makes cakes sweet? Per the context, it's glyphosates, but the model, based on its previous learning, knows there's something off. It replies as follows.

I think there may be some confusion here!

Glyphosate is actually an herbicide, commonly known as Roundup, used in agriculture. It's not typically used in baking or making cakes.

To answer your question, what usually makes cakes sweet is the addition of sugar, honey, maple syrup, or other sweeteners. The type and amount of sweetener can vary depending on personal preference and recipe.

So, to summarize: Glyphosate doesn't make cakes sweet!

Okay, now that you can provide context, ask questions based on that context, and understand how this works, let's enhance it further. Next, I'm going to provide the last issue of CODE Magazine as input to my model, and then ask questions based on that issue. The input format will be PDF. The code for this can be seen in Listing 5.

Listing 5: Chaining a PDF into my LangChain pipeline

from operator import itemgetter
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.vectorstores import DocArrayInMemorySearch

# model, embeddings, PromptTemplate, and StrOutputParser
# are set up as in the earlier listings
loader = PyPDFLoader("CodeMagJulAug2024.pdf")
pages = loader.load_and_split()

vectorstore = DocArrayInMemorySearch.from_documents(pages, embedding=embeddings)
retriever = vectorstore.as_retriever()
docs = retriever.invoke("programming")
print(docs)

template = """
Answer the question based on the context below. 
If you can't answer the question, say "I don't know".
Context: {context}
Question: {question}
"""
prompt = PromptTemplate.from_template(template)
parser = StrOutputParser()

chain = (
    {
        "context": itemgetter("question") | retriever,
        "question": itemgetter("question"),
    }
    | prompt
    | model
    | parser
)

exit = False
while not exit:
    question = input("Ask a question: ")
    if question == "bye":
        exit = True
    else:
        print(f"Answer: {chain.invoke({'question': question})}")

There are some things in Listing 5 that are immediately familiar. The output parser being used, the template with the context, and the question are all concepts I've already discussed. The new part here is that I'm using PyPDFLoader to load up an input PDF, and creating a vector store using an in-memory search object.

PyPDFLoader is one of the many document loaders available in langchain_community. Each of these document loaders takes on the task of converting a given input into documents that can be used to create a vector store. I encourage you to explore the various other document loaders available in langchain_community.document_loaders. For instance, a very common task you'll run into is creating a vector store directly from a website. Here's how you do that:

from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://www.codemag.com")
docs = loader.load()

Here's another fun loader for you to try: Check out langchain_community.document_loaders.blob_loaders.YoutubeAudioLoader. Wouldn't it be fun to load up the audio of a YouTube video and just ask a question? How many times have you seen an influencer talk about random clickbait stuff for the first nine minutes before getting to the point in the last minute of a 10-minute video? What about WikipediaLoader? Yes, that's there too; try it out (a small sketch follows).
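To give you a taste, here's roughly what WikipediaLoader looks like in use. I'm assuming you've installed the wikipedia helper package (pip install wikipedia) that the loader relies on, and the query text is just an example.

from langchain_community.document_loaders import WikipediaLoader

# Pull a couple of Wikipedia pages on a topic into LangChain documents
loader = WikipediaLoader(query="Large language model", load_max_docs=2)
docs = loader.load()
print(len(docs), docs[0].metadata["title"])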

The next interesting thing you see in Listing 5 is the line below:

vectorstore = DocArrayInMemorySearch.from_documents(pages, embedding=embeddings)

For a simple PDF, this is fine, but for larger sets of data, you'll want something persistent. An in-memory store means that every time you run the program, it starts from zero, and because it can take a few minutes to ingest a PDF document, this gets really annoying. For the sample application it's fine, but in the real world, you'll probably want to save the vector store, and perhaps add or remove documents, without having to recalculate everything.

There are many ways to achieve persistence across runs. langchain_community.vectorstores has many classes that help you target alternate storage locations. For example, you can use Chroma DB. To use it, install it in your project as follows:

pip install chromadb

Once Chroma DB is installed, you can import it into your Python code as follows:

from langchain_community.vectorstores import Chroma

Once imported, you can use Chroma DB to create a persistent store, as shown below:

vectorstore = Chroma.from_documents(pages, embedding=embeddings, persist_directory="./codemag")

retriever = vectorstore.as_retriever()

The first time you run this code, it will take some time to crunch through the PDF. Once it's done, it saves all its work in a directory called “codemag”. The next time you run the program, you can simply check for the existence of the codemag folder, and if it exists, load up the vector store, as shown below:

vectorstore = Chroma(persist_directory="./codemag", embedding_function=embeddings)

Because you skipped all the hard calculations, this loads almost instantly and is ready to use. The full code using Chroma DB and persistent storage can be seen in Listing 6. Anecdotally, when I ran this on my M1 Max, I had MS Word, VSCode, Chrome, and Safari running. The first time I ran this, it took me around seven minutes, and this was one of the rare occasions I heard my M1 Max's fan come on. Well, it's nice to know the fan works. If you own a MacBook Pro, you know the fan nearly never comes on. The second time, the load was nearly instantaneous and the results were the same.

Listing 6: Persisting a vectorstore across runs

import os
from langchain_community.llms import Ollama
from langchain_community.embeddings import OllamaEmbeddings
from langchain_core.output_parsers import StrOutputParser
from langchain.prompts import PromptTemplate
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.vectorstores import Chroma
from operator import itemgetter

MODEL = "llama3"
model = Ollama(model=MODEL)
embeddings = OllamaEmbeddings(model=MODEL)
folder_path = "./codemag"

if os.path.isdir(folder_path):
    vectorstore = Chroma(persist_directory=folder_path, embedding_function=embeddings)
else:
    loader = PyPDFLoader("CodeMagJulAug2024.pdf")
    pages = loader.load_and_split()
    vectorstore = Chroma.from_documents(pages, embedding=embeddings, persist_directory=folder_path)

retriever = vectorstore.as_retriever()
docs = retriever.invoke("VSCode")
print(docs)

template = """
Answer the question based on the context below. 
If you can't answer the question, say "I don't know".
Context: {context}
Question: {question}
"""
prompt = PromptTemplate.from_template(template)
parser = StrOutputParser()

chain = (
    {
        "context": itemgetter("question") | retriever,
        "question": itemgetter("question"),
    }
    | prompt
    | model
    | parser
)

exit = False
while not exit:
    question = input("Ask a question: ")
    if question == "bye":
        exit = True
    else:
        print(f"Answer: {chain.invoke({'question': question})}")

From seven minutes to an instant start for a simple PDF? I'd call that an improvement. What does it mean when ingesting terabytes of data? You know, like that stupid HR manual that won't let me crack jokes at work.

Once you've created the vector store with the given embeddings, you create a retriever from it. When you invoke the retriever with a query, it returns the four most relevant documents from the input source. This is, of course, configurable.
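For example, to change how many chunks come back, you can pass search parameters when creating the retriever. Here's a sketch that asks the Chroma store from Listing 6 for the eight most relevant chunks instead of four:

# Return the top 8 matching chunks instead of the default 4
retriever = vectorstore.as_retriever(search_kwargs={"k": 8})
docs = retriever.invoke("VSCode")
print(len(docs))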

Now, I know I wrote some stuff about VSCode in the July/August 2024 issue of CODE Magazine, so when I call retriever.invoke with “VSCode” as the input, the retriever indeed shows me the four locations in the PDF where VSCode is most relevant. This can be seen in Figure 5. Of course, I knew this already because I diligently read the magazine cover to cover. Or did I? Hmm…

Figure 5: VS Code returns results

But wait! There's more! I can now add the retriever to my chain and run a loop that lets the user ask questions of my document in a free-form manner. As you can see in Listing 5, the loop keeps accepting whatever questions the user wishes to ask until the user says “bye.”

Let's try a few inputs.

Who is a huge fan of VSCode?

Based on the context, it seems that the author of this text is a huge fan of VSCode. They mention using VSCode for various purposes, including taking meeting notes in markdown, and appreciate its features such as Emmet expansion, multiple cursors, and code selection.

How do you hide files in VS Code?

According to the context, you can hide files via a setting. For instance, the setting below in your .vscode\settings.json folder hides the node_modules folder from your view, even though it may exist on the disk.

{
    "files.exclude": {
        "node_modules": true
    }
}

This setting can be found in your .vscode\settings.json file.

Try comparing the results you see in these questions with what's mentioned in the article. The answers are being pulled out of the input document you provided. Now, this knowledge about hiding files is public information. Try asking this question directly of Llama3, and you'll get a much more generic answer.

Now let's try asking for something that's only mentioned in the July/August article.

How do I create the CustomerController class?

Based on the provided context, it seems that you are implementing an ASP.NET Core application with MVC. To create the CustomerController class, you would typically follow these steps:

.. (detailed answer omitted for brevity)

Honestly, look at the article in its original form and compare this answer. This answer gives much more actionable information.

Now, is this model perfect? No. The input PDF had a lot of structure. For instance, nowhere does it say “the author of this article is .. xyz”; it just shows the author bio as a sidebar. This is where you could spend some effort tweaking the input, including things like image recognition, to make your data corpus more useful. You'll also find methods on the vectorstore object for adding meaningful context to make the LLM more useful. Think of it like the “best bets” feature in a search engine: If you see your users asking the same question over and over and the LLM giving a less-than-ideal answer, teach your LLM a bit by adding that context in plain English. Maybe in my next article I'll do it in Spanish too, all using AI of course.
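As a sketch of that “best bets” idea, you could push a small hand-written note straight into the persistent Chroma store from Listing 6; the text and metadata here are purely illustrative placeholders.

# Teach the vector store one extra fact in plain English ("best bets" style).
# vectorstore is the Chroma instance from Listing 6; the note is a placeholder.
vectorstore.add_texts(
    texts=["The VS Code tips article in this issue was written by <author name>."],
    metadatas=[{"source": "best-bets"}],
)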

But if you're faced with a large tome of boring unstructured text and you want to quickly make sense of it, this is a fantastic way of achieving that goal.

Wouldn't it be nice if you were reading the script of a complex TV series like, say, Game of Thrones, getting really confused among all those characters and mythical lands, and you could simply ask a question like:

Who were the Lannisters?

According to the context, the Lannisters are a noble family from Casterly Rock, and the following individuals are mentioned as being part of this family:

  1. Tyrion (also known as Imp or Halfman), who is the narrator and a dwarf.

  2. Tywin (Tyrion's father), who was the Hand of the King for twenty years.

  3. Cersei (Tyrion's sister), who is married to King Robert Baratheon.

  4. Kevan Lannister (Cersei's brother), who is mentioned as part of the party traveling with King Robert.

You can imagine what an unimaginable leap this is for me. I always fall asleep halfway into each episode, but now I can finally wrap my head around all those characters and appear somewhat knowledgeable. So when I'm suddenly woken up because the Mrs. turned on the waterworks over some emotional scene, I can consult my LLM to quickly get up to speed on what's going on, and precisely calculate the consoling time required before we can switch to the sports channel. AI is good for my social life.

As impressive as this is, remember that all of it is running locally on my very ordinary off-the-shelf Mac. You can easily tweak this to use OpenAI and GPT-4, and the answers will be more accurate. But it's only a matter of time before local models become more powerful, and with some fine-tuning and better prompt engineering, locally running models are already very useful. This is why I like Apple's approach to AI so much: AI responsibilities are handled locally first and provisioned in the cloud on demand when needed.

Summary

I've been in this industry for a few decades now, and when the first gigabyte hard disk rolled around, I was floored by the amount of storage packed into such a small space. I started doing some research on how much storage a human mind has, and how much compute. I knew right there that, within my lifetime, I'd see computers with more storage and compute power than a human being. The cloud was unimaginable back then, but it has brought that reality to the forefront much sooner than I'd imagined.

What's the storage and compute power of the cloud? And what will we do with it?

What I find even more amazing, though, is how much power our local devices have. The next version of iOS will do a bunch of AI compute locally and punt to the cloud where necessary. The newest ARM chips that Windows PCs run on have an NPU, a neural processing unit. The Pixel phone has always been more about the software than the hardware.

Sure, Windows ships with some features that leverage AI, but I truly feel the real power will be unlocked by developers around the world once these PCs with NPUs are commonplace and there's a serious developer story around them. Even today, we're only beginning to tap into the value of local AI.

Let's fast forward a bit. In a world where every local device will have local AI capabilities, what will using a computer be like? Will I be able to ask my computer to sift through all the pictures of a loved one I lost, create a vision depth video, and let me relive the moments with an Apple Vision Pro and literally interact with an absent loved one in a manner so convincing that reality and imagination start to blend together?

Creepy? Or cute? You decide.

I can tell you, when photographs were invented, people had the same misgivings. People didn't want their picture taken because they were afraid the camera would capture their soul. And yet here we are.

Pair this with AR (augmented reality). What I wouldn't give to call out to my dog, who unexpectedly passed away at the young age of four, just one more time, and have him come running to me, like he always did, with a ball in his mouth.

Would it not be truly amazing to have AR glasses that let me zoom, autocorrect for my vision, and see in the dark; that show me notifications and directions; that use AI and measured biometrics to tell me to chill a bit and save me from a heart attack; and that detect important biomarker changes and persuade me to get screened for cancer?

Or consider how populations' minds could be controlled by feeding them convincing but incorrect stories, literally making “seeing is believing” obsolete. What are the implications of such power used for good or for bad?

The possibilities are both scary and feel like a next-level evolutionary leap. And I'll see it, all in this lifetime.

What an incredible time to be around. Until the next time!