Large language models (LLMs) are no longer limited to cloud environments. With tools like Ollama, developers can run powerful language models directly on their own machines, gaining full control over data, costs, and performance. Cloud-based services from providers such as OpenAI and others have made LLMs widely accessible, but they require sending prompts and data to external servers, which can raise concerns around privacy, compliance, and data ownership. Running LLMs locally is particularly appealing for scenarios that demand confidentiality, offline access, or low-latency responses, all without relying on external APIs. This article explores how Ollama makes it simple to download, manage, and run LLMs locally, offering developers a practical way to experiment, prototype, and deploy AI-powered applications entirely on their own infrastructure.

The Need to Run LLMs Locally

While cloud-based LLM services are convenient and scalable, they are not always the ideal solution. Running models locally addresses several practical and strategic concerns. Data privacy is a key driver—sensitive information, such as proprietary code, customer data, or internal documents, can be processed without leaving the local machine or network. This is especially important in regulated industries or environments with strict compliance requirements.

Cost and reliability are additional considerations. Cloud APIs typically charge per token or request, which can quickly become expensive during experimentation, development, or high-volume usage. Local inference eliminates these recurring costs and removes dependency on internet connectivity, making applications more predictable and resilient. Moreover, running LLMs locally gives developers finer control over model versions, updates, and performance tuning, allowing experimentation with different models and configurations without external constraints. These factors make local LLM deployment increasingly attractive for learning, prototyping, and building real-world AI applications.

Hardware Requirements for Running LLMs Locally

Running large language models locally places direct demands on your hardware, and the exact requirements depend on the size of the model, the precision used, and the type of workload (experimentation, inference, or fine-tuning). Understanding these requirements upfront helps set realistic expectations and avoid performance bottlenecks.

At a minimum, a modern CPU with sufficient RAM can run smaller LLMs (such as 3B–7B parameter models), especially when they are quantized. In these setups, system memory becomes critical—models must fit entirely in RAM, along with additional space for context and intermediate computations. As a rough guideline, a quantized 7B model typically requires 6–8 GB of RAM, while larger models scale up quickly beyond that. Solid-state storage (SSD) is also recommended to reduce model load times and improve overall responsiveness.
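If you want a quick sanity check before downloading a model, you can estimate its memory footprint from the parameter count and the bits used per weight. The following Python sketch is only a back-of-the-envelope estimate; the overhead factor for context and runtime buffers is an assumption, not a published figure.

def estimate_ram_gb(params_billions: float,
                    bits_per_weight: int = 4,
                    overhead_factor: float = 1.5) -> float:
    """Rough RAM estimate for a quantized model.

    overhead_factor is a loose allowance for the KV cache, activations,
    and runtime buffers -- an assumption, not an exact specification.
    """
    weight_bytes = params_billions * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead_factor / 1e9

# A 4-bit 7B model comes out at roughly 5 GB of weights plus overhead,
# in the same ballpark as the 6-8 GB guideline above.
print(f"{estimate_ram_gb(7):.1f} GB")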

For better performance, particularly lower latency and higher throughput, a GPU or Apple Silicon accelerator is highly desirable. GPUs with sufficient VRAM can handle larger models and longer context windows more efficiently than CPUs. On NVIDIA hardware, VRAM capacity often becomes the limiting factor, while on Apple Silicon (M-series chips), the unified memory architecture allows models to share memory efficiently between the CPU and GPU cores. In practice, this makes Apple Silicon well-suited for running moderately sized LLMs locally.

Ultimately, the hardware requirements scale with ambition: small models can run comfortably on laptops, while larger, more capable models require high-end GPUs or machines with substantial unified memory. Tools like Ollama help bridge this gap by optimizing model formats and runtimes, allowing developers to get meaningful results even on modest hardware while still scaling up when more powerful systems are available.

What Is Ollama?

Ollama is a platform that provides local deployment and management of large language models (LLMs) on your own machine. Unlike cloud-only services like OpenAI's ChatGPT, Ollama allows you to run models locally, which means your data doesn't need to leave your device—useful for privacy, speed, and offline use.

Ollama comes with two key components:

  • A desktop application that resembles ChatGPT, allowing you to chat and ask questions
  • A command-line application (CLI) that you can use in Terminal (macOS) or Command Prompt (Windows)

In the sections that follow, I'll guide you through using Ollama step by step. Before that, let's first explore how to discover the models that are available for use with Ollama.

Finding Available Models

The first step to using Ollama is to find the models that you want to run on your computer. To do that, you can go to Ollama.com (see Figure 1).

Figure 1: Viewing the list of models available on Ollama.com

You can search for the model that you want. For example, if you want a model that is relatively small (around 2GB), you can search for llama3.2. Figure 2 shows the model page for llama3.2.

Figure 2: Viewing the model page for llama3.2

On the page, you can see that there are a few variants of the model:

  • llama3.2:latest – This is the same as the model listed as llama3.2:3b. This is the variant that is installed if you do not specify a tag (e.g., 1b or 3b)
  • llama3.2:1b – This model has 1 billion parameters, and its size is 1.3GB
  • llama3.2:3b – This model has 3 billion parameters, and its size is 2GB

Models with more parameters are often more capable, but they are larger and more computationally demanding.

On the same model page, Ollama also shows how you can download and run the model:

ollama run llama3.2

The above command will download the llama3.2 model onto your computer and run it automatically after it has finished downloading. You will learn more about the other commands in the next section.

Besides llama3.2, here are some of my personal favorite models:

  • gpt-oss
  • qwen2.5
  • gemma3
  • deepseek-r1

Make sure to download the version of each model that matches your available memory.

Using the Ollama CLI

For developers interested in experimenting with Ollama, the Ollama CLI offers one of the easiest ways to interact with the platform. The CLI is accessed using the ollama command, as illustrated in the examples below. I'll demonstrate using Terminal on a Mac, though the same commands work on Windows as well.

To start, you can check if Ollama is installed by running:

$ ollama

You should see the list of available options you can use with the ollama app:

$ ollama
Usage:
  ollama [flags]
  ollama [command]
Available Commands:
  serve       Start ollama
  create      Create a model
  show        Show information for a model
  run         Run a model
  stop        Stop a running model
  pull        Pull a model from a registry
  push        Push a model to a registry
  signin      Sign in to ollama.com
  signout     Sign out from ollama.com
  list        List models
  ps          List running models
  cp          Copy a model
  rm          Remove a model
  help        Help about any command
Flags:
  -h, --help      help for ollama
  -v, --version   Show version information
Use "ollama [command] --help" for more information
about a command.

To download a model without running it, use the pull command:

$ ollama pull llama3.2
pulling manifest 
pulling dde5aa3fc5ff: 100% ▕████████████▏ 2.0 GB 
pulling 966de95ca8a6: 100% ▕████████████▏ 1.4 KB 
pulling fcc5a6bec9da: 100% ▕████████████▏ 7.7 KB 
pulling a70ff7e570d9: 100% ▕████████████▏ 6.0 KB 
pulling 56bb8bd477a5: 100% ▕████████████▏   96 B 
pulling 34bb5ab01051: 100% ▕████████████▏  561 B 
verifying sha256 digest 
writing manifest 
success 

The above command downloads the llama3.2 model onto the local computer. To download and run a model, use the run command:

$ ollama run llama3.2

You can now start chatting with the model:

>>> Tell me a joke
Why don't eggs tell jokes?
Because they'd crack each other up.
>>> Send a message (/? for help)

When you are done, type /bye to return to the Terminal.

To view the list of models you have downloaded onto your computer, use the list command:

$ ollama list         
NAME             ID            SIZE    MODIFIED      
llama3.2:latest  a80c4f17acd5  2.0 GB  2 minutes ago

To remove a model, use the rm command:

$ ollama rm llama3.2
deleted 'llama3.2'

By default, Ollama runs as a background service, listening on port 11434, which allows your applications to communicate with the models it hosts. If the Ollama backend service isn't running for any reason, you can start it manually using the serve command:

$ ollama serve

If you see an error message like the following, this means the Ollama backend is already running:

Error: listen tcp 127.0.0.1:11434: bind: address already in use
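You can also check the background service programmatically. The short Python sketch below assumes the default port 11434 and uses the /api/tags endpoint, which returns the models installed locally; any successful response means the service is up.

import requests

def ollama_is_running(host: str = "http://localhost:11434") -> bool:
    """Return True if the local Ollama service responds."""
    try:
        # /api/tags lists the locally installed models; a 200 response
        # means the background service is reachable.
        response = requests.get(f"{host}/api/tags", timeout=2)
        return response.status_code == 200
    except requests.RequestException:
        return False

if ollama_is_running():
    print("Ollama is running on port 11434")
else:
    print("Ollama is not reachable -- try running 'ollama serve'")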

Generating Text Using the Ollama API

One way to test that Ollama is working correctly is to use the curl utility. The following command sends a prompt “Tell me a joke” to the llama3.2 model:

$ curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Tell me a joke",
  "stream": false
}'

The result from llama3.2 looks like this:

{
    "model": "llama3.2",
    "created_at": "2026-01-16T03:49:16.419632Z",
    "response": "A man walked into a library and asked the librarian, \"Do you have any books on Pavlov's dogs and Schrödinger's cat?\" The librarian replied, \"It rings a bell, but I'm not sure if it's here or not.\"",
    "done": true,
    "done_reason": "stop",
    "context": [
        128006,
        9125,
        ...,
        539,
        1210
    ],
    "total_duration": 1630109209,
    "load_duration": 84332709,
    "prompt_eval_count": 29,
    "prompt_eval_duration": 98273500,
    "eval_count": 54,
    "eval_duration": 1194191879
}
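You can issue the same request from Python instead of curl. Here is a minimal sketch using the requests library against the /api/generate endpoint shown above; the model name, prompt, and response fields match the curl example.

import requests

# Non-streaming call to the local Ollama API (the same request as the
# curl example above, expressed in Python).
payload = {
    "model": "llama3.2",
    "prompt": "Tell me a joke",
    "stream": False,
}

response = requests.post("http://localhost:11434/api/generate",
                         json=payload, timeout=120)
response.raise_for_status()

data = response.json()
print(data["response"])                       # the generated text
print(data["eval_count"], "tokens generated")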

If you set the stream parameter to true (or simply leave it out altogether, since streaming is the default), the response is returned as a stream of JSON objects like this:

{
    "model": "llama3.2",
    "created_at": "2026-01-16T03:52:53.410609Z",
    "response": "Why",
    "done": false
}
{
    "model": "llama3.2",
    "created_at": "2026-01-16T03:52:53.439253Z",
    "response": " don",
    "done": false
}
{
    "model": "llama3.2",
    "created_at": "2026-01-16T03:52:53.469857Z",
    "response": "'t",
    "done": false
}
...
...
{
    "model": "llama3.2",
    "created_at": "2026-01-16T03:52:53.836842Z",
    "response": "",
    "done": true,
    "done_reason": "stop",
    "context": [
        128006,
        9125,
        ...,
        1023,
        709,
        0
    ],
    "total_duration": 777216208,
    "load_duration": 85119000,
    "prompt_eval_count": 29,
    "prompt_eval_duration": 258352958,
    "eval_count": 16,
    "eval_duration": 345037499
}

This streaming format allows your application to receive partial tokens as they are generated, enabling real-time display of the model's output rather than waiting for the full response to complete.
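To consume the stream from Python, read the response line by line; as the output above shows, each line is a small JSON object. The sketch below assumes this newline-delimited JSON format and prints each partial token as it arrives.

import json
import requests

payload = {
    "model": "llama3.2",
    "prompt": "Tell me a joke",
    "stream": True,
}

# stream=True tells requests not to buffer the entire response.
with requests.post("http://localhost:11434/api/generate",
                   json=payload, stream=True, timeout=120) as response:
    response.raise_for_status()
    for line in response.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)
        # Print each partial token as soon as it is generated.
        print(chunk.get("response", ""), end="", flush=True)
        if chunk.get("done"):
            print()  # final newline once the model signals completion
            break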

Running Models on the Cloud

What happens if the model you want to run in Ollama is too large to fit within your computer's available memory? For example, OpenAI has released the open-weight gpt-oss models, which are designed for powerful reasoning, agentic tasks, and versatile developer use cases. They come in two main variants (see Figure 3):

  • gpt-oss:20b – 14GB in size
  • gpt-oss:120b – 65GB in size

Figure 3: The model page for gpt-oss released by OpenAI

Most people with 16-24GB of RAM can run the 20b model, but the 120b variant is beyond the reach of most users. Note that there are two more variants:

  • gpt-oss:20b-cloud – this is the 20b model running on the cloud
  • gpt-oss:120b-cloud – this is the 120b model running on the cloud

These two variants let you run the models on Ollama.com's servers rather than on your local machine.

Keep in mind that running an Ollama model in the cloud means your data is sent to a third party, which defeats the purpose of running models locally in the first place.

To run an Ollama model on the cloud, follow the steps outlined here. First, choose a model. For this example, let's choose gpt-oss:120b-cloud.

$ ollama run gpt-oss:120b-cloud
Connecting to 'gpt-oss:120b' on 'ollama.com' ⚡
>>>

If you ask a question, you will likely see an error:

Error: 401 Unauthorized

This happens because your computer needs to be authorized by Ollama's cloud before it can run the model. To fix this, use the signin command:

$ ollama signin

This generates a URL with a public key to register your device with Ollama's cloud. You will now see the following message:

You need to be signed in to Ollama to run Cloud models.
To sign in, navigate to:
    https://ollama.com/connect?name=
    Wei-Mengs-MacBook-Air.local&key=c3NoL...Unk

Copy the link and open it in your web browser. Sign in to Ollama.com or create an account if needed. Once authorized, your machine will be able to run cloud models, and you should see a confirmation screen as shown in Figure 4.

Figure 4: Connecting your computer to Ollama's server

Click the Connect button. This step essentially:

  • Sends your public SSH key (which Ollama generated for you when you first launched it) to Ollama's servers so they can authenticate your device.
  • Allows the Ollama web portal and other clients to securely communicate with your local Ollama installation without needing a password.

Once connected, you can now run the gpt-oss model on the cloud:

$ ollama run gpt-oss:120b-cloud
Connecting to 'gpt-oss:120b' on 'ollama.com' ⚡
>>> hello
Thinking...
We need to respond. Likely greet back.
...done thinking.
Hello! How can I help you today?
>>>

Not all models are supported on Ollama's cloud. Only models with the -cloud suffix can be run on the cloud; all others are local-only. In general, smaller models, such as 3B-parameter versions, are designed to run locally on your machine, allowing for full control over data and privacy. Larger or specially optimized models may offer cloud variants to reduce hardware requirements and improve performance, but these come with the trade-off of sending data over the internet.

Running models on Ollama's cloud means your data is no longer fully private, but it allows you to access much more powerful models than you could run locally.

Using the Ollama Desktop App

For non-technical users, the Ollama desktop app provides a much easier way to interact with Ollama. In the latest versions, the desktop app can be accessed by clicking the Ollama icon at the top of the screen on macOS (see Figure 5), or from the system tray on Windows.

Figure 5: Launching the Ollama desktop app

Figure 6 shows the Ollama desktop app. You can see a list of commonly used models available to you. If a model has not yet been downloaded, it will automatically download the first time you start a conversation with it.

Figure 6: Using the Ollama desktop app

Figure 7 shows a simple conversation with the gpt-oss model using the Ollama desktop app.

Figure 7: The gpt-oss:120b-cloud model on the Ollama desktop app

In the middle of the screen, there is a drop-down menu where you can select the model's response length:

  • Short – brief, concise answers
  • Medium – balanced detail (default)
  • Long – more detailed, verbose responses

There is also a + icon, which you can use to upload images to the model (if the model supports image input).

Using Hugging Face Models in Ollama

Sometimes, you may want to run a specific model from Hugging Face in Ollama—especially if the model you need isn't available on Ollama.com. Fortunately, Ollama supports running Hugging Face models that are in the GGUF format. There are two ways to run Hugging Face models in Ollama. Let's go through each method step by step.

Method 1 — Run a Model Directly from Hugging Face

The first method is to find the model you want directly on Hugging Face's website (https://huggingface.co). Once there, apply the “GGUF” filter to see a list of models available in the GGUF format (see Figure 8).

Figure 8: Finding a model in GGUF format on the Hugging Face website

For illustration, let's use the unsloth/QwQ-32B-GGUF model located at https://huggingface.co/unsloth/QwQ-32B-GGUF (see Figure 9).

Figure 9: Viewing the details for the unsloth/QwQ-32B-GGUF model

Since this model is in GGUF format, you can directly pull and run it using the ollama app in Terminal:

$ ollama run huggingface.co/unsloth/QwQ-32B-GGUF

When downloading a model from Hugging Face in Ollama, remember to prefix the model name with huggingface.co/.

If you encounter an error like the following when pulling a model:

Error: pull model manifest: 400: 
{"error":"Repository is not GGUF or is not 
compatible with llama.cpp"}

This error means you are attempting to pull a model from a repository that is not in a format supported by Ollama (that is, not GGUF or llama.cpp-compatible), for example:

  • Most Hugging Face PyTorch .bin models
  • Sharded Hugging Face models in .safetensors format, which are not directly compatible with llama.cpp/Ollama unless converted

When Ollama has successfully downloaded the Hugging Face model, you should now be able to start chatting away (see Figure 10).

Figure 10: Chatting with a Hugging Face model in Ollama

Some models on Hugging Face are available in quantized versions. For instance, the unsloth/QwQ-32B-GGUF model offers multiple quantized variants. You can explore them on the model's page (see Figure 11).

Figure 11: Exploring the quantized variants of the unsloth/QwQ-32B-GGUF model on Hugging Face

You can download and run a quantized model by appending the quantized variant name to the model name, as shown below:

$ ollama run \
  huggingface.co/unsloth/QwQ-32B-GGUF:Q4_K_M

Figure 12 shows Ollama running the quantized variant of the unsloth/QwQ-32B-GGUF model.

Figure 12: Running the quantized variant of the unsloth/QwQ-32B-GGUF model on Ollama

Running a quantized model reduces memory usage, enabling it to run efficiently on CPUs and low-VRAM GPUs while also improving inference speed. This makes quantized models ideal for real-time applications, edge devices, and deployment on resource-constrained hardware. They also consume less power, making them more energy-efficient and cost-effective for large-scale or battery-powered applications. By reducing computational requirements, quantization allows users to work with large models without needing high-end hardware, making AI more accessible and practical for a wider range of use cases.

Method 2 — Importing a Model with a Modelfile

The second method uses a Modelfile to define and import a Hugging Face model into Ollama. A Modelfile lets you specify the model's source, system settings, and other configurations in a structured, easy-to-manage format.

To start, download the model from Hugging Face in GGUF format. For this example, visit https://huggingface.co/unsloth/QwQ-32B-GGUF/tree/main and locate the file under the Files tab. For instance, download the QwQ-32B-Q4_K_M.gguf file by clicking the download icon (see Figure 13).

Figure 13: Downloading the QwQ-32B-Q4_K_M.gguf file directly from Hugging Face

The advantage of this second method is that even if a GGUF version of the model you want isn't available, you can still download the model from Hugging Face using the standard Python approach and then convert it to GGUF using llama.cpp. Refer to the section “Converting a Hugging Face Model to GGUF Format” for detailed instructions on how to do this.

Once the GGUF file is downloaded, create a file named Modelfile (no extension) and populate it as follows:

FROM ./downloads/QwQ-32B-Q4_K_M.gguf
SYSTEM "You are a helpful AI assistant."
PARAMETER temperature 0.7

You can now use the following command to create a new model in Ollama based on the configuration defined in the Modelfile:

$ ollama create my-model -f Modelfile

To confirm that the model is indeed created, you can use the list command to show all the models that you have on your local computer:

$ ollama list

You will now see the model named my-model:latest:

$ ollama list
NAME                                         ID            SIZE   MODIFIED
my-model:latest                              f32b0c97ff4b  15 GB  7 seconds ago
huggingface.co/unsloth/QwQ-32B-GGUF:Q4_K_M   bcde3e2ab1a7  19 GB  32 minutes ago
huggingface.co/unsloth/QwQ-32B-GGUF:latest   bcde3e2ab1a7  19 GB  50 minutes ago
...

Use the run command to run the model:

$ ollama run my-model
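Because the model created from the Modelfile behaves like any other local model, you can also call it through the Ollama API covered earlier. Here is a minimal sketch (the prompt is just an illustration):

import requests

payload = {
    "model": "my-model",   # the model created from the Modelfile above
    "prompt": "Explain what a Modelfile is in one sentence.",
    "stream": False,
}

response = requests.post("http://localhost:11434/api/generate",
                         json=payload, timeout=300)
response.raise_for_status()
print(response.json()["response"])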

Converting a Hugging Face Model to GGUF Format

Sometimes, the model you want to run in Ollama isn't available in GGUF format on Hugging Face. In such cases, you can convert a model—like meta-llama/Llama-3.2-3B-Instruct—to GGUF. To do this, you first need to install llama.cpp, which can be found at https://github.com/ggml-org/llama.cpp.

Downloading the Hugging Face Model Using Python

First, download the model using the transformers library in Python (see Listing 1).

Listing 1: Code to download a model from Hugging Face

# Download model directly
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.2-3B-Instruct"

# Download the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load the model
model = AutoModelForCausalLM.from_pretrained(model_id)

When you run the above code snippet, the meta-llama/Llama-3.2-3B-Instruct model and tokenizer are downloaded onto your computer. By default, Hugging Face saves models in ~/.cache/huggingface/hub/ on macOS and C:\Users\<username>\.cache\huggingface\hub\ on Windows.
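If you need the exact snapshot path programmatically (you will use it in the conversion command in the next section), the huggingface_hub library can return it. This is a minimal sketch; it assumes huggingface_hub is installed and that you have access to the gated meta-llama repository.

from huggingface_hub import snapshot_download

# Returns the local snapshot directory inside ~/.cache/huggingface/hub.
# Files already downloaded by transformers are reused, not re-downloaded.
model_path = snapshot_download("meta-llama/Llama-3.2-3B-Instruct")
print(model_path)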

Converting the Model

Assuming llama.cpp is installed in your home directory, let's use the convert_hf_to_gguf.py script located in the llama.cpp folder to convert the Hugging Face model to GGUF:

$ cd ~/llama.cpp
$ python convert_hf_to_gguf.py \
  ~/.cache/huggingface/hub/models--meta-llama--Llama-3.2-3B-Instruct/snapshots/0cb88a4f764b7a12671c53f0838cd831a0843b95 \
  --outfile /Volumes/SSD2/Llama-3.2-3B-Instruct.gguf

The above command runs the convert_hf_to_gguf.py script to convert the meta-llama/Llama-3.2-3B-Instruct model to GGUF format. Figure 14 shows the breakdown of the command.

Figure 14: The breakdown of the command to convert a Hugging Face model to GGUF

At the end of the conversion, you will get the model in GGUF format: Llama-3.2-3B-Instruct.gguf. You can now use this GGUF model in applications that support it, such as Ollama.

Where Are the Models Saved?

Now that we've explored the different models you can use with Ollama, the next question is often: where are these models stored locally? In the sections that follow, I'll explain how Ollama organizes its models and show you how to change the default location where they are saved.

Examining the Ollama Models Structure

First, let's look at how Ollama organizes its models. When you download a model using the pull command (for example, ollama pull llama3.2), it is stored by default in the following locations:

  • macOS: ~/.ollama
  • Windows: C:\Users\<username>\.ollama\

Each downloaded model is split into multiple components to manage its files efficiently. The example below shows the contents of the ~/.ollama folder after downloading four different models:

  • deepseek-r1:1.5b
  • deepseek-r1:7b
  • llama3.2:latest
  • mxbai-embed-large:latest

Let's consider a particular model — mxbai-embed-large:

~/.ollama
    |__models
         |__blobs
               |__sha256-6e4c3...4a4e4
               |__sha256-34bb5...e242b
               |__sha256-38bad...3cdda
               |__sha256-819c2...39c3d
               |__ ....
               |__ ....
         |__manifests
               |__registry.ollama.ai
                      |__library
                            |__deepseek-r1
                                 |__1.5b
                                 |__7b
                            |__llama3.2
                                 |__latest
                            |__mxbai-embed-large
                                 |__latest

In the above, mxbai-embed-large is the model name and latest is the model version. Each model name is located under the models/manifests/registry.ollama.ai/library folder. If you examine the content of the latest file for the mxbai-embed-large model, you will see the content shown in Listing 2.

Listing 2: The content of the “latest” file

{
    "schemaVersion": 2,
    "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
    "config": {
        "mediaType": "application/vnd.docker.container.image.v1+json",
        "digest": "sha256:38bad...3cdda",
        "size": 408
    },
    "layers": [
        {
            "mediaType": "application/vnd.ollama.image.model",
            "digest": "sha256:819c2...39c3d",
            "size": 669603712
        },
        {
            "mediaType": "application/vnd.ollama.image.license",
            "digest": "sha256:c71d2...d0ab4",
            "size": 11357
        },
        {
            "mediaType": "application/vnd.ollama.image.params",
            "digest": "sha256:b8374...5d089",
            "size": 16
        }
    ]
}

Basically, Ollama employs the OCI (Open Container Initiative) image specification format, which Docker also uses, to distribute and manage its models. In the above example, the latest file is a manifest file describing an OCI image, complete with layers, digests (SHA256 hashes), and media types.

In particular, the layers key contains the model weights (application/vnd.ollama.image.model), licensing info (application/vnd.ollama.image.license), and parameters (application/vnd.ollama.image.params). These types are specified by the mediaType key. The digest key contains a unique identifier for each layer, ensuring integrity and reproducibility. Each digest value maps to a file in the blobs folder. For example (see Listing 3), the digest in the config key maps to the sha256-38badd946f91096f47f2f84de521ca1ef8ba233625c312163d0ad9e9d253cdda file located in the blobs folder:

Listing 3: Examining the value of the digest key

{
    "config": {
        "mediaType": "application/vnd.docker.container.image.v1+json",
        "digest": "sha256:38bad...3cdda",
        "size": 408
    }
}
~/.ollama
    |__models
         |__blobs
               |__sha256-6e4c3...4a4e4
               |__sha256-34bb5...e242b
               |__sha256-38bad...3cdda
               |__sha256-819c2...39c3d
               |__ ....

The mxbai-embed-large model has a total of four files (each specified by a digest key within the config and layers keys) located in the blobs folder:

~/.ollama
    |__models
         |__blobs
               |__sha256-6e4c3...4a4e4
               |__sha256-34bb5...e242b
               |__sha256-38bad...3cdda
                |__sha256-819c2...39c3d
                |__sha256-c71d2...d0ab4
                |__sha256-b8374...5d089
               |__ ....
               |__ ....
         |__manifests
               |__registry.ollama.ai
                      |__library
                            |__deepseek-r1
                                 |__1.5b
                                 |__7b
                            |__llama3.2
                                 |__latest
                            |__mxbai-embed-large
                                 |__latest

Once you understand the models folder structure, it is now easy to move specific models from the original directory to a new one.

I created a Python utility that helps users migrate local Ollama models from one folder to another. This is especially useful if you want to transfer models downloaded on another computer to your current system without having to redownload them. You can access the utility here: https://github.com/weimenglee/MigrateOllamaModels.
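To see the manifest-to-blob mapping for yourself, a few lines of Python are enough. The sketch below is only an illustration of the idea behind such a migration (it is not the utility linked above): it reads a model's manifest and prints the blob files that make up the model.

import json
from pathlib import Path

OLLAMA_MODELS_DIR = Path.home() / ".ollama" / "models"

def blobs_for(model: str, tag: str = "latest") -> list[Path]:
    """List the blob files referenced by a model's manifest."""
    manifest_path = (OLLAMA_MODELS_DIR / "manifests" / "registry.ollama.ai"
                     / "library" / model / tag)
    manifest = json.loads(manifest_path.read_text())

    digests = [manifest["config"]["digest"]]
    digests += [layer["digest"] for layer in manifest["layers"]]

    # Blob filenames use "sha256-<hash>" while the manifest stores
    # "sha256:<hash>", so swap the separator.
    return [OLLAMA_MODELS_DIR / "blobs" / d.replace(":", "-") for d in digests]

for blob in blobs_for("mxbai-embed-large"):
    print(blob, "exists" if blob.exists() else "MISSING")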

Changing the Model Locations

As discussed earlier, Ollama stores downloaded models in a default directory. However, there are times when you may want to save models elsewhere—such as on an external drive—to free up space on your main system.

There are a couple of ways to change the default model directory. The simplest method is through the Ollama desktop app: right-click the Ollama icon and select Settings…. This will open the Settings window (see Figure 15), where you can specify a new directory for storing your models.

Figure 15: Viewing the Ollama Settings screen

The second method is to change the model directory via the Terminal (macOS) or Command Prompt (Windows). On macOS, you can do this by editing your .zshrc file and adding the following line, replacing the path with the directory where you want to store your models:

export OLLAMA_MODELS="/Volumes/SSD/ollama"

After making the change on macOS, restart Ollama, and any new models will be saved in the new directory.

On Windows, the process is slightly different. First, uninstall Ollama. Then, create a new environment variable named OLLAMA_MODELS (see Figure 16) and set its value to the directory where you want your models to be stored. Once set, reinstall Ollama, and it will use this directory for all downloaded models.

Figure 16: Adding a new environment variable named OLLAMA_MODELS in Windows

With this setup, all Ollama models you download will be in the directory you specified in the environment variable.

Using Prompts to Customize the Behavior of a Model

With Ollama, you can run a variety of LLMs directly on your computer — simply load a model and start asking questions. Since everything runs locally, your queries never leave your machine, keeping your data and privacy secure. For example, using the llama3.2 model, you could summarize text, translate content, generate code samples, or answer questions based on a specific block of text.

But what if you want to customize a model to behave in a particular way? And even better, what if you could save those customizations for future use? That's where the Modelfile comes in. This simple yet powerful feature lets you shape a model's behavior, tailoring it to a specific role — like a snappy code tutor or a vivid story generator — and save that configuration. Next time you need that specialized setup, you can just load your custom model without repeating the tweaks. It's like creating a reusable, task-specific tool for any application.

A Modelfile in Ollama is a configuration file that lets you customize how a language model behaves. You can tweak its personality, set its knowledge base, adjust its tone, or even fine-tune its responses for specific tasks.

Let's illustrate this using a Modelfile to customize an LLM to consistently respond with sarcasm. Here's how you can set it up. First, create a file and name it, say, mysarcasticmodelfile. Populate the file with the following statements:

FROM llama3.2
SYSTEM "You are a sarcastic IT assistant who reluctantly helps users with tech problems. Use dry humor and witty remarks, but always provide accurate advice."

The first line specifies the base model that you want to use — in this case, we want to use the llama3.2 model. The second line is a system instruction, which defines the assistant's personality and behavior. It tells the model how to respond to user inputs.

Once the file is saved, create a new customized model named my_sarcastic_model from the Modelfile named mysarcasticmodelfile (as defined above) using the following command:

$ ollama create my_sarcastic_model -f mysarcasticmodelfile

gathering model components
using existing layer sha256:dde5a...ccdff
using existing layer sha256:966de...66396
using existing layer sha256:fcc5a...a265d
using existing layer sha256:a70ff...9e0cd
creating new layer sha256:00e59...b1514
using existing layer sha256:56bb8...d4dcb

writing manifest
success

From the command above, you can see that Ollama is using the existing layers from the llama3.2 base model and adding a new layer to it. This will create a customized model based on the llama3.2 base model. You can verify this by using the list command:

$ ollama list
NAME                       ID            SIZE    MODIFIED
my_sarcastic_model:latest  3805d0560172  2.0 GB  30 seconds ago
llama3.2:latest            a80c4f17acd5  2.0 GB  2 days ago
...

You can now use your newly created model with the run command. Here's an example where I ask a question, and the model responds with sarcasm:

$ ollama run my_sarcastic_model
>>> How do I connect to WiFi on my Mac?
Another brilliant question that I'm sure has been weighing heavily on your mind for hours.

To connect to WiFi on your Mac, follow these super-complicated steps:

1. Click the Wi-Fi icon in the top right corner of the screen. You know, the little blue circle with a white dot? Yeah, that one.

2. If you don't see it, click on the Apple logo at the top left corner and select "System Preferences" from the drop-down menu. Or, you know, just press Command + Spacebar to open Spotlight and type "Wi-Fi." Easy peasy.

3. In the Wi-Fi preferences window, select your network from the list of available networks. If you're lucky, it should be in the top list; if not, you'll have to scroll down all the way. Try not to get carpal tunnel just yet.

4. If prompted, enter your WiFi password using the keyboard. Make sure you've got your password handy, or you'll be stuck here forever.

And that's it! You should now be connected to the internet. Pat yourself on the back; you did it!

If you're having trouble connecting, don't worry – I'm sure it's not because of your own ineptitude (just kidding... maybe it is). Seriously though, if you need further assistance, feel free to ask!
>>>

Summary

In this comprehensive guide, I walked you through the process of running large language models on your own hardware using Ollama, an open-source tool that simplifies local LLM deployment. I explained why you might want to run models locally rather than relying on cloud services, followed by an overview of the hardware specifications you'll need.

I then provided detailed instructions on using Ollama through multiple interfaces: the command-line interface for direct model interaction, the API for programmatic access, and the desktop application for a more user-friendly experience. I covered essential tasks like finding and downloading pre-configured models from Ollama's library, generating text responses, and running cloud-hosted models when a model is too large for your local hardware.

A significant portion of the guide focused on integrating models from Hugging Face. I demonstrated two different approaches for importing custom models and explained the process of converting models to GGUF format when necessary. I also took you inside Ollama's internal architecture, showing you exactly where models are stored on your system and how you can modify these storage locations to suit your needs. Finally, I explored customization through the Modelfile and system prompts, helping you tailor model behavior for your specific use cases. Whether you're deploying LLMs locally or in the cloud, this guide gives you the practical knowledge you need to maintain control over your AI infrastructure.