Module 2 · Lesson 2.3

Running an LLM on your own machine.

By the end of this lesson, you can explain what a local LLM runner does in one sentence, decode the family:size-purpose-quantization model-name format, and — if your laptop has the hardware — install a runner, pull a model, hold a short conversation with the model from your terminal, and read the three numbers (latency, memory, tokens per second) that decide whether your setup is usable for real work. If your laptop doesn't have the hardware, this lesson is read-only: read for the concept (it matters for Module 9's privacy work), skip the install, and pick up at Lesson 2.4.

Stage 1 of 3

Read & Understand

5 concept blocks

What a local LLM runner actually is CORE

Running a model on your own machine sounds complicated. It is not — mostly because someone else has already done the hard parts. The hard parts are: loading the model's billions of numerical weights from disk into memory, wiring them up to run on whatever hardware you have (CPU, Apple Silicon, NVIDIA GPU, AMD GPU), and exposing a simple way for you to send it text and get text back.

A local LLM runner is the class of tool that handles all of that. You install it once; it gives you two things. First, a command that downloads model weights — about five gigabytes for the quantized 8 B (8-billion-parameter) models this lesson defaults to — and stores them on your disk. Second, an interface — a terminal prompt, a chat window, or an API on localhost — that lets you send prompts to the model and receive completions. The details differ by product, but every runner in this category does the same job: bridge “model weights” and “your prompt.”
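To make “an API on localhost” concrete: Ollama, the runner you'll install in Stage 2, listens at http://localhost:11434 by default. Here is a minimal Python sketch (assuming the model below has already been pulled) that sends one prompt and prints the completion:

    # Minimal sketch: one prompt to a locally running Ollama instance.
    # Assumes Ollama is running and llama3.1:8b-instruct-q4_K_M is pulled.
    import json
    import urllib.request

    payload = json.dumps({
        "model": "llama3.1:8b-instruct-q4_K_M",
        "prompt": "In one sentence, what does a local LLM runner do?",
        "stream": False,  # return the whole completion at once
    }).encode("utf-8")

    req = urllib.request.Request(
        "http://localhost:11434/api/generate",  # Ollama's default local endpoint
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.load(resp)["response"])  # generated entirely on your machine

You don't need to run this now; it's here so “API on localhost” is a picture, not a phrase. This endpoint is also what you'll script against from Cowork and Claude Code by Module 6.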

Understanding this helps you weather any tool change. Specific products come and go. The job the tool is doing is durable.

Why you should run one even if you prefer cloud CORE

Even if Lesson 2.1 left you convinced that cloud is your default, you should still run a local model. Three reasons.

Grounding. Running a model locally makes it concrete. You see the file. You watch memory rise when you start inference. You feel the difference between a 7 B and a 13 B model on your machine. Students who skip this step end up with a vague sense of “the model is somewhere” that will bite them later when they are designing real systems.

Directing practice. Local models are smaller, and smaller models are more exacting to direct. A weaker model fails in more visible ways than a frontier cloud model. Directing a local 8 B model teaches you — fast — the difference between a prompt that works and a prompt that does not. That feedback transfers directly to making you better at cloud directing.

The fallback. There will be a day — a power outage, a trip, a provider incident — when cloud is not there. If you can still reach a local model, you can still work.

You do not need to like local models more than cloud models. You do need one installed and working.

Hardware vocabulary you need before the install CORE

Five terms appear all over this lesson and the recipes. Read them once now and the rest of the lesson will be much easier to follow.

  • CPU — the main processor in your laptop. Every laptop has one. Runs everything. Not specialized for AI math, so model inference on a CPU alone is slower than on a GPU.
  • GPU (graphics processing unit) — a chip designed for the kind of parallel math that AI models love. Originally built for video games, now the default hardware for AI work. Nearly every laptop has at least a basic integrated GPU; what matters for local AI is whether you have a discrete GPU or Apple Silicon (the next two terms).
  • Discrete GPU — a separate, dedicated GPU chip. Common on gaming laptops (NVIDIA RTX, AMD Radeon RX). Uncommon on thin-and-light or budget laptops. If your laptop runs hot under load and has a fan, it likely has one; if it is silent and thin and cheap, it likely does not.
  • Apple Silicon — Apple's M1, M2, M3, and M4 chips, found in Macs from late 2020 onward. Designed so the CPU and GPU share the same pool of RAM, which makes Apple Silicon Macs surprisingly capable for local AI even without a “discrete” GPU in the traditional sense. If you have a Mac from the last four years, you almost certainly have Apple Silicon.
  • CPU-only — running the model on your CPU because no compatible GPU is available. Still works for the smaller models in this lesson; just slower (plan for 5–15 tokens per second on a quantized 7–8 B model rather than 30+ on a GPU).

And one more concept that determines whether a model fits on your machine at all:

  • Quantization — compressing a model's numerical weights to lower precision (for example, 4-bit instead of 16-bit) so the model fits in less RAM and runs faster. Costs roughly 3–5% of the model's quality and saves roughly 3× the memory. The recipes below default to a q4_K_M quantization, which is the standard “small but good” version.
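A quick back-of-envelope shows why quantization is the difference between “fits” and “doesn't” for the 8 B model in the recipes below (a sketch; exact overhead varies by format):

    # Back-of-envelope memory math for an 8-billion-parameter model.
    params = 8e9
    fp16_gb = params * 2 / 1e9    # 16-bit = 2 bytes per weight -> ~16 GB
    q4_gb = params * 0.5 / 1e9    # 4-bit = half a byte per weight -> ~4 GB raw
    print(f"fp16: {fp16_gb:.0f} GB, q4: {q4_gb:.0f} GB")
    # Real q4_K_M files land nearer 5 GB once metadata and the layers kept
    # at higher precision are counted, which matches the ~5 GB recipe download.

Sixteen gigabytes of weights will not fit alongside your OS on a 16 GB laptop; five will.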

If you are on a Chromebook, an iPad, or a laptop with less than 8 GB of RAM, the local-model recipes in this lesson will not run reliably. Skip the install, do the activity in paper mode, and treat Lesson 2.4 (cloud accounts) as your real workstation. The rest of the course works on cloud alone.

Decoding the model-name format CORE

You're about to type a command like ollama pull llama3.1:8b-instruct-q4_K_M. That string isn't random. Here's what each piece means:

llama3.1 : 8b - instruct - q4_K_M
   |       |       |          |
   |       |       |          └─ Quantization: 4-bit, "K_M" variant
   |       |       └──────────── Purpose: trained to follow instructions
   |       └──────────────────── Size: 8 billion parameters
   └──────────────────────────── Family: Meta's Llama 3.1
  • Family (llama3.1, mistral, qwen2, etc.) — who made the model and which generation.
  • Size (8b, 7b, 3b, 13b) — billions of parameters. More is more capable; more also requires more RAM.
  • Purpose (instruct, chat, or absent) — how the model was fine-tuned. instruct and chat models are trained to answer questions and follow directions. A model with no purpose suffix is a “base” model that just continues text — usually not what you want.
  • Quantization (q4_K_M, q5_K_M, q8_0, etc.) — the compression flavor. Higher numbers = bigger file, slightly better quality. q4_K_M is the standard balance.

You don't have to memorize every variant. You do need to know that the name must match a tag that actually exists in the registry — a misspelled or guessed name fails outright, and the registry won't auto-correct it.
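To make the convention concrete, here is a hypothetical helper written just for this lesson (registries don't guarantee this grammar) that pulls a tag apart:

    # Hypothetical helper: split a model tag into its conventional parts.
    # Illustration only; real registry tags aren't guaranteed to follow it.
    def decode_model_name(tag: str) -> dict:
        family, _, rest = tag.partition(":")  # "llama3.1" | "8b-instruct-q4_K_M"
        size, purpose, quant = (rest.split("-", 2) + [None, None])[:3]
        return {"family": family, "size": size, "purpose": purpose, "quant": quant}

    print(decode_model_name("llama3.1:8b-instruct-q4_K_M"))
    # {'family': 'llama3.1', 'size': '8b', 'purpose': 'instruct', 'quant': 'q4_K_M'}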

Latency, memory, and tokens per second CORE

Three numbers matter when you run a local model. You do not need to measure them with a stopwatch; your runner prints them.

Latency is the time between when you hit Enter and when the first token appears. For a local 7–8 B model on a modern laptop, this is usually under two seconds on Apple Silicon or a discrete GPU, and three to eight seconds on CPU-only.

Memory use is how much of your RAM the model is consuming while loaded. For a quantized 8 B model, expect roughly 5–7 GB. If your laptop has 8 GB total, you are running tight; close other apps. If it has 16 GB or more, you have room for real multitasking while the model runs.

Tokens per second (tok/s) is how fast the model generates output once it has started. To make the numbers concrete: a fast typist produces about 2–3 tokens per second; careful reading is roughly 5–10 tokens per second; a comfortable AI chat experience feels like 20–40 tok/s. So:

  • 30+ tok/s (Apple Silicon or discrete GPU): faster than you can read. Comfortable.
  • 8–15 tok/s (CPU-only on a recent laptop): around reading speed. Sluggish on long outputs but usable for short tasks.
  • Under 5 tok/s: well below reading speed; a one-paragraph answer takes a minute or more. Too slow for real iterative work — try a smaller model or move that workload to cloud.

A local setup is usable for real work if your typical task finishes in under a minute. If a one-paragraph summary takes four minutes, the tool is technically installed but practically unusable. Knowing this in advance prevents wasted effort.
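Here is how the one-minute threshold maps onto throughput (a sketch; the 300-token answer size is an assumed stand-in for one solid paragraph):

    # How long does a ~300-token answer take at each throughput tier?
    answer_tokens = 300  # assumed size of a one-paragraph response

    for tok_per_sec in (30, 12, 3):
        seconds = answer_tokens / tok_per_sec
        verdict = "usable" if seconds <= 60 else "below threshold"
        print(f"{tok_per_sec:>2} tok/s -> {seconds:5.0f} s  ({verdict})")
    # 30 tok/s ->    10 s  (usable)
    # 12 tok/s ->    25 s  (usable)
    #  3 tok/s ->   100 s  (below threshold)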


Stage 2 of 3

Try & Build

2 recipes + activity

Install Ollama and pull your first model RECIPE

Tool: Ollama
Version tested: 0.1.48
Last verified: 2026-04-17
Next review: 2026-07-17
Supported OSes: macOS, Windows (Linux callout below)

Before you begin

  • You have completed the Lesson 2.1 capability check and know your RAM, chip, and free disk space. If those are still in your head only, write them into my-first-loop.md under a “Hardware” heading now — you'll need them in step 5 below.
  • You have at least 20 GB free disk space.
  • You are on a stable internet connection for the download (model file is ~5 GB).

macOS steps

  1. Open your browser and go to https://ollama.com/download. Click the macOS download. A .dmg file will download.
  2. Open the .dmg. Drag Ollama.app into your Applications folder. Launch it. A small llama icon appears in your menu bar — that is the background service.
  3. Open the Terminal app (Cmd+Space for Spotlight, type “Terminal”, press Enter). Or open the integrated terminal inside VS Code from Lesson 2.2.
  4. Verify the install: ollama --version. You should see a version number. If you get “command not found,” restart your Terminal window and try again.
  5. Pull the model your hardware supports:
    • 16 GB RAM or more: ollama pull llama3.1:8b-instruct-q4_K_M (this command says: download Meta's Llama 3.1, the 8-billion-parameter instruction-tuned variant, quantized to 4 bits — about 5 GB)
    • 8 GB RAM: ollama pull mistral:7b-instruct-q4_K_M (smaller footprint, still capable)
    • Less than 8 GB RAM: ollama pull llama3.2:3b-instruct-q4_K_M (smallest recommended; not all tasks will work well)
    • Download takes 5–10 minutes on a typical home connection. Progress bar appears.
  6. Talk to the model. Type ollama run llama3.1:8b-instruct-q4_K_M into the terminal and press Enter (substitute the model name you pulled if you chose a different one). You will see a >>> prompt — that means the model is ready and waiting. Type a question and press Enter again. The model will respond.
  7. To exit the model session, type /bye and press Enter.

Windows steps

  1. Open your browser and go to https://ollama.com/download. Click the Windows download. An .exe installer will download.
  2. Run the installer. Accept defaults. Ollama will install and start in the background (you'll see a llama icon in the system tray near the clock).
  3. Open Windows Terminal (Start menu → “Terminal”) or VS Code's integrated terminal from Lesson 2.2.
  4. Verify the install: ollama --version. You should see a version number.
  5. Pull the model — same commands as Mac. Pick the model that fits your RAM.
  6. Talk to the model: ollama run <model-name>. Type a question. Get a response.
  7. Exit with /bye.

Verify it is working

Start a fresh model session — type ollama run llama3.1:8b-instruct-q4_K_M (or whichever model you pulled) in the terminal and press Enter. When you see the >>> prompt, type the following question, press Enter, and watch what comes back:

“In one paragraph, explain what a local LLM runner does. Do not use the name of any specific product.”

You should see: a coherent paragraph, arriving in under three minutes end-to-end, from a model you pulled to your own disk. If it arrives, the install is done. If you see an error or the response never appears (wait at least 90 seconds before assuming it's broken), see the troubleshooting callout below.

Safe default — Cost and privacy posture for local

  • Cost: zero marginal cost per run. Your electricity bill is the only variable.
  • Privacy: your prompt does not leave the machine. This is the privacy mode.
  • Default this lesson commits to: models are pulled from the Ollama registry over the network during install, but once pulled, all inference is local. You can verify by turning off Wi-Fi and running a prompt — if the response still comes back, the model is fully on your machine.

Troubleshooting

  • “ollama: command not found.” Terminal opened before the install finished registering. Close all Terminal windows and open a new one. On Mac, also try Cmd+Q-ing the Ollama app and relaunching it.
  • Model pull stalls or shows “0 B/s.” Your connection dropped or the registry is slow. Cancel with Ctrl+C, then re-run the same ollama pull command — it resumes from where it stopped.
  • Response is very slow (under 5 tok/s). Your machine is CPU-only and the model is too large. Try the next size down: ollama pull llama3.2:3b-instruct-q4_K_M or switch to Mistral 7B.
  • Out-of-memory crash. Close other apps, especially browsers. If it keeps happening, you are above your machine's RAM ceiling; use a smaller model.
  • /bye doesn't exit. Press Ctrl+C to force-quit the session. You can always close the terminal window.

Linux equivalents. Install with curl -fsSL https://ollama.com/install.sh | sh. (Read the script first if you're cautious — it's standard practice. The URL goes to Ollama's official site.) The same ollama pull and ollama run commands apply. Full Linux recipes are on the backlog for Q3 2026.

Alternative path: LM Studio (graphical front end) RECIPE

Tool: LM Studio
Version tested: 0.3.x
Last verified: 2026-04-17
Next review: 2026-07-17
Supported OSes: macOS, Windows

If a graphical interface (clicking instead of typing) suits you better than the command line, LM Studio does the same job as Ollama behind a chat-style window. Download from https://lmstudio.ai. The app handles model search, download, and chat in one place. Pull the same model classes (Llama 3.1 8B Instruct or Mistral 7B Instruct — the model-name format above still applies).

Tradeoffs vs. Ollama: LM Studio is friendlier the first hour; Ollama is easier to script against from Cowork and Claude Code, which you will want by Module 6. The course recipes assume Ollama from Lesson 2.5 onward, but if LM Studio is what gets you over the first-install hump, start there and migrate to Ollama before Module 6.

Safe default — Cost and privacy posture for LM Studio. Same as Ollama: no per-run cost, no prompt leaves the machine after install. Verify by turning off Wi-Fi and sending a prompt.

Try it CORE

Local-model latency & cost calculator

What you do. Run three prompts through your local model — one short, one medium, one long — and for each one, record two numbers: the latency (how many seconds between hitting Enter and seeing the first word) and the tokens per second (how fast the response generates after that). Enter them in the calculator. The calculator shows you:

  • Whether your setup is comfortable, workable, or below threshold for each prompt length.
  • A rough “break-even” estimate vs. cloud pricing: at what daily volume your local setup saves money given your measured throughput (the sketch below shows the shape of that arithmetic).
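If you want to sanity-check the calculator by hand, the break-even arithmetic is small enough to sketch. The numbers below are placeholders; the cloud rate, usage minutes, and task size are assumptions, not real provider prices:

    # Sketch of the break-even arithmetic (placeholder numbers throughout).
    measured_tok_per_sec = 12        # from your three-prompt measurements
    minutes_of_use_per_day = 30      # assumed daily local-model usage
    cloud_price_per_million = 1.00   # $/million tokens -- PLACEHOLDER rate

    tokens_per_day = measured_tok_per_sec * 60 * minutes_of_use_per_day
    saved_per_day = tokens_per_day / 1e6 * cloud_price_per_million
    print(f"{tokens_per_day:,} tokens/day locally, ~${saved_per_day:.2f}/day saved")
    # 21,600 tokens/day locally, ~$0.02/day saved

The punchline the calculator makes visible: at hobby volumes the dollar savings are tiny; the real reasons to run local are the privacy posture and the fallback, not the bill.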

No local model installed? If your machine cannot run a local model (under 8 GB of RAM, Chromebook, etc.), use the paper version of the worksheet and a cloud model instead — record the same latency and tokens-per-second numbers from your cloud provider's response timing once you have one set up in Lesson 2.4. The lesson on directing applies either way; the threshold numbers will be different but the framework holds.

Deliverable. Your filled-in table of the three prompts, plus one-sentence verdicts: “Local is comfortable for tasks of this size on my machine: ___. Too slow for: ___.”

Done with the hands-on?

When the recipe steps and the activity above are complete, mark this stage to unlock the assessment, reflection, and project checkpoint.

Stage 3 of 3

Check & Reflect

key concepts, quiz, reflection, checkpoint, instructor note

Quick check CORE

Four questions. Pick the best answer, then check your reasoning against the explanation below each one.

Q1. Which of the following best describes a local LLM runner's job?
  • A It trains a model on your data.
  • B It downloads model weights to your disk and lets you prompt the model on your own machine.
  • C It converts cloud API calls into local ones.
  • D It compresses cloud models so they are cheaper to use.

Answer: B. The runner's job is “bridge weights to prompts,” locally. A confuses inference with training; runners run models, they do not train them. C describes a proxy, not a runner. D confuses runners with quantization tools.

Q2. You pull an 8 B model and your generation speed is 3 tokens per second. What should you do?
  • A Nothing — that is normal for 8 B models.
  • B Pull a smaller model or move that workload to cloud — 3 tok/s is below the usability threshold for real work.
  • C Upgrade your RAM.
  • D Delete the model and restart.

Answer: B. 3 tok/s is roughly two words per second — well below reading speed, so a one-paragraph answer takes a minute and a half or more. A smaller model (Mistral 7B or Llama 3.2 3B) will be faster; for the tasks where capability really matters, cloud is the right call. A is factually wrong — 8 B at 3 tok/s usually means CPU-only inference on a weaker machine. C may help eventually but is not the immediate answer. D is an overreaction.

Q3. A student runs a prompt containing a private journal entry through Ollama. After the response comes back, they turn off their Wi-Fi and run another prompt. What should they expect?
  • A The second prompt fails because Ollama needs internet.
  • B The second prompt works; Ollama has been local all along.
  • C The second prompt works but the first was sent to the cloud.
  • D Ollama uploads prompts occasionally for quality monitoring.

Answer: B. Once a model is pulled, inference is fully local. The journal entry never left the machine. The Wi-Fi test is a deliberate way to verify this — if prompts still work offline, you have proof. A is wrong; Ollama only needs the network during ollama pull. C is a confused model of how runners work. D would make the privacy story false, and responsible runners do not do this.

Q4. Decode this model name: llama3.1:8b-instruct-q4_K_M. Which is correct?
  • A Llama family, version 3.1, 8 billion parameters, instruction-tuned, 4-bit quantization.
  • B Llama family, 3.18 billion parameters, instruction-tuned, q4 mode.
  • C The numbers don't mean anything specific; it's just an identifier.
  • D Llama 3.1, an 8-bit model, with quantization mode K_M.

Answer: A. The format is family : size - purpose - quantization. llama3.1 is the family (Meta's Llama 3.1 generation); 8b is 8 billion parameters; instruct means the instruction-tuned variant; q4_K_M is the 4-bit quantization, “K_M” flavor. B and D misread the parts. C is wrong — every part is meaningful.

Reflection prompt

The first token landed.

In 4–6 sentences: Describe what you noticed when the model's first token appeared — was it faster or slower than you expected, and what did it change (if anything) about how you think about “where the model is”? Then: Based on your measured tokens/sec, name one real task you now trust local for and one you do not.

Project checkpoint

Record your local-model setup in your loop file.

Open my-first-loop.md. Below the Model line you wrote in Lesson 2.1, add three lines:

Hardware: [your RAM, your chip, your free disk space]
Local-model runner installed: [Ollama / LM Studio / none yet]
Local-model throughput on my machine: [tokens/sec from the activity]. Usable for tasks like: [one phrase].

Save the file. If you installed both Ollama and LM Studio, note both. If your machine couldn't run either, write “skipped — running cloud-only” — that is a defensible posture and Lesson 2.4 still works.

Next in Module 2

Lesson 2.4 — Cloud AI accounts, API keys, and coding agents.

Stand up an Anthropic console account, set a real cost cap, generate an API key, store it in .env, run a verified Python test, and install Cowork and Claude Code as your coding agents. The full directing surface, set up in one tightly coupled lesson.

Continue to Lesson 2.4 →