Windows Driver & GPU Troubleshooting for Local LLM Workloads

Part 4 of a 9-part series teaching CompTIA A+ (Core 1 / 220-1101 and Core 2 / 220-1102) through a real build: a private, local AI workstation/server for a small business.

The job: the model won't load

You've stood up the inference server, pulled the model weights, and issued the load command. The process starts, churns for thirty seconds, then dies with a message you don't expect: out of memory. The card is a 16 GB GPU. The model is a 7B-parameter quantized checkpoint that everyone on the internet says fits in 8 GB. The card is sitting there, mostly idle.

This is one of those failures that feels like a hardware fault and turns out to be a driver/software configuration problem, specifically a VRAM-budgeting mismatch between the inference server's default assumptions and the actual free memory on the card. The fix is two configuration changes and a framework reinstall. But you don't know it's a configuration problem until you've walked the GPU stack from top to bottom and ruled out the hardware and driver layers first. That walk is what this article is about.

The GPU stack on a Windows-hosted workstation running a Linux inference server has four distinct layers, and a problem at any layer produces symptoms that look like a problem at a different layer:

1. Physical layer: the card is seated, powered, and recognized by the motherboard.
2. Driver layer: the vendor driver (visible in Windows Device Manager) and the CUDA toolkit and runtime on the Linux side must match.
3. Inference server layer: the server's VRAM budget setting controls how much of the card's memory the server claims on startup.
4. Model layer: the model's weight footprint must fit within the budgeted VRAM.

You verify them in order, bottom to top. Don't skip to layer 3 when the symptom manifests at layer 4: the fix you find will be fragile if layer 2 is broken underneath it.

📘 Objectives covered (220-1101 / 220-1102)

This article maps to the following CompTIA A+ exam objectives.

Core 1 (220-1101)
- 3.4, Expansion cards (GPU hardware): PCIe slot requirements, physical seating, and supplemental power connectors. What the card needs from the motherboard and PSU to function at all.
- 3.5, Power supply: PSU output requirements for high-TDP GPUs; symptoms of insufficient power (intermittent failures, card not recognized).
- 5.x, Hardware/display troubleshooting: systematic hardware troubleshooting methodology; display/GPU failure modes; reading nvidia-smi output as a diagnostic tool.

Core 2 (220-1102)
- 1.x, Operating systems (Windows Device Manager): reading error codes, updating drivers, rolling back a bad driver update. The difference between the Windows vendor driver and the CUDA toolkit that user-space GPU software depends on.

Concepts taught below: GPU/VRAM basics, PCIe and power requirements, the driver stack (vendor driver → CUDA toolkit → CUDA runtime), Device Manager diagnostics, how an inference server budgets VRAM, and the CUDA version-match problem.

Concepts: the GPU hardware and driver stack

What a GPU needs from the hardware (1101 3.4 / 3.5)

A discrete GPU connects to the system via PCIe, almost always in a full-length x16 slot for a full-speed workstation card. The slot carries both data (the PCIe lanes) and up to 75 W of power; a high-TDP card also draws power from one or more supplemental PCIe power connectors (the 6-pin and 8-pin connectors from the PSU, or the newer 16-pin connector, rated for up to 600 W, on top-tier cards). The card will not function correctly if the supplemental connectors are absent or mismatched: some cards simply won't POST; others run at reduced speed or crash under load.

For an inference workload, three hardware facts matter most. PCIe x16 bandwidth: a card sharing lanes at x8 doesn't crash but bottlenecks weight transfers. PSU headroom: a high-end GPU pulls 300–600 W under compute load; an under-spec'd PSU produces random crashes during weight loading that look like software bugs. Physical seating: the retention latch must be fully engaged; a half-seated card may pass idle but lose contact under thermal expansion at load.
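The PSU headroom point can be made concrete with a little arithmetic. The sketch below is illustrative only: the function name, the wattages, and the 20% safety margin are assumptions chosen for the example, not figures from the build or from any vendor guidance.

```python
def psu_headroom_watts(psu_rated_w: float, gpu_peak_w: float,
                       rest_of_system_w: float, margin: float = 0.20) -> float:
    """Watts left over after reserving a safety margin on the PSU rating.

    `margin` is an assumed rule of thumb (keep 20% of the rating in reserve),
    not a specification. A negative result means the PSU is under-sized.
    """
    usable = psu_rated_w * (1 - margin)
    return usable - (gpu_peak_w + rest_of_system_w)


# Hypothetical build: a 450 W GPU plus roughly 200 W for everything else.
print(f"{psu_headroom_watts(850, 450, 200):.0f} W")   # 30 W
print(f"{psu_headroom_watts(650, 450, 200):.0f} W")   # -130 W
```

A negative number here is the kind of configuration that tends to show up as crashes during weight loading rather than at idle.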

VRAM: why it's the gating resource

VRAM (Video RAM) is the dedicated memory on the GPU, separate from system RAM and not interchangeable with it. For AI inference, VRAM is the gating constraint. A language model's weights must fit entirely in VRAM to run at GPU speed; if they don't, the framework either refuses to load or spills to system RAM and CPU, dropping performance by an order of magnitude.

The VRAM budget for a given model has three components:

Model weights: the static size of the checkpoint, after quantization if applicable (a 7B model at 4-bit quantization is roughly 4–5 GB).
KV cache: the runtime memory the model uses to track the attention state of in-progress conversations. vLLM and similar servers allocate this aggressively up front to support continuous batching; it can be several GB on its own.
Framework overhead: the CUDA context, kernel buffers, and memory for graph capture (if the server uses torch.compile or CUDA graphs). This is typically 1–2 GB and is often left out of VRAM estimates.

Add those three together and "a 7B model at 8 GB" quickly becomes "a 7B model that needs 11–13 GB loaded", and a 16 GB card with a compositor or OS reservation holding 1–2 GB may not have enough free headroom at the default settings.
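The three-component sum above can be written down as a quick estimate. This is a minimal sketch with hypothetical helper names and illustrative numbers (the 6 GB KV cache figure in particular is an assumption picked to land inside the 11–13 GB range mentioned above).

```python
def estimate_vram_gib(weights_gib: float, kv_cache_gib: float,
                      overhead_gib: float) -> float:
    """Sum the three components of a model's loaded VRAM footprint."""
    return weights_gib + kv_cache_gib + overhead_gib


def fits(card_gib: float, reserved_gib: float, needed_gib: float) -> bool:
    """True if the footprint fits in what the OS/compositor leaves free."""
    return needed_gib <= card_gib - reserved_gib


# Illustrative numbers only: a 4-bit 7B checkpoint, a generous KV cache,
# and mid-range framework overhead.
needed = estimate_vram_gib(weights_gib=4.5, kv_cache_gib=6.0, overhead_gib=1.5)
print(needed)                    # 12.0
print(fits(16.0, 1.5, needed))   # True
print(fits(12.0, 1.5, needed))   # False
```

Note that this only checks the model's own footprint against free memory; the server's up-front VRAM claim, covered in Step 3, is a separate and stricter test.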

The driver stack: vendor driver, CUDA toolkit, CUDA runtime (1102 1.x)

On a Windows host running a Linux layer (WSL2), the GPU driver stack has more pieces than it appears:

Windows vendor driver: the NVIDIA (or AMD, or Intel) driver installed via Windows Update or downloaded from the vendor. This is what Device Manager shows you. It exposes the GPU to Windows and, through a paravirtualization layer, to the Linux kernel running inside WSL2.

CUDA toolkit: a set of compilers, libraries, and headers for building GPU software. Installing a specific version of the toolkit installs a matching set of runtime libraries.

CUDA runtime: the library that the inference server links against at runtime. This is the piece that must match the toolkit version baked into the GPU computing framework (PyTorch, JAX, etc.). If the runtime version doesn't match, you get silent heap corruption or an immediate crash rather than a clear "version mismatch" error.

The version-match requirement is the trap: a PyTorch build compiled for CUDA 12.8 may import cleanly against a 13.x runtime but corrupt the heap when weights load. The error you see is free(): invalid pointer, not "version mismatch." The ABI difference only surfaces under compute load.
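Because nothing in the stack raises a clear mismatch error on its own, it can help to compare the two version strings explicitly before loading any weights. The helper below is a hypothetical sketch: the function names are made up for illustration, and it only compares strings you supply (for example the value of torch.version.cuda and the version of your installed runtime).

```python
def cuda_major_minor(version: str) -> tuple[int, int]:
    """Parse '12.8' or '12.1.105' into (major, minor)."""
    parts = version.strip().split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    return major, minor


def check_cuda_match(framework_cuda: str, runtime_cuda: str) -> str:
    """Classify how the framework's build version relates to the runtime."""
    fw = cuda_major_minor(framework_cuda)
    rt = cuda_major_minor(runtime_cuda)
    if fw == rt:
        return "match"
    if fw[0] != rt[0]:
        return "major mismatch"
    return "minor mismatch"


print(check_cuda_match("12.8", "12.8"))      # match
print(check_cuda_match("12.8", "13.0"))      # major mismatch
print(check_cuda_match("12.1", "12.4.131"))  # minor mismatch
```

How much risk a minor mismatch carries depends on the libraries involved; the point of the sketch is only to make the comparison explicit instead of waiting for a crash at weight load.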

Hands-on walkthrough: diagnosing from hardware to application

Step 1: Verify the card is healthy at the physical/driver layer

Before touching any software setting, confirm the hardware is recognized and healthy.

In Windows Device Manager (1102 1.x):

Open Device Manager (devmgmt.msc) and expand Display adapters. The GPU should appear with no error icon. Error codes to know:

Code	Meaning	Common cause
Code 43	"Windows has stopped this device because it reported problems"	Driver corruption, ABI mismatch, or hardware fault
Code 10	"This device cannot start"	Driver not found or failed to initialize
Code 12	Insufficient resources assigned	PCIe resource conflict, try reseating or changing slots
Code 45	"Currently not connected"	Physical seating issue or missing power

Code 43 on a GPU that was working is almost always a driver issue, not a hardware fault. The first response is a clean driver reinstall, not a hardware swap.

To update or roll back a driver in Device Manager: right-click the device → Properties → Driver tab → Update Driver (or Roll Back Driver if a recent update caused the regression). On NVIDIA hardware, use DDU (Display Driver Uninstaller) in safe mode for a genuinely clean reinstall; Windows' built-in update path often leaves residue that causes Code 43 on reinstall.
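The error-code table and the first-response advice above can be condensed into a small triage lookup. This is only a restatement of the table as code; the dictionary, the function name, and the fallback message are illustrative.

```python
# First action to take for each Device Manager code, per the table above.
FIRST_ACTION = {
    43: "Clean driver reinstall; suspect hardware only if that fails",
    10: "Reinstall or update the driver",
    12: "Reseat the card or try a different PCIe slot",
    45: "Check physical seating and supplemental power connectors",
}


def first_action(code: int) -> str:
    """Return the suggested first step for a Device Manager error code."""
    return FIRST_ACTION.get(code, "Not covered here; look the code up before acting")


print(first_action(43))
print(first_action(99))
```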

From inside the Linux layer:

nvidia-smi

The header line shows Driver Version and CUDA Version (the maximum the driver supports, not necessarily what's installed). The Memory-Usage field shows how much VRAM is already consumed; on a card with a compositor or host reservation that can be 1–2 GB before the inference server starts. If nvidia-smi reports "no devices found" from inside Linux, the GPU passthrough from Windows is misconfigured. That is a WSL2 enablement step to fix, not a Linux driver problem.
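To capture the idle reservation as a number instead of reading it off the table, nvidia-smi's CSV query mode can be parsed from a script. A minimal sketch: the function names are hypothetical, and the parser assumes every field is a plain integer in MiB (if a field comes back as "[N/A]" on your setup, this will raise rather than guess).

```python
import subprocess

# nvidia-smi's query mode prints one CSV row per GPU, values in MiB.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.total,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_memory_csv(text: str) -> list[tuple[int, int]]:
    """Turn '16384, 1343' rows into (total_mib, used_mib) tuples."""
    rows = []
    for line in text.strip().splitlines():
        total, used = (int(field) for field in line.split(","))
        rows.append((total, used))
    return rows


def read_gpu_memory() -> list[tuple[int, int]]:
    """Run nvidia-smi and parse its output (needs a working driver)."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_memory_csv(out)


# Parsing a sample row, so the example runs without a GPU present:
print(parse_memory_csv("16384, 1343\n"))   # [(16384, 1343)]
```

Run read_gpu_memory() before the inference server starts; the second value of each tuple is the reservation to subtract in Step 3.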

Step 2: Confirm the CUDA version stack matches

The inference server's GPU framework (PyTorch, in the case of vLLM) is compiled against a specific CUDA version. Mismatching that version against what's installed is a common source of cryptic startup crashes.

# Check the CUDA version the installed PyTorch was compiled for:
python3 -c "import torch; print(torch.version.cuda)"
# Example output: 12.1  ← this is the CUDA the torch binary expects

# Check the CUDA toolkit installed on the system, if any:
nvcc --version   # shows the CUDA toolkit compiler; often absent on inference boxes
# Or check the driver's reported maximum. The first line of nvidia-smi output
# is a timestamp, so filter for the header line rather than using head -1:
nvidia-smi | grep "CUDA Version"
# NVIDIA-SMI 570.86.15    Driver Version: ...    CUDA Version: 13.1

Two caveats: nvidia-smi's "CUDA Version" header shows the maximum the driver can support, not the installed runtime; nvcc --version shows the toolkit compiler and is often absent on inference boxes. The number that matters is torch.version.cuda, which is what your framework binary was built against.
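Since the header value is a ceiling rather than an installed version, the useful check is whether the framework's CUDA version is at or below it. The sketch below parses the header line with a regular expression and makes that comparison; the function names and the sample header values are illustrative, and the "at or below the driver maximum" rule is the general compatibility expectation, not a guarantee for every library combination.

```python
import re

HEADER_RE = re.compile(r"Driver Version:\s*(\S+)\s+CUDA Version:\s*(\d+)\.(\d+)")


def parse_smi_header(text: str):
    """Return (driver_version, (cuda_major, cuda_minor)) or None."""
    m = HEADER_RE.search(text)
    if m is None:
        return None
    return m.group(1), (int(m.group(2)), int(m.group(3)))


def driver_can_run(framework_cuda: str, driver_max: tuple) -> bool:
    """True if the framework's CUDA version is within the driver's maximum."""
    major, minor = (int(p) for p in framework_cuda.split(".")[:2])
    return (major, minor) <= driver_max


# Sample header line with illustrative values:
sample = "| NVIDIA-SMI 570.86.15    Driver Version: 570.86.15    CUDA Version: 12.8     |"
driver, cuda_max = parse_smi_header(sample)
print(driver, cuda_max)                    # 570.86.15 (12, 8)
print(driver_can_run("12.1", cuda_max))    # True
print(driver_can_run("13.0", cuda_max))    # False
```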

If that doesn't match your installed runtime, the import may succeed but the crash arrives when the first GPU kernels run during weight loading. The fix is to reinstall PyTorch from its own wheel index with the matching cu-tag. Plain pip install torch pulls whatever PyPI ships by default, which is not necessarily built for your CUDA version: on Linux it is a wheel built for a single default CUDA version, and on Windows it is a CPU-only build.

# Remove the mismatched build first:
pip uninstall -y torch torchvision torchaudio

# Reinstall from PyTorch's wheel index, using the cuXXX tag that matches
# your installed CUDA runtime (cu121 for CUDA 12.1, cu124 for 12.4, etc.):
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
# Verify:
python3 -c "import torch; print(torch.version.cuda)"
# Should now match your runtime: 12.1

The pattern: install docs say to add --index-url .../cu128 for a CUDA 12.8 build, but the system runtime is different. The ABI mismatch shows as heap corruption on weight load, not a version warning. Don't hard-pin a cu-tag without verifying it against your actual runtime: use torch.version.cuda to confirm after reinstall.
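One way to avoid hard-pinning the wrong tag is to derive it from the runtime version you actually measured. A small sketch, assuming the cuXXX naming shown above (major and minor digits concatenated); the helper names are hypothetical, and not every tag this produces is guaranteed to exist on the index, so check before installing.

```python
PYTORCH_WHEEL_BASE = "https://download.pytorch.org/whl"


def cu_tag(cuda_version: str) -> str:
    """Map a CUDA version like '12.1' or '12.4.131' to a tag like 'cu121'."""
    major, minor = cuda_version.strip().split(".")[:2]
    return f"cu{int(major)}{int(minor)}"


def wheel_index_url(cuda_version: str) -> str:
    """Build the --index-url value for the given CUDA runtime version."""
    return f"{PYTORCH_WHEEL_BASE}/{cu_tag(cuda_version)}"


print(cu_tag("12.1"))            # cu121
print(wheel_index_url("12.4"))   # https://download.pytorch.org/whl/cu124
```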

Step 3: Read the inference server's VRAM budget setting

Which inference server you run depends less on the hardware than on who's using it. For a single person at one workstation (every local tool talking to one model) a lightweight server like Ollama is plenty. But once it's a shared box, a home lab or a small-business server where several people query the same model at the same time, you want a server built for concurrency. That's where vLLM-style servers come in: they use continuous batching to serve many simultaneous requests from a single loaded model, and they pre-allocate a fixed slice of VRAM on startup so throughput stays predictable under load instead of thrashing as requests pile up. That budget is set by a knob usually called gpu_memory_utilization: the fraction of the card's total VRAM the server claims for itself. The default is usually 0.90 or 0.92 (90–92% of the card).

On a 16 GB card with a compositor holding 1.3 GB:

Total VRAM: 16,384 MiB
OS/compositor reservation: ~1,343 MiB
Effective free at startup: ~15,041 MiB
Default utilization (0.90): 0.90 × 16,384 = 14,746 MiB target
Free at startup after compositor: 15,041 MiB — works, just barely
Default utilization (0.92): 0.92 × 16,384 = 15,073 MiB target
Free at startup after compositor: 15,041 MiB — FAILS (32 MiB short)

The server calculates the target against the card's total VRAM, not the free
VRAM at startup. If the OS reservation leaves you short, the allocation fails with
an out-of-memory error even though nvidia-smi shows "plenty of free memory" to
anyone eyeballing the number without doing the math.
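The arithmetic above is easy to script so you can rerun it with your own idle reading. This is a minimal sketch that follows the article's description (target computed against total VRAM, compared with free VRAM at startup); the function names and the 512 MiB safety buffer are assumptions for illustration.

```python
import math


def startup_budget(total_mib: int, reserved_mib: int, utilization: float):
    """Return (target, free, headroom) in MiB; negative headroom means OOM."""
    target = round(utilization * total_mib)
    free = total_mib - reserved_mib
    return target, free, free - target


def max_safe_utilization(total_mib: int, reserved_mib: int,
                         buffer_mib: int = 512) -> float:
    """Largest two-decimal utilization that still leaves buffer_mib free.

    buffer_mib is an assumed cushion, not a value from any server's docs.
    """
    usable = total_mib - reserved_mib - buffer_mib
    return math.floor(usable / total_mib * 100) / 100


for u in (0.90, 0.92, 0.85):
    target, free, headroom = startup_budget(16384, 1343, u)
    print(f"{u:.2f}: target {target} MiB, free {free} MiB, headroom {headroom:+d} MiB")
# 0.90: target 14746 MiB, free 15041 MiB, headroom +295 MiB
# 0.92: target 15073 MiB, free 15041 MiB, headroom -32 MiB
# 0.85: target 13926 MiB, free 15041 MiB, headroom +1115 MiB

print(max_safe_utilization(16384, 1343))   # 0.88
```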

In a vLLM-style server this setting lives in the runtime configuration: one
value that feeds both the launch argument and the serving manager's VRAM ledger.
Keep it a single source of truth: patching the launch argument without
updating the ledger causes the manager to over-commit the GPU on the next request.

# Runtime config:
runtimes:
  - name: vllm
    gpu_memory_utilization: 0.85   # lower from 0.90 to leave room

The runtime config feeds this value into the per-instance env file the systemd template sources:

# /etc/inference/instances/<instance>.env
# Written by the serving manager before starting the instance.
VLLM_MODEL=org/model-name-awq
VLLM_PORT=8000
VLLM_TENSOR_PARALLEL=1
VLLM_EXTRA_ARGS=--quantization awq --gpu-memory-utilization 0.85
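The single-source-of-truth rule is easiest to keep when the env file is generated rather than edited by hand. The sketch below shows one way a serving manager might render the file above from the one utilization value in the runtime config. It is a hypothetical illustration, not the build's actual manager; the function name and argument defaults are assumptions.

```python
def render_instance_env(model: str, port: int, utilization: float,
                        quantization: str = "awq",
                        tensor_parallel: int = 1) -> str:
    """Render the per-instance env file from one utilization value."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("gpu_memory_utilization must be in (0, 1]")
    extra_args = (
        f"--quantization {quantization} "
        f"--gpu-memory-utilization {utilization:.2f}"
    )
    lines = [
        f"VLLM_MODEL={model}",
        f"VLLM_PORT={port}",
        f"VLLM_TENSOR_PARALLEL={tensor_parallel}",
        f"VLLM_EXTRA_ARGS={extra_args}",
    ]
    return "\n".join(lines) + "\n"


print(render_instance_env("org/model-name-awq", 8000, 0.85), end="")
```

Because the launch argument is derived from the same number the manager records in its ledger, the two cannot drift apart.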

# /etc/systemd/system/inference-server@.service (template)
[Service]
EnvironmentFile=/etc/inference/instances/%i.env
ExecStart=/opt/ai/venv/bin/python -m vllm.entrypoints.openai.api_server \
  --model ${VLLM_MODEL} \
  --port ${VLLM_PORT} \
  --tensor-parallel-size ${VLLM_TENSOR_PARALLEL} \
  $VLLM_EXTRA_ARGS
Restart=on-failure
RestartSec=15
TimeoutStartSec=600

TimeoutStartSec=600 matters: a cold model load with CUDA graph capture takes several minutes. The default 90-second systemd timeout kills the service first, producing a "start timed out" that looks like a crash.

Step 4: Tune and reload

With the CUDA stack confirmed and the budget setting in hand, the repair is:

Set gpu_memory_utilization to 0.85 in the runtime config (or lower if the OS reservation is larger on your hardware: measure it with nvidia-smi at idle before the inference server starts).
Restart the inference server:

   sudo systemctl restart inference-server@instance-name

Watch the startup logs:

   sudo journalctl -fu inference-server@instance-name

You're looking for the KV cache allocation line: vLLM logs how many blocks it allocated, which tells you the actual VRAM footprint at the chosen utilization.
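To turn a block count from the log into a memory figure, multiply it out by the per-token size of the KV cache. The sketch below uses the standard estimate (a key and a value tensor per layer, per KV head). The model shape, the 16-token block size, and the block count of 1000 are all assumptions for illustration; substitute the values from your model's config and your own log.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes of KV cache per token: key + value for every layer and KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_cache_mib(num_blocks: int, block_size: int, bytes_per_token: int) -> float:
    """Total KV cache size in MiB for a given number of allocated blocks."""
    return num_blocks * block_size * bytes_per_token / (1024 ** 2)


# Assumed 7B-class shape: 32 layers, 32 KV heads, head dimension 128, fp16.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)
print(per_token)                          # 524288
print(kv_cache_mib(1000, 16, per_token))  # 8000.0
```

Models that use grouped-query attention have far fewer KV heads than attention heads, which shrinks this figure considerably.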

Verification: prove the full stack

Once the server is running, verify every layer you walked through:

1. Hardware/driver layer (from Linux):

nvidia-smi
# Driver version present; no error in the first column.
# Memory-Usage shows the server's active reservation.

2. CUDA version match:

python3 -c "import torch; print(torch.version.cuda)"
# Must match the installed CUDA runtime, and must be no newer than the
# maximum "CUDA Version" shown in the nvidia-smi header.

3. No Device Manager error codes (from Windows):

Open Device Manager → Display adapters → the GPU → Properties. The device status should read "This device is working properly." No error codes.

4. The inference server loads and answers:

# Health endpoint: print only the HTTP status code, which should be 200:
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000/health

# Model list — confirms the model is loaded:
curl -s http://localhost:8000/v1/models | python3 -m json.tool

# A quick probe completion:
curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "your-model-id", "prompt": "Hello", "max_tokens": 10}' \
  | python3 -m json.tool

A 200 on /health plus a completion response verifies the full chain: Windows driver → WSL2 passthrough → CUDA runtime → inference server → model in VRAM → tokens out.
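The first two probes can be folded into a small script for repeated checks. A minimal sketch using only the standard library: the function name is hypothetical, and it assumes the OpenAI-compatible response shape for /v1/models (a "data" list of objects with an "id" field).

```python
import json
import urllib.request


def check_server(base_url: str, timeout: float = 5.0) -> dict:
    """Probe /health and /v1/models; raises on connection or HTTP errors."""
    with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
        healthy = resp.status == 200
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
        payload = json.load(resp)
    # Each entry in "data" describes one served model.
    models = [entry["id"] for entry in payload["data"]]
    return {"healthy": healthy, "models": models}
```

Calling check_server("http://localhost:8000") against the running instance should return a dictionary with healthy set to True and the loaded model's ID in the list. A connection error instead means the server never finished starting, which sends you back to the journalctl output from Step 4.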

🎯 What the exam asks

The A+ exam frames GPU and driver material in predictable ways. Know these:

- Device Manager error codes are tested directly. Code 43 ("Windows has stopped this device") = driver issue or hardware fault, driver reinstall first. Code 10 = device can't start, driver not found or failed. Code 45 = not connected, check physical seating. The exam will give you a code and ask what it means or what to do first.
- Driver update vs. rollback. The exam tests that you know how to update a driver (Device Manager → right-click → Update Driver), but also how to undo a bad update (Properties → Driver tab → Roll Back Driver). Roll back is the right first action when a previously working device broke after a driver update, not a reinstall from scratch.
- Expansion card seating / power. A GPU symptom scenario on the exam often has a physical root cause: card not fully seated, PCIe power connector missing. "Video card not being detected" after installation = check physical seating and supplemental power connectors before touching drivers.
- Integrated vs. dedicated GPU. The exam tests the distinction: an integrated GPU shares system RAM (no dedicated VRAM); a discrete/dedicated GPU has its own VRAM and connects via PCIe. A scenario where "adding a discrete GPU improved performance" points to the integrated GPU being replaced by dedicated silicon with its own memory.
- PSU sizing for GPUs. "Symptoms of insufficient power" for a GPU = crashes or resets under load (not at idle), a PSU sizing symptom, not a driver symptom.

The troubleshooting order the exam expects: physical hardware first, then drivers, then software/configuration. Don't jump to reinstalling the OS for a Code 43.

Common pitfalls (most of these are from the real build)

CUDA version mismatch, no clear error. The cu128/cu130 trap described above: you install a GPU framework with a pinned CUDA index URL, it imports fine, and the first weight load produces free(): invalid pointer, which doesn't sound like a version mismatch. Always verify torch.version.cuda against your installed CUDA runtime (nvcc --version if the toolkit is present, otherwise check runtime library paths). When in doubt, unpin, reinstall from PyTorch's wheel index with the correct cu-tag, and re-verify.
Over-budgeting gpu_memory_utilization for the available free VRAM. The default (0.90–0.92) is calculated against the card's total VRAM, not the free VRAM after the OS and compositor take their share. On any system with a desktop compositor, Wayland session, or Windows host reservation, subtract that overhead first, then set the utilization to leave a buffer. Failing to do this is the immediate cause of the "OOM on a card with plenty of memory" symptom.
Stale driver after an OS update. Windows updates occasionally replace a manually installed GPU driver with an older inbox driver. The result is a Code 43 or a regression in CUDA support: nvidia-smi reports an older CUDA maximum or doesn't run at all. Check Device Manager after any significant Windows update; reinstall the vendor driver if the version number regressed.
Missing PCIe power connectors. A GPU installed by someone unfamiliar with the card's requirements may be connected only to the slot's 75 W bus power, with the supplemental 6-pin or 8-pin connectors left unattached. The card may POST and appear in Device Manager but crash under any real compute load. Always check that all supplemental connectors are seated when a card fails under load but looks fine at idle.
Setting gpu_memory_utilization in two places. Patching the server's launch argument directly and separately setting the VRAM ledger causes them to drift: the manager over-commits on the next request. One config key, one place.
Startup timeout too short / nvidia-smi only on the Windows side. Two separate traps that look alike: (a) the default 90-second systemd TimeoutStartSec kills the service before CUDA graphs finish compiling, so set it to 600. (b) The GPU appears healthy in Device Manager but nvidia-smi inside Linux reports "no devices found." That's a WSL2 GPU passthrough gap, not a Windows driver problem. Always verify from inside the Linux layer.

Recap + what's next

The model wouldn't load because a chain of small mismatches (VRAM budget math that didn't account for the OS reservation, and a CUDA version mismatch that imported cleanly but crashed at weight load) compounded into a failure that looked like a hardware problem. Walking the stack from Device Manager through nvidia-smi, the CUDA version check, and the inference server budget config gave each layer a chance to surface its own issue before the next one was trusted. The physical and Windows driver layers checked out. The fix was two configuration changes and a framework reinstall; the GPU was fine the whole time.

That's the GPU stack done. But the workstation also runs a handful of Linux services (a database, a cache, an API backend, the inference server itself) and after a host reboot one of them keeps failing to start. It's not a driver problem; it's the kind of service-dependency and log-reading problem that shows up constantly on self-hosted Linux systems. Reading logs intelligently and knowing where services break when they start out of order is its own skill set.

Next up: Part 5: "Troubleshooting Docker & Linux Services on a Small Business Server." A container keeps restart-looping after a host reboot. We'll read journalctl output, check systemd unit ordering, trace a bind-mount wipe, and cover the difference between Compose-managed and systemd-managed services, mapping to 1102 3.x software troubleshooting and 1102 1.x Linux service management. See you there.