RDF Industries
SOHO AI Build · Part 3 of 9 · By Josh Rogers

From Cables to Certificates: Networking a Multi-Node AI Cluster

Wiring a multi-node AI cluster from cables to certificates: ports and protocols, IP and DNS, switching, and TLS/HTTPS configured on purpose instead of by accident.

Part 3 of a 9-part series teaching CompTIA A+ (Core 1 / 220-1101 and Core 2 / 220-1102) through a real build: a private, local AI workstation/server for a small business.


The job: node-2 can't reach node-1's inference port

The second machine is physically in the rack, cabled, powered on. You run the join command and it fails with "connection refused" on port 8002, even though the inference server is definitely running on node-1.

"Connection refused on a port" can trace back to several layers: a bad cable, the wrong VLAN, a firewall rule, a DNS failure in the join handshake, or the port bound to 127.0.0.1 instead of 0.0.0.0. Working through it in order (physical first, network layer next, application last) is what separates a ten-minute fix from a two-hour chase.

This article layers the network stack from the bottom up: cabling → ports and protocols → IP addressing and subnets → DNS → switching vs. routing / OSI layers → TLS handshake → and finally, the reason this build uses two completely separate certificate chains. By the end you'll understand every choice, and you'll have the verification commands to confirm each layer.


📘 Objectives covered (220-1101)

Core 1 (220-1101):

  • 2.1, TCP and UDP protocols: TCP vs UDP trade-offs, well-known port numbers, common application-layer protocols and which transport they run on.
  • 2.2, Networking hardware: switches vs routers, managed vs unmanaged, access vs trunk ports.
  • 2.3, Wireless networking protocols (covered contextually): the cluster runs wired; WireGuard tunneling addresses the encrypted-fabric goal.
  • 2.4, Networked host services: DNS resolution, DHCP vs static addressing, well-known service ports.
  • 2.5, SOHO network installation and configuration: IPv4, subnets, CIDR, static vs DHCP, basic switch/router setup.
  • 2.6, Network configuration concepts: TCP/IP stack, IPv4 classes and private ranges, APIPA, DNS resolution chain.
  • 2.7, Internet connection types and network types: LAN, WAN, VPN (WireGuard as the inter-node fabric).
  • 2.8, Networking tools: ping, nslookup/dig, traceroute, ss/netstat, ip addr.

Concepts taught: OSI model in practice, TCP vs UDP with port numbers, IPv4 addressing and subnetting (CIDR), DNS resolution, switches vs routers, TLS handshake (public/private key roles, certificate chains), public CA vs private CA / mTLS, WireGuard VPN tunneling.

Concepts: the network stack, layer by layer

Physical layer: cabling and the switch (1101 2.2, 2.5)

Before you debug a port number or a certificate, you need electrons arriving.

Cabling. Use Cat6 or Cat6A for a multi-node cluster. Cat6 handles 10GBASE-T up to 55 meters; Cat6A gets you the full 100-meter 10 GbE run. Cat5e tops out at 1 Gbps. A marginal crimp passes most traffic and drops random packets under load: it looks like a software bug until you swap the cable.

Switch vs. router. A switch forwards frames on a local segment by MAC address (Layer 2). A router forwards packets between network segments by IP address (Layer 3). Both nodes on the same /24 subnet talk through the switch; the router only matters when a node needs to reach the internet.

An unmanaged switch is plug-and-play and learns MAC addresses automatically: fine for a two-node cluster. A managed switch adds VLANs, port mirroring, and SNMP: useful if you eventually need to segment cluster traffic from office traffic.

ip link show eth0            # state UP = the link has carrier
ethtool eth0 | grep Speed    # Speed: 10000Mb/s confirms the link negotiated 10 GbE

A port negotiating down to 100 Mbps usually means a bad cable, a mismatched SFP module, or a switch port or NIC forced to a lower speed. Check the cable first.


TCP, UDP, and well-known ports (1101 2.1)

TCP (Transmission Control Protocol) opens a connection before sending data via a three-way handshake (SYN → SYN-ACK → ACK), then guarantees delivery and ordering. The cost is latency: at minimum one round-trip before data flows.

UDP (User Datagram Protocol) sends datagrams with no handshake, no acknowledgment, and no retransmission. It is faster, but the application has to handle loss. DNS queries use UDP (fast, small payloads, easy to retry). DHCP uses UDP. WireGuard tunnels over UDP.
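To make "no handshake, no acknowledgment" concrete, here is a minimal Python sketch. The helper name `send_datagram` is illustrative, not part of the build. Note that the send succeeds whether or not anything is listening on the other end.

```python
import socket


def send_datagram(host: str, port: int, payload: bytes) -> int:
    """Send one UDP datagram.

    The return value is the number of bytes handed to the OS. It is not
    proof of delivery: UDP has no handshake and no acknowledgment.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        return sock.sendto(payload, (host, port))
```

A TCP client would fail at connect() if nobody were listening. Here the application only learns about loss if it builds its own reply-and-retry logic, which is what DNS resolvers do.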

| Port | Protocol | TCP/UDP | What it's for |
|---|---|---|---|
| 22 | SSH | TCP | Secure remote shell |
| 53 | DNS | UDP (+ TCP for large responses) | Name resolution |
| 67 / 68 | DHCP | UDP | Server / client address assignment |
| 80 | HTTP | TCP | Unencrypted web |
| 443 | HTTPS | TCP | Encrypted web (TLS) |
| 3389 | RDP | TCP | Windows Remote Desktop |
| 5432 | PostgreSQL | TCP | Database |
| 6379 | Redis | TCP | Cache / pub-sub |

The inference server runs on port 8002 in this build (chosen to avoid the backend API on 8000 and cluster manager on 8001). That port must be open in the firewall, and the service must listen on the right interface.

Diagnosing "connection refused":

  1. Is the service listening? (ss -tlnp on the server)
  2. Is it bound to the right interface? 127.0.0.1 = local only; 0.0.0.0 = all interfaces.
  3. Is a firewall blocking it?

"Connection refused" means the TCP SYN arrived but the OS replied RST: nothing is listening or the packet was actively rejected. "Connection timed out" means the SYN never got a response: a firewall silently dropping it, a wrong IP, or a physical/routing problem.

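The refused-versus-timed-out distinction can also be checked from a script. This is a minimal Python sketch; `probe_tcp` is a hypothetical helper name, and the two-second timeout is an arbitrary choice.

```python
import socket


def probe_tcp(host: str, port: int, timeout: float = 2.0) -> str:
    """Classify one TCP connection attempt.

    'open'      : three-way handshake completed
    'refused'   : the SYN arrived and the peer answered with RST
    'timed out' : no answer at all within `timeout` seconds
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect((host, port))
        return "open"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "timed out"
    except OSError as exc:
        # Anything else, e.g. name resolution failure or no route to host
        return f"error: {exc}"
    finally:
        sock.close()
```

Run from node-2, `probe_tcp("192.168.10.10", 8002)` returning "refused" points at the bind address or a rejecting firewall rule; "timed out" points at dropped packets, addressing, or the physical path.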

IPv4 addressing and subnets (1101 2.5, 2.6)

Every device needs an IP address, a subnet mask, and a default gateway.

| CIDR | Subnet mask | Usable hosts | Typical use |
|---|---|---|---|
| /24 | 255.255.255.0 | 254 | Home / small office |
| /16 | 255.255.0.0 | 65,534 | Medium organization |
| /30 | 255.255.255.252 | 2 | Point-to-point links |
| /32 | 255.255.255.255 | 1 | Single host route |
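The host counts follow from the prefix length: a /n network has 2^(32-n) addresses, minus the network and broadcast addresses for /30 and shorter. A short sketch using Python's standard ipaddress module (the function name `describe_prefix` is illustrative):

```python
import ipaddress


def describe_prefix(prefix_len: int):
    """Return (subnet mask, usable host count) for an IPv4 prefix length."""
    net = ipaddress.ip_network(f"0.0.0.0/{prefix_len}")
    total = net.num_addresses                 # 2 ** (32 - prefix_len)
    # /31 and /32 have no separate network/broadcast addresses to subtract
    usable = total - 2 if prefix_len <= 30 else total
    return str(net.netmask), usable
```

For example, `describe_prefix(24)` gives the 255.255.255.0 mask and 254 usable hosts shown in the table.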

The three RFC-1918 private ranges (not routable on the public internet):

  • 10.0.0.0/8, about 16.7 million addresses
  • 172.16.0.0/12 (172.16.0.0 through 172.31.255.255), about 1 million addresses
  • 192.168.0.0/16, about 65,000 addresses; the common home/office range

Two-node assignment:

node-1:  192.168.10.10/24   gateway: 192.168.10.1
node-2:  192.168.10.11/24   gateway: 192.168.10.1

Both on 192.168.10.0/24: node-2 sends packets to node-1 directly via the switch. Same subnet = switch (local delivery); different subnet = router (gateway). That's the whole switching-vs-routing decision.
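The same-subnet test is the comparison every host makes before sending a packet. A minimal sketch of that decision, using the addresses from this build (the function name `next_hop` is hypothetical):

```python
import ipaddress


def next_hop(local_cidr: str, dst_ip: str, gateway: str) -> str:
    """Decide whether dst_ip is delivered locally or handed to the gateway."""
    iface = ipaddress.ip_interface(local_cidr)   # e.g. 192.168.10.11/24
    dst = ipaddress.ip_address(dst_ip)
    if dst in iface.network:
        return f"direct to {dst} via switch"     # same subnet: Layer 2 delivery
    return f"via gateway {gateway}"              # different subnet: router
```

From node-2, node-1's address stays on the switch, while an internet address such as 8.8.8.8 goes to 192.168.10.1.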

Static vs. DHCP. Cluster nodes get static IPs. DHCP is great for laptops that move around; if node-1's address drifts, the cluster breaks silently. Set static assignments on the nodes or as DHCP reservations (MAC → fixed IP).

APIPA. If a host can't get a DHCP lease, Windows self-assigns a 169.254.x.x/16 address. Seeing 169.254.x.x is the diagnostic signature of "couldn't reach the DHCP server." These addresses are link-local: they work on the local segment only and are never routed.
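As a quick triage aid, the ranges above can be checked in code. This is an illustrative sketch; the `classify` helper and its labels are not part of the build.

```python
import ipaddress

RFC1918 = [ipaddress.ip_network(n)
           for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]
APIPA = ipaddress.ip_network("169.254.0.0/16")


def classify(addr: str) -> str:
    """Label an IPv4 address as APIPA, RFC 1918 private, or other."""
    ip = ipaddress.ip_address(addr)
    if ip in APIPA:
        return "APIPA (no DHCP lease)"
    if any(ip in net for net in RFC1918):
        return "RFC 1918 private"
    return "other"
```

Note the /12 boundary: 172.31.255.255 is private, 172.32.0.1 is not.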


DNS: translating names to addresses (1101 2.4, 2.6)

IP addresses are what computers use; node-1.example.internal is what configurations and humans use. DNS bridges the two.

When node-2 connects to node-1.example.internal, the OS queries a resolver. For a public name, the resolver walks the hierarchy: root nameservers → TLD nameservers → authoritative server for the zone → answer. For a private name like this one, the local hosts file or local DNS server answers directly. The answer is an A record (IPv4 address), an AAAA record (IPv6 address), or a CNAME (alias pointing to another name). Know A vs. AAAA for the exam.

Two options for a private cluster:

  1. /etc/hosts entries, checked before DNS, works offline, no server required. Every node needs its own copy in sync.
   192.168.10.10   node-1  node-1.example.internal
   192.168.10.11   node-2  node-2.example.internal
  2. Local DNS server (dnsmasq, Pi-hole, router built-in), easier at scale, another service to maintain.

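getent hosts node-1.example.internal   # resolves the way applications do (/etc/hosts, then DNS)
nslookup node-1.example.internal       # asks the DNS server directly; ignores /etc/hosts
dig node-1.example.internal A          # what server answered, TTL, record type?
grep '^hosts' /etc/nsswitch.conf
# expect: files dns   (/etc/hosts checked first)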

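If the name lookup fails but ping 192.168.10.10 succeeds, the problem is specifically name resolution. Keep in mind that nslookup and dig query the DNS server directly and never read /etc/hosts; getent hosts node-1.example.internal shows what applications will actually get. Fix the hosts file or the DNS record before debugging anything else.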
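Applications resolve names through the system resolver, which honors /etc/hosts and nsswitch.conf. A sketch of the same lookup from Python, where socket.getaddrinfo goes through that resolver path (`resolve_ipv4` is a hypothetical helper name):

```python
import socket


def resolve_ipv4(name: str) -> list:
    """Return the sorted IPv4 addresses the system resolver gives for `name`.

    Uses the same lookup path as most applications (hosts file, then DNS).
    An empty list means the name did not resolve.
    """
    try:
        infos = socket.getaddrinfo(name, None,
                                   family=socket.AF_INET,
                                   type=socket.SOCK_STREAM)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, (address, port))
    return sorted({info[4][0] for info in infos})
```

On a correctly configured node-2, `resolve_ipv4("node-1.example.internal")` should return the address assigned to node-1; an empty list sends you back to the hosts file or DNS record.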


Switching vs. routing in the OSI model (1101 2.2)

| Layer | Name | What it does | Examples |
|---|---|---|---|
| 1 | Physical | Electrical signals, cables | Cat6, fiber, RJ-45 |
| 2 | Data Link | Framing, MAC addressing, local delivery | Ethernet, Wi-Fi (802.11) |
| 3 | Network | IP addressing, routing | IPv4, IPv6, ICMP |
| 4 | Transport | End-to-end delivery, ports | TCP, UDP |
| 5–7 | Session / Presentation / Application | What the app does; TLS sits here for the handshake | HTTP, DNS, SSH, HTTPS |

A switch operates at Layer 2: it reads Ethernet frames, forwards by MAC address, and learns which MAC lives on which port. A router operates at Layer 3: it reads IP packets and forwards them by IP address between different networks. In this cluster, inter-node traffic never leaves the switch (same /24); the router handles only internet-bound packets.
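The "learns which MAC lives on which port" behavior is simple enough to model. This is a toy sketch of a learning switch, with made-up class and method names. It ignores VLANs, table aging, and broadcast addresses.

```python
class LearningSwitch:
    """Toy Layer 2 switch: learn source MACs, forward by destination MAC."""

    def __init__(self, ports):
        self.ports = list(ports)
        self.mac_table = {}   # MAC address -> port it was last seen on

    def handle_frame(self, in_port, src_mac, dst_mac):
        """Return the list of ports the frame is sent out of."""
        self.mac_table[src_mac] = in_port          # learn where the sender lives
        out_port = self.mac_table.get(dst_mac)
        if out_port is None:
            # Unknown destination: flood to every port except the ingress port
            return [p for p in self.ports if p != in_port]
        if out_port == in_port:
            return []                              # destination is on the same port
        return [out_port]
```

The first frame between two nodes is flooded; once both MACs are learned, each frame goes out exactly one port. No IP address is consulted at any point, which is the Layer 2 versus Layer 3 distinction.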

Troubleshoot from Layer 1 up: cable → switch → IP → DNS → port → app. Diagnosing a certificate problem when the cable is marginal wastes time.


TLS: the handshake and certificate chains (1101 2.1)

TLS (Transport Layer Security) provides authentication (the server proves it's who it claims) and encryption (the conversation is unreadable to eavesdroppers). It uses asymmetric cryptography:

  • Private key: kept secret on the server; signs and decrypts.
  • Public key: shared freely in the certificate; verifies signatures, encrypts data only the private key can decrypt.

A certificate contains the server's public key, the hostname it's valid for (Subject Alternative Name), and a CA's digital signature vouching for it. Browsers carry a built-in trust store of root CAs. They walk the certificate chain (leaf → optional intermediates → root CA) and show the padlock if everything checks out.

The TLS handshake (simplified, TLS 1.2 message order; TLS 1.3 moves the key share into the hello messages):

Client                               Server
  |── ClientHello (TLS version, ciphers) ──►|
  |◄── ServerHello + Certificate + ECDHE ───|
  |  (client verifies cert chain + SAN)     |
  |── ECDHE share + Finished ──────────────►|
  |◄── Finished ────────────────────────────|
  |   [encrypted application data flows]    |

Both sides derive the same symmetric session key from the handshake without transmitting it. All bulk data uses fast symmetric crypto (AES-256, ChaCha20); asymmetric crypto handles only key establishment.

TLS 1.3 is the current standard: drops insecure cipher suites, requires forward secrecy (ephemeral session keys so a later private-key compromise can't decrypt old sessions), and cuts the handshake to one round-trip. This build enforces TLS 1.2 minimum, TLS 1.3 preferred. SSL is deprecated and broken: TLS is its successor.
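A "TLS 1.2 minimum, TLS 1.3 preferred" policy can be expressed in a few lines on the client side. This sketch uses Python's standard ssl module; it illustrates the policy and is not the build's actual configuration. The function name is hypothetical.

```python
import ssl


def make_client_context() -> ssl.SSLContext:
    """Client-side TLS context: verify the chain and hostname, refuse anything below TLS 1.2."""
    ctx = ssl.create_default_context()            # loads system trust store, checks hostname
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2  # floor; TLS 1.3 is negotiated when both sides support it
    return ctx
```

No maximum is set, so the handshake picks the highest version both peers offer.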


Two certificate chains: browser trust vs. node trust

This is the most important design decision in the cluster's TLS setup, and it's a pattern you'll see in any serious multi-service private deployment.

Two distinct trust problems:

  1. Browser → web interface. Browsers require a cert that chains to a root CA already in their trust store, with the hostname matching a SAN.
  2. node-2 → node-1 (peer authentication). Nodes don't need globally-trusted certs; they need to agree on a private CA that both trust, then verify each other's certs against it.

Solving both with one cert would couple browser-cert renewal to node auth (fragile), and would use a globally-trusted cert for machine identity (wrong threat model).

Chain 1: The browser-facing cert.

The reverse proxy (Caddy) terminates TLS on port 443. Three deployment modes, each rendered to a snippet at install time:

# Internal CA — Caddy generates its own root, import once per workstation
tls internal

# Custom cert — operator supplies cert from their org's PKI
# tls /etc/caddy/tls/server.crt /etc/caddy/tls/server.key

# ACME DNS-01 — Let's Encrypt, requires outbound internet + DNS provider API
# (needs a Caddy build that includes the cloudflare DNS module)
# tls ops@example.com {
#     dns cloudflare {$ACME_DNS_TOKEN}
#     resolvers 1.1.1.1
# }

Chain 2: Inter-node mTLS (private CA).

mTLS (mutual TLS) means both sides present certificates and both sides verify the other. Browser TLS is one-way: only the server presents a cert; the client is anonymous at the TLS layer. mTLS proves machine identity in both directions.
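The difference between one-way TLS and mTLS comes down to one server-side setting: whether a client certificate is required. A hedged sketch with Python's ssl module follows. The function name and file parameters are placeholders, and the build's services may configure this differently.

```python
import ssl


def make_mtls_server_context(cert_file=None, key_file=None, ca_file=None) -> ssl.SSLContext:
    """Server-side context that rejects clients without a cert signed by the private CA.

    cert_file / key_file: this node's own certificate and key (placeholder paths).
    ca_file: the private CA root that client certificates must chain to.
    """
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    ctx.verify_mode = ssl.CERT_REQUIRED        # this line is what makes it mutual
    if ca_file:
        ctx.load_verify_locations(cafile=ca_file)
    if cert_file and key_file:
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return ctx
```

A browser-facing server leaves verify_mode at its default (no client cert requested), which is why the client stays anonymous at the TLS layer.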

The join flow:

  1. Admin mints a one-time JWT for the joining node.
  2. node-2 generates a keypair, builds a CSR (Certificate Signing Request, public key + claimed identity), and sends CSR + JWT to node-1's join endpoint.
  3. node-1 validates the JWT (signature, expiry, not-yet-consumed), signs the CSR with the private CA's key, returns the signed cert and CA root.
  4. node-2 stores both. From here on each side verifies the other's cert chains to the shared private CA root.

The JWT is marked CONSUMED in the database after first use: replay attempts are rejected.
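The three checks in step 3 (signature, expiry, not-yet-consumed) can be sketched with the standard library alone. Everything here is an assumption for illustration: the HS256 signing choice, the claim names, the function names, and the in-memory set standing in for the database.

```python
import base64
import hashlib
import hmac
import json
import time


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(secret: bytes, signing_input: str) -> bytes:
    return hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()


def mint_join_token(secret, node_name, token_id, ttl_seconds=600, now=None) -> str:
    """Build a compact HS256-signed token with subject, unique ID, and expiry."""
    issued = int(time.time() if now is None else now)
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    claims = {"sub": node_name, "jti": token_id, "exp": issued + ttl_seconds}
    payload = _b64url(json.dumps(claims).encode())
    signature = _b64url(_sign(secret, f"{header}.{payload}"))
    return f"{header}.{payload}.{signature}"


def redeem_join_token(secret, token, consumed, now=None) -> str:
    """Validate signature, expiry, and single use; return the node name or raise ValueError."""
    current = time.time() if now is None else now
    try:
        header, payload, signature = token.split(".")
        given = _b64url_decode(signature)
    except ValueError:
        raise ValueError("malformed token") from None
    # Verify the signature before trusting anything in the payload
    if not hmac.compare_digest(_sign(secret, f"{header}.{payload}"), given):
        raise ValueError("bad signature")
    claims = json.loads(_b64url_decode(payload))
    if current >= claims["exp"]:
        raise ValueError("expired")
    if claims["jti"] in consumed:
        raise ValueError("already consumed")
    consumed.add(claims["jti"])   # stand-in for marking the row CONSUMED
    return claims["sub"]
```

A second `redeem_join_token` call with the same token raises "already consumed", which is the replay rejection described above. A production system would use a maintained JWT library and a transactional database update so two simultaneous redemptions can't both succeed.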

Why two chains? Different attack surfaces, different trust scopes. Renewing the browser cert doesn't touch node trust. Revoking a node doesn't affect the browser cert. A compromised browser cert can't impersonate a node: different CA, different verification path.


WireGuard: the encrypted inter-node fabric (1101 2.7)

mTLS handles node authentication and encrypts payloads, but anyone on the switch segment can still see which ports are in use and how traffic to each service is shaped (TLS hides the content, not the pattern). For stricter data-handling requirements, the cluster adds WireGuard: a VPN tunnel that wraps all inter-node traffic in an additional encryption layer before it hits the wire.

A VPN (Virtual Private Network) creates an encrypted tunnel between endpoints. WireGuard uses UDP (fast, low-overhead) and modern elliptic-curve cryptography. Its codebase is ~4,000 lines: intentionally small for auditability (vs. OpenVPN's ~70,000).

Each node has a WireGuard keypair (separate from TLS certs). WireGuard creates a virtual network interface (wg0) with its own IP space. Cluster services communicate over WireGuard addresses, not bare Ethernet addresses. The effective path: application → wg0 (encrypted, wrapped in UDP) → physical NIC → switch → peer's NIC → peer's wg0 (decrypted) → application. An eavesdropper on the switch sees only opaque UDP.


Walkthrough: bring node-2 online

Step 1, Physical and IP connectivity

# node-2: confirm IP assignment and layer-3 reachability
ip addr show eth0          # expect: inet 192.168.10.11/24
ping -c 4 192.168.10.10   # 0% packet loss = Layer 1–3 is fine
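getent hosts node-1.example.internal  # confirms the name resolves (/etc/hosts first, then DNS)

If ping fails, the problem is Layer 1–3. If ping succeeds but the name lookup fails, the problem is name resolution only: fix /etc/hosts or the DNS record before touching anything else. (nslookup and dig query the DNS server directly and skip /etc/hosts, so use getent hosts when the names live in the hosts file.)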

Step 2, Configure static IPs (Netplan on Ubuntu Server)

# /etc/netplan/01-cluster.yaml on node-2
network:
  version: 2
  ethernets:
    eth0:
      dhcp4: false
      addresses: [192.168.10.11/24]
      routes:
        - {to: default, via: 192.168.10.1}
      nameservers:
        addresses: [192.168.10.1, 8.8.8.8]
        search: [example.internal]

sudo netplan apply && ip addr show eth0

The search: [example.internal] domain means nslookup node-1 expands to node-1.example.internal automatically.

Step 3, /etc/hosts entries (both nodes)

sudo tee -a /etc/hosts <<'EOF'
192.168.10.10   node-1  node-1.example.internal
192.168.10.11   node-2  node-2.example.internal
EOF

Step 4, WireGuard setup

sudo apt-get install -y wireguard

# Generate keypair on each node:
wg genkey | sudo tee /etc/wireguard/private.key >/dev/null   # keep the private key off the terminal
sudo chmod 600 /etc/wireguard/private.key
sudo cat /etc/wireguard/private.key | wg pubkey | sudo tee /etc/wireguard/public.key

Exchange public keys between nodes, then configure the tunnel:

# /etc/wireguard/wg0.conf on node-1
[Interface]
PrivateKey = <node-1-private-key>
Address = 10.100.0.1/24      # WireGuard virtual network
ListenPort = 51820

[Peer]
PublicKey = <node-2-public-key>
AllowedIPs = 10.100.0.2/32   # only node-2's WireGuard address through this tunnel
Endpoint = 192.168.10.11:51820

The /24 on Address declares the virtual subnet. The /32 on AllowedIPs routes only that peer's specific address through the tunnel: not the whole subnet.
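AllowedIPs works as a routing table keyed by peer: an outbound packet goes to the peer whose AllowedIPs entry contains the destination, with the longest matching prefix winning. A small sketch of that lookup (the function name and the dict layout are illustrative, not WireGuard's API):

```python
import ipaddress
from typing import Optional


def select_peer(dst: str, peers: dict) -> Optional[str]:
    """Return the peer whose AllowedIPs contains dst with the longest prefix, else None.

    peers maps a peer label to its list of AllowedIPs CIDR strings.
    """
    addr = ipaddress.ip_address(dst)
    best_name, best_len = None, -1
    for name, allowed in peers.items():
        for cidr in allowed:
            net = ipaddress.ip_network(cidr)
            if addr in net and net.prefixlen > best_len:
                best_name, best_len = name, net.prefixlen
    return best_name
```

With `{"node-2": ["10.100.0.2/32"]}`, a packet for 10.100.0.2 selects node-2, while 10.100.0.3 matches no peer and is not sent through the tunnel. The same list filters inbound traffic: a decrypted packet from a peer is accepted only if its source address falls inside that peer's AllowedIPs.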

sudo wg-quick up wg0
sudo systemctl enable wg-quick@wg0

Step 5, Firewall rules

sudo ufw allow 22/tcp          # SSH — always first
sudo ufw allow 443/tcp         # HTTPS browser access
sudo ufw allow in on wg0 to any port 8002 proto tcp   # inference — WireGuard only
sudo ufw allow from 192.168.10.11 to any port 51820 proto udp  # WireGuard on node-1
sudo ufw enable && sudo ufw status verbose

The inference port (8002) is only open on the WireGuard interface: not on bare Ethernet. Traffic from node-2 must travel through the encrypted tunnel to reach it.


Verification: confirm each layer

# Layer 4: what's listening and on which interface?
ss -tlnp
# LISTEN 0.0.0.0:443     → Caddy (HTTPS, all interfaces)
# LISTEN 10.100.0.1:8002 → inference (WireGuard virtual IP only)
# LISTEN 127.0.0.1:5432  → Postgres (localhost only)

# TLS cert chain
openssl s_client -connect node-1.example.internal:443 -showcerts </dev/null 2>&1 \
  | grep -E "subject=|issuer=|Verify return code"
# subject=CN=node-1.example.internal
# issuer=CN=Internal CA Root
# Verify return code: 0 (ok)   ← after importing the internal root CA

# Import the CA root if not yet trusted:
sudo cp /etc/ai-server/tls/ca-root.crt /usr/local/share/ca-certificates/ai-server-internal-ca.crt
sudo update-ca-certificates

# WireGuard tunnel state
wg show
# peer: <node-2-pubkey>
#   latest handshake: 30 seconds ago   ← live tunnel
#   transfer: 1.23 MiB received, 4.56 MiB sent
ping -c 4 10.100.0.2   # confirms tunnel delivers packets

# End-to-end: node-2 hits node-1's inference port through the tunnel
curl -s http://10.100.0.1:8002/health
# {"status": "ok", "model": "...", "vram_used_gb": ...}

The ss -tlnp address column tells you who can reach each service at a glance. The latest handshake timestamp in wg show is the key WireGuard indicator: no entry means the tunnel never established.
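Reading the address column can be automated when there are many listeners. A sketch that classifies the local-address field of an ss -tlnp line (the `exposure` helper is hypothetical, and the input is the address:port string only, not the full line):

```python
import ipaddress


def exposure(local: str) -> str:
    """Classify an ss 'Local Address:Port' value by who can reach it."""
    host, _, port = local.rpartition(":")
    host = host.strip("[]")            # IPv6 addresses appear as [::]:443
    host = host.split("%")[0]          # drop interface suffix such as %lo
    if host in ("*", "0.0.0.0", "::"):
        scope = "all interfaces"
    elif ipaddress.ip_address(host).is_loopback:
        scope = "local only"
    else:
        scope = f"only via {host}"
    return f"port {port}: {scope}"
```

Applied to the three listeners above, it reports 443 on all interfaces, 8002 only via 10.100.0.1, and 5432 as local only.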


🎯 What the exam asks

  • TCP vs UDP. TCP = connection-oriented (three-way handshake, reliable delivery, ordered). UDP = connectionless (no handshake, no guaranteed delivery, faster). DNS uses UDP. HTTP/HTTPS/SSH use TCP. "Reliable delivery" or "guaranteed order" → TCP. "Low-latency, loss acceptable" or streaming/VoIP/DNS → UDP.
  • Port numbers are tested directly. Memorize: 22 (SSH), 53 (DNS), 67/68 (DHCP), 80 (HTTP), 443 (HTTPS), 3389 (RDP). The exam also tests which service uses which port: "manage a remote Linux server securely" = SSH = port 22 = TCP.
  • TLS vs SSL. SSL is deprecated and broken; TLS is its successor. The exam may use "SSL certificate" loosely, but if TLS and SSL are both options, pick TLS for anything describing current/secure behavior.
  • Public key vs private key. The public key is shared freely, encrypts data for the owner, verifies the owner's signatures. The private key is kept secret: decrypts and signs. "A message encrypted with the public key can only be decrypted with the private key" is an exact exam concept.
  • Certificate chains. Leaf cert → (optional intermediate) → root CA in trust store. "Verify return code: 0 (ok)" from openssl s_client = chain verified. "Self-signed certificate" = no CA signed it, fine for dev, browser warning in prod.
  • IPv4 private ranges, know all three: 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16. 169.254.x.x = APIPA (host couldn't get a DHCP lease).
  • Switch vs router. Switch = Layer 2, MAC addresses, same network segment. Router = Layer 3, IP addresses, between segments. Two different subnets need a router; same subnet uses a switch.
  • DNS record types. A = IPv4 address. AAAA = IPv6 address. CNAME = alias. MX = mail server. The exam tests which record type to use for a given scenario.
  • ss -tlnp (or netstat -tlnp on older Linux) shows listening TCP ports and owning processes. The exam's classic answer is netstat; ss is the modern replacement.

Common pitfalls (most of these are from the real build)

  • Inference port bound to 127.0.0.1. The service is healthy on node-1 locally but node-2 gets "connection refused." ss -tlnp | grep 8002 immediately shows the bound address. Change the server's bind address to 0.0.0.0 or the WireGuard virtual IP.
  • Firewall blocking the inter-node port. Port is listening correctly, but ufw drops packets. ping works across the tunnel; curl times out. Add ufw allow in on wg0 to any port 8002 and recheck with ufw status.
  • DNS not resolving the internal name. Both the topology config and the TLS cert's SAN reference node-1.example.internal. If the name doesn't resolve, the join handshake fails, and connecting by IP instead fails cert validation (hostname mismatch). Always verify that getent hosts node-1.example.internal returns the right address before debugging TLS.
  • Mixing up the two cert chains. The browser CA root and the private mTLS CA root are different files. Importing the wrong root into the wrong trust store gives confusing errors: browser warns even though node auth works, or vice versa. Keep chains labeled and separate.
  • Clock skew breaking TLS. Certs have notBefore/notAfter timestamps. A clock more than a few minutes off causes "certificate not yet valid" or "certificate has expired" on a brand-new cert. Fix it first:
  sudo timedatectl set-ntp true
  timedatectl status   # System clock synchronized: yes
  • WireGuard handshake established but traffic not flowing. wg show shows a recent handshake, but ping 10.100.0.2 times out. Verify that AllowedIPs matches the peer's WireGuard IP, and check that ip route show lists 10.100.0.0/24 via wg0.
  • Static IP not persisting after reboot. ip addr add is ephemeral. Use Netplan (/etc/netplan/) on Ubuntu Server: the config survives reboots.
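The clock-skew pitfall is easy to reproduce on paper: certificate validation is a comparison between the local clock and the cert's notBefore/notAfter window. A minimal sketch, with a hypothetical function name and made-up example timestamps:

```python
from datetime import datetime, timedelta, timezone


def validity_status(not_before: datetime, not_after: datetime, local_clock: datetime) -> str:
    """Compare the verifier's clock against the certificate's validity window."""
    if local_clock < not_before:
        return "certificate not yet valid"
    if local_clock > not_after:
        return "certificate has expired"
    return "valid"


# Example: a cert issued at noon UTC, checked by a node whose clock runs 5 minutes slow
issued = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
expires = issued + timedelta(days=90)
slow_clock = issued + timedelta(minutes=1) - timedelta(minutes=5)
status = validity_status(issued, expires, slow_clock)
```

Here `status` is "certificate not yet valid" even though the cert was issued a minute earlier in real time, which is why syncing NTP comes before any TLS debugging.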

Recap + what's next

You layered the full network stack for a two-node cluster: physical cabling and switch selection → IP addressing (static assignment in the 192.168.10.0/24 RFC-1918 range) → DNS name resolution → TCP/UDP and port assignments → the TLS handshake and certificate chains → two separate cert chains (public-trust for browsers, private-CA mTLS for nodes) → WireGuard encrypting all inter-node traffic at the VPN level.

Five verification commands tell the story from the bottom up: ip addr (right IP?), ping (Layer 3 reachable?), openssl s_client (TLS chain valid?), wg show (tunnel live?), and curl to the inference port confirming the full chain.

Both nodes are trusted peers; the inference load distributes across them.

Next up: Part 4: "Windows Driver & GPU Troubleshooting for Local LLM Workloads." With two nodes in the cluster, you load a larger model, and one node throws "out of memory" on a card that should have plenty of VRAM. nvidia-smi shows it free, the error is opaque, and the model won't load. We'll walk the GPU stack from Windows Device Manager to the inference server's memory-budgeting configuration, cover the CUDA/driver version-match problem that silently breaks inference containers, and give you the right diagnostic sequence for any GPU workload failure. See you there.