Kubernetes on-prem HA setup guide (k8s v1.27, kernel v6.6.8, Cilium v1.15.4 & HAProxy)

A tutorial on deploying a highly-available Kubernetes cluster on-prem

This tutorial deploys a highly available Kubernetes 1.27* cluster with:

  • Cilium for the CNI + IP address ranges of your choosing
  • Hubble for network monitoring
  • containerd for the CRI
  • HAProxy** for load balancing the API server
  • a RedHat-like operating system as the base (RHEL, AlmaLinux, Rocky)

*Cilium is unstable with Kubernetes above v1.27.
**I assume you have a basic HAProxy instance running.

Before you start

You’ll need five VMs: one for HAProxy, two for control plane hosts and two for workers. In this guide, every command block is labelled with the type of node it should run on:

HAProxy node
echo "I'm a HAProxy command"
Control plane nodes
echo "only run me on the control plane nodes, or a single node if the title says so"
Worker nodes
echo "run me on worker nodes only"

Step 1: HAProxy for kubeapi.example.net

Make a DNS record for kubeapi.example.net however you normally would. Point it at your HAProxy server.

Edit /etc/haproxy/haproxy.cfg to add a frontend for port 6443 traffic:

HAProxy node
frontend frontend_kubecp
  bind 0.0.0.0:6443
  mode tcp
  use_backend backend_kubecp
/etc/haproxy/haproxy.cfg

Also add a backend pointing to the two control plane servers:

HAProxy node
backend backend_kubecp
  mode tcp
  balance roundrobin
  server kube1.example.net kube1.example.net:6443 check
  server kube2.example.net kube2.example.net:6443 check
/etc/haproxy/haproxy.cfg
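
If you later add more control plane hosts, the backend stanza can be generated rather than typed by hand. The following is only a sketch: the function name render_kubecp_backend and the host kube3.example.net are made up for illustration, and it prints a minimal stanza (mode, balance and server lines only) for you to paste into haproxy.cfg yourself.

```shell
# Print a backend stanza with one "server" line per control plane host.
# Hypothetical helper, not part of HAProxy itself.
render_kubecp_backend() {
  printf 'backend backend_kubecp\n'
  printf '  mode tcp\n'
  printf '  balance roundrobin\n'
  for host in "$@"; do
    printf '  server %s %s:6443 check\n' "$host" "$host"
  done
}

# Example with a hypothetical third control plane host:
render_kubecp_backend kube1.example.net kube2.example.net kube3.example.net

# Before restarting HAProxy, the edited file can be validated with:
#   haproxy -c -f /etc/haproxy/haproxy.cfg
```

The -c flag only checks the configuration and exits, so a typo is caught before the running load balancer is touched.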

Save haproxy.cfg and restart HAProxy:

HAProxy node
systemctl restart haproxy

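Check that HAProxy is listening on port 6443. The two backends will show as down until the control plane exists (Step 4). Once it does, wget kubeapi.example.net:6443 should return error 400, which is expected because it sends plain HTTP to the HTTPS API port. If you have the stats endpoint enabled in HAProxy, you can also check the backends visually.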

Screenshot showing HAProxy stats page with two backends for two Kubernetes API servers working with no downtime

Step 2: prepare all Kubernetes nodes

This guide expects a modern Linux kernel that supports socket-LB, which lets Cilium use its eBPF features in place of kube-proxy. You’ll need kernel v4.19.57, v5.1.16, v5.2.0 or greater. If you don’t want to upgrade your kernel, remove the line --skip-phases=addon/kube-proxy in the kubeadm init section.
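
To see whether a node already meets those thresholds, something like the sketch below could be used. The function name kernel_supports_socket_lb is my own invention, and the check only compares version numbers: distribution kernels that carry backports may support the feature on an older version, so treat a "no" as a hint rather than a verdict.

```shell
# Succeed if a kernel release string meets the thresholds listed above:
# 4.19.57+ within 4.19.x, 5.1.16+ within 5.1.x, or anything from 5.2.0 up.
# Hypothetical helper; it compares numbers only and knows nothing about backports.
kernel_supports_socket_lb() {
  ver="${1%%-*}"                      # "6.6.8-1.el8.elrepo.x86_64" -> "6.6.8"
  IFS=. read -r major minor patch rest <<EOF
$ver
EOF
  patch="${patch%%[!0-9]*}"           # tolerate suffixes such as "0+"
  major="${major:-0}"; minor="${minor:-0}"; patch="${patch:-0}"

  if [ "$major" -gt 5 ]; then return 0; fi
  if [ "$major" -eq 5 ]; then
    if [ "$minor" -ge 2 ]; then return 0; fi
    if [ "$minor" -eq 1 ] && [ "$patch" -ge 16 ]; then return 0; fi
    return 1
  fi
  if [ "$major" -eq 4 ] && [ "$minor" -eq 19 ] && [ "$patch" -ge 57 ]; then return 0; fi
  return 1
}

# Check the kernel this shell is running on.
if kernel_supports_socket_lb "$(uname -r)"; then
  echo "kernel $(uname -r) meets the version thresholds"
else
  echo "kernel $(uname -r) is below the thresholds: upgrade it or keep kube-proxy"
fi
```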

Upgrading the Linux kernel on RHEL-based systems (incl. Almalinux/Rocky)…
Control plane nodes
# become root
su root # or sudo -i

# add the repo for elrepo.org where the kernel-ml package is hosted
rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org
dnf install -y https://www.elrepo.org/elrepo-release-8.0-2.el8.elrepo.noarch.rpm

# do the install
dnf -y --enablerepo=elrepo-kernel install kernel-ml

# reboot
reboot

Repeat that for worker nodes too… same thing:

Worker nodes
# become root
su root # or sudo -i

# add the repo for elrepo.org where the kernel-ml package is hosted
rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org
dnf install -y https://www.elrepo.org/elrepo-release-8.0-2.el8.elrepo.noarch.rpm

# do the install
dnf -y --enablerepo=elrepo-kernel install kernel-ml

# reboot
reboot

Run this on all k8s nodes, regardless of worker/control plane type.

Control plane nodes
# become root
su root # or sudo -i

# disable SELinux
setenforce 0
sed -i 's/^SELINUX=enforcing$/SELINUX=permissive/' /etc/selinux/config

# disable and stop firewalld (trust me on this one, don't bother making rules - lxc bridge interfaces will waste your time later if you try)
systemctl stop firewalld
systemctl disable firewalld

# clear out any existing / old Kubernetes configs!
rm -rf /var/lib/etcd/*
rm -rf /etc/cni/net.d/*
rm -rf /etc/kubernetes/*

# turn off swap and disable on startup
swapoff -a
sed -e '/swap/ s/^#*/#/' -i /etc/fstab # comments out fstab lines containing 'swap'

# prepare yum
yum update -y
yum install -y yum-utils
yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo

# install docker/containerd
yum install -y docker-ce docker-ce-cli containerd.io --allowerasing
systemctl enable docker && systemctl start docker

# sort out cgroup driver
cat <<EOT > /etc/docker/daemon.json
{
  "exec-opts": ["native.cgroupdriver=systemd"],
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "100m"
  },
  "storage-driver": "overlay2"
}
EOT

# reload docker
systemctl daemon-reload
systemctl restart docker

# Set up containerd
# (version = 2 is needed for the "io.containerd.grpc.v1.cri" plugin path below to be recognised)
cat <<EOT > /etc/containerd/config.toml
version = 2
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
SystemdCgroup = true
EOT
systemctl restart containerd

# prepare repos
cat <<EOF | tee /etc/yum.repos.d/kubernetes.repo
[kubernetes]
name=Kubernetes
baseurl=https://pkgs.k8s.io/core:/stable:/v1.27/rpm/
enabled=1
gpgcheck=1
gpgkey=https://pkgs.k8s.io/core:/stable:/v1.27/rpm/repodata/repomd.xml.key
exclude=kubelet kubeadm kubectl cri-tools kubernetes-cni
EOF

# install the binaries
yum install -y kubelet kubeadm kubectl --disableexcludes=kubernetes
systemctl enable --now kubelet

# set the KUBECONFIG environment variable
export KUBECONFIG=/etc/kubernetes/admin.conf
Then exactly the same for worker nodes…
Worker nodes
# become root
su root # or sudo -i

# disable SELinux
setenforce 0
sed -i 's/^SELINUX=enforcing$/SELINUX=permissive/' /etc/selinux/config

# disable and stop firewalld (trust me on this one, don't bother making rules - lxc bridge interfaces will waste your time later if you try)
systemctl stop firewalld
systemctl disable firewalld

# clear out any existing / old Kubernetes configs!
rm -rf /var/lib/etcd/*
rm -rf /etc/cni/net.d/*
rm -rf /etc/kubernetes/*

# turn off swap and disable on startup
swapoff -a
sed -e '/swap/ s/^#*/#/' -i /etc/fstab # comments out fstab lines containing 'swap'

# prepare yum
yum update -y
yum install -y yum-utils
yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo

# install docker/containerd
yum install -y docker-ce docker-ce-cli containerd.io --allowerasing
systemctl enable docker && systemctl start docker

# sort out cgroup driver
cat <<EOT > /etc/docker/daemon.json
{
  "exec-opts": ["native.cgroupdriver=systemd"],
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "100m"
  },
  "storage-driver": "overlay2"
}
EOT

# reload docker
systemctl daemon-reload
systemctl restart docker

# Set up containerd
# (version = 2 is needed for the "io.containerd.grpc.v1.cri" plugin path below to be recognised)
cat <<EOT > /etc/containerd/config.toml
version = 2
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
SystemdCgroup = true
EOT
systemctl restart containerd

# prepare repos
cat <<EOF | tee /etc/yum.repos.d/kubernetes.repo
[kubernetes]
name=Kubernetes
baseurl=https://pkgs.k8s.io/core:/stable:/v1.27/rpm/
enabled=1
gpgcheck=1
gpgkey=https://pkgs.k8s.io/core:/stable:/v1.27/rpm/repodata/repomd.xml.key
exclude=kubelet kubeadm kubectl cri-tools kubernetes-cni
EOF

# install the binaries
yum install -y kubelet kubeadm kubectl --disableexcludes=kubernetes
systemctl enable --now kubelet

# no KUBECONFIG to set here: /etc/kubernetes/admin.conf only exists on control plane nodes
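
If editing /etc/fstab in place makes you nervous, the swap step can be previewed first. This sketch wraps the same sed expression used above in a function (comment_swap_lines is a name I made up) so it reads from standard input instead of touching the file. The sample fstab lines and device names are hypothetical.

```shell
# Apply the same sed expression to stdin instead of editing /etc/fstab in place.
comment_swap_lines() {
  sed -e '/swap/ s/^#*/#/'
}

# Hypothetical fstab content: only the line containing "swap" gains a leading '#'.
printf '%s\n' \
  '/dev/mapper/rl-root /    xfs  defaults 0 0' \
  '/dev/mapper/rl-swap none swap defaults 0 0' \
  | comment_swap_lines

# To preview the change against the real file without writing to it:
#   comment_swap_lines < /etc/fstab | diff /etc/fstab -
```

Because the expression replaces any run of leading # characters with a single one, running it twice leaves an already commented line unchanged.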

Step 3: install Cilium CLI binary

Control plane nodes
# install go
yum install -y go

# install Cilium CLI binary
CILIUM_CLI_VERSION=$(curl -s https://raw.githubusercontent.com/cilium/cilium-cli/main/stable.txt)
GOOS=$(go env GOOS)
GOARCH=$(go env GOARCH)
curl -L --remote-name-all https://github.com/cilium/cilium-cli/releases/download/${CILIUM_CLI_VERSION}/cilium-${GOOS}-${GOARCH}.tar.gz{,.sha256sum}
sha256sum --check cilium-${GOOS}-${GOARCH}.tar.gz.sha256sum
tar -C /usr/local/bin -xzvf cilium-${GOOS}-${GOARCH}.tar.gz
rm -rf cilium-${GOOS}-${GOARCH}.tar.gz{,.sha256sum}
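
Go is only installed above to work out the GOOS and GOARCH values. If you would rather not install it, a small mapping from uname output does the same job. This is a sketch: the function name cli_arch_from_machine is hypothetical, it assumes Linux nodes (as this guide does), and it only knows the two architectures listed.

```shell
# Map `uname -m` output to the architecture names used in Cilium CLI release files.
# Hypothetical helper covering only x86_64 and 64-bit ARM.
cli_arch_from_machine() {
  case "$1" in
    x86_64)          echo amd64 ;;
    aarch64 | arm64) echo arm64 ;;
    *)               echo "unsupported machine type: $1" >&2; return 1 ;;
  esac
}

# Usage on a node, in place of the two `go env` lines:
#   GOOS=linux
#   GOARCH=$(cli_arch_from_machine "$(uname -m)")
```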

Step 4: kubeadm init (start the cluster)

Change the variables below to suit your environment:

  • YOUR_API_ENDPOINT
  • POD_NETWORK_CIDR
  • SERVICE_CIDR
  • SERVICE_DNS_DOMAIN
FIRST CP node only
# ! only on the first control plane node !

# create the cluster using kubeadm

# set these variables:
YOUR_API_ENDPOINT="kubeapi.example.net" # a DNS record that resolves in your network
POD_NETWORK_CIDR="10.3.128.0/18" # IP addresses your k8s pods will use
SERVICE_CIDR="10.3.192.0/20" # IP addresses that your k8s services will use
SERVICE_DNS_DOMAIN="k8s.example.net" # svc 'x' becomes x.default.svc.k8s.example.net
# ends

# run kubeadm (it will not be instant)
kubeadm init \
	--upload-certs \
	--control-plane-endpoint "$YOUR_API_ENDPOINT:6443" \
	--pod-network-cidr "$POD_NETWORK_CIDR" \
	--service-cidr "$SERVICE_CIDR" \
	--service-dns-domain "$SERVICE_DNS_DOMAIN" \
	--cri-socket unix:///run/containerd/containerd.sock \
	--skip-phases=addon/kube-proxy # only if the kernel is v4.19.57, v5.1.16, v5.2.0 or greater

# verify the API server works (ignore pending coredns pods)
kubectl get pods -n kube-system
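
The pod and service ranges should not overlap with each other (or with the network your nodes live on). Before running kubeadm init with your own values, a quick check like the sketch below can catch a clash. The function names are my own and it handles IPv4 only.

```shell
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  IFS=. read -r o1 o2 o3 o4 <<EOF
$1
EOF
  echo $(( (o1 << 24) + (o2 << 16) + (o3 << 8) + o4 ))
}

# Print "first last" (as integers) for a CIDR such as 10.3.128.0/18.
cidr_range() {
  ip="${1%/*}"
  prefix="${1#*/}"
  size=$(( 1 << (32 - prefix) ))
  base=$(( $(ip_to_int "$ip") / size * size ))   # align to the network boundary
  echo "$base $(( base + size - 1 ))"
}

# Succeed if the two CIDRs share at least one address.
cidrs_overlap() {
  read -r a_first a_last <<EOF
$(cidr_range "$1")
EOF
  read -r b_first b_last <<EOF
$(cidr_range "$2")
EOF
  [ "$a_first" -le "$b_last" ] && [ "$b_first" -le "$a_last" ]
}

# The example values from this guide sit next to each other without overlapping.
if cidrs_overlap "10.3.128.0/18" "10.3.192.0/20"; then
  echo "pod and service CIDRs overlap: pick different ranges"
else
  echo "pod and service CIDRs do not overlap"
fi
```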

Note down the two join commands that kubeadm prints: one for additional control plane nodes and one for worker nodes.

Step 5: join the other control plane node

OTHER CP nodes only
# ! only on the OTHER control plane node !

# join to the cluster using the join command `kubeadm init` gave us previously
kubeadm join kubeapi.example.net:6443 --token xxxxxx.xxxxxxxxxxxxxxxxxxxx \
      --discovery-token-ca-cert-hash sha256:xxxxxxxxxxxxxxxxxxxx \
      --control-plane --certificate-key xxxxxxxxxxxxxxxxxxxx
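
If you lost the join commands or waited too long, they can be regenerated: by default the bootstrap token lasts 24 hours and the uploaded certificates only two. The sketch below assembles a control plane join command from fresh values. The helper name control_plane_join_cmd is hypothetical, and the tail -n 1 in the usage comments assumes the certificate key is the last line kubeadm prints.

```shell
# Build a control-plane join command from a worker join command and a certificate key.
# Hypothetical helper: it only joins two strings, kubeadm does the real work.
control_plane_join_cmd() {
  printf '%s --control-plane --certificate-key %s\n' "$1" "$2"
}

# On the first control plane node, fresh values could be generated with:
#   JOIN_CMD=$(kubeadm token create --print-join-command)
#   CERT_KEY=$(kubeadm init phase upload-certs --upload-certs | tail -n 1)
#   control_plane_join_cmd "$JOIN_CMD" "$CERT_KEY"
```

The output of kubeadm token create --print-join-command on its own is the worker join command used in the next step.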

Step 6: join worker nodes and label them

Worker nodes
# perform the join
kubeadm join kubeapi.example.net:6443 --token xxxxxx.xxxxxxxxxxxxxxx \
      --discovery-token-ca-cert-hash sha256:xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Any control plane node
# label the nodes as workers
kubectl label node worker1.example.net node-role.kubernetes.io/worker=worker
kubectl label node worker2.example.net node-role.kubernetes.io/worker=worker

After a few seconds, all nodes should appear in the list. They will stay NotReady until Cilium is installed in Step 8. Check for yourself:

Control plane nodes
# check status of all nodes. 'NotReady' is normal as we haven't installed Cilium yet!
kubectl get nodes -o wide

Step 7: verify the API survives a reboot

Save yourself the headache later. Do a reboot to check everything comes up.

Control plane nodes
# reboot
reboot
Control plane nodes
# check
export KUBECONFIG=/etc/kubernetes/admin.conf
kubectl get pods -n kube-system

If you get an error like Unable to connect to the server: EOF, work through the checks below.

Troubleshooting kube-apiserver issues…
  • is the firewall disabled?
  • is swap disabled? (free | grep Swap)
  • is the kubelet service running?
  • anything obvious in journalctl -xeu kubelet?
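
The checklist above can be rolled into a few commands to run on a misbehaving control plane node. This is a rough sketch: check is a tiny helper I made up, and it only reports whether each command exited successfully.

```shell
# Run a command quietly and report whether it succeeded.
check() {
  label="$1"; shift
  if "$@" >/dev/null 2>&1; then
    echo "ok: $label"
  else
    echo "FAIL: $label"
  fi
}

check "firewalld is not running" sh -c '! systemctl is-active --quiet firewalld'
check "swap is off"              sh -c '[ -z "$(swapon --show --noheadings)" ]'
check "kubelet is running"       systemctl is-active --quiet kubelet

# If kubelet is failing, its most recent log lines are usually the quickest clue:
#   journalctl -u kubelet --no-pager -n 50
```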

Step 8: install Cilium (container networking)

Control plane nodes
# install Helm (grabs the latest version from their official repo)
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
chmod 700 get_helm.sh
./get_helm.sh

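Time-waster warning: if you want to specify the podCIDR range, the Helm method you may encounter on the Internet does not seem to apply it. It typically looks like the hidden section below. Note that it passes clusterPoolIPv4PodCIDRList as a top-level value, whereas the values file further down nests it under ipam.operator, which may be why it is ignored.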

bad stuff…
helm install cilium cilium/cilium \
  --namespace kube-system \
  --set k8sServiceHost=kubeapi.example.net \
  --set k8sServicePort=6443 \
  --set ipv4NativeRoutingCIDR=10.3.128.0/18 \
  --set clusterPoolIPv4PodCIDRList="10.3.128.0/18" \ # doesn't apply
  --set ipam.operator.clusterPoolIPv4MaskSize=24 \
  --set ipv4.enabled=true \
  --set kubeProxyReplacement=strict \

Don’t do that. Instead, make a file called cilium.yaml and use the cilium binary we installed in Step 3 to apply it (changing the pod CIDR, API endpoint and cluster domain to match your values from Step 4):

FIRST CP node only
version:
  1.15.4
namespace:
  kube-system
cluster:
  id: 0
  name: kubernetes
encryption:
  nodeEncryption: false
ipv6:
  enabled: false
ipam:
  mode: cluster-pool
  operator:
    clusterPoolIPv4MaskSize: 24
    clusterPoolIPv4PodCIDRList:
      - "10.3.128.0/18"
k8sServiceHost: kubeapi.example.net
k8sServicePort: 6443
kubeProxyReplacement: strict
ingressController:
  enabled: true
  default: true
operator:
  replicas: 1
serviceAccounts:
  cilium:
    name: cilium
  operator:
    name: cilium-operator
tunnel: vxlan
hubble:
  enabled: true
  ui:
    enabled: true # you should set up an ingress with an ingress controller later
  metrics:
    enabled:
    - dns:query;ignoreAAAA
    - drop
    - tcp
    - flow
    - port-distribution
    - icmp
    - http
    enableOpenMetrics: true
  peerService:
    clusterDomain: k8s.example.net
  relay:
    enabled: true
  tls:
    enabled: true
envoy:
  enabled: true
prometheus:
  enabled: true
~/cilium.yaml

Now apply it:

FIRST CP node only
# ! only on the first control plane node !

# install Cilium with cilium.yaml values
cilium install --helm-values cilium.yaml
If coredns pods don’t come up…
FIRST CP node only
# ! only on the first control plane node !

# restart every pod that is not using host networking (this includes coredns) so Cilium manages them
kubectl get pods --all-namespaces -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,HOSTNETWORK:.spec.hostNetwork --no-headers=true | grep '<none>' | awk '{print "-n "$1" "$2}' | xargs -L 1 -r kubectl delete pod

Verify things look good:

Any control plane node
# check all kube-system pods are healthy
kubectl get pods -n kube-system

# check Cilium is using the right IP range for pods:
kubectl get cm -n kube-system cilium-config -o yaml | grep -i cluster
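
With Cilium running, the nodes should now flip from NotReady to Ready. To spot any stragglers, the node list can be filtered as in this sketch. The function name not_ready_nodes is hypothetical, and it assumes the default column layout of kubectl get nodes, where STATUS is the second column.

```shell
# Read `kubectl get nodes --no-headers` output on stdin and print
# the names of nodes whose STATUS column does not start with "Ready".
not_ready_nodes() {
  awk '$2 !~ /^Ready/ { print $1 }'
}

# Usage on any control plane node:
#   kubectl get nodes --no-headers | not_ready_nodes
#
# Or simply block until every node reports Ready (gives up after five minutes):
#   kubectl wait --for=condition=Ready nodes --all --timeout=300s
```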

Step 9: create a real pod

The moment of truth!

Any control plane node
# deploy an Alpine Linux image and drop to a shell
kubectl run -it --image=alpine:3.6 alpine -- sh

# ping an IP on the Internet (below is my blog)
ping 81.187.86.89

# resolve something
nslookup blog.abctaylor.com

# exit
exit

If you check all running pods (kubectl get pods -o wide --all-namespaces) you should see something like this (with different random IDs of course):

I’d recommend doing another test reboot to make sure your new cluster survives!

Step 10: enable Hubble (Cilium monitoring)

Earlier, in cilium.yaml, we added some config for Hubble. It created a pod called hubble-ui that runs a UI server, currently somewhat insecurely due to a bug (I suspect with TLS in Cilium and the Hubble UI system). Let’s change the service from ClusterIP to NodePort, get the port and then open it in a browser:

Any control plane node
# change hubble-ui's service type from ClusterIP to NodePort
kubectl patch svc hubble-ui -n kube-system -p '{"spec": {"type": "NodePort"}}'

# get the port allocated
kubectl get svc -n kube-system | grep hubble-ui
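
The PORT(S) column in that output looks like 80:30401/TCP, where the number after the colon is the node port. If you want just the number (for a script, say), a sketch like this pulls it out. The function name node_port_from_ports is hypothetical.

```shell
# Extract the node port from a PORT(S) value such as "80:30401/TCP".
node_port_from_ports() {
  port="${1#*:}"       # drop everything up to the colon -> "30401/TCP"
  echo "${port%%/*}"   # drop the protocol suffix        -> "30401"
}

node_port_from_ports "80:30401/TCP"

# kubectl can also print the node port directly:
#   kubectl get svc hubble-ui -n kube-system -o jsonpath='{.spec.ports[0].nodePort}'
```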

For example, my cluster allocated port 30401. Open http://<any-node-IP>:30401 in a browser:

Thanks for reading!

This post took about two dozen fresh cluster builds to refine every little bit. Please contact me if you have any feedback.