Highly available Kubernetes with batteries for small business

Kindie (Kubernetes Individual) is an opinionated Kubernetes cluster setup for individuals and small businesses. Batteries are included so that you can hit the ground running and add production workloads in no time.

Target audience

Sysadmins, DevOps engineers, and cloud engineers with Linux and Kubernetes experience who want to build a production Kubernetes cluster with bells and whistles focused on web workloads. You should be able to have the cluster ready in a few hours. If you don’t understand some of the information here, please comment below or research it on the internet. This guide is not meant for complete beginners, but we try to keep it as accessible as possible without going into too much detail.

Features

Disclaimer

Feel free to change the setup as you wish, but then you’re on your own. Even though we consider this production ready for ourselves, it might not be for you, so adjust and test this setup further until you are satisfied. We deliberately use the root user instead of sudo to save time, and because we know what we are doing (most of the time).

Hardware specifications

Small business Kubernetes cluster

  • a router with uplink
  • Synology DS918+ with 16 GB of memory and 4 TB of storage capacity
  • UPS for data safety
  • 2 NUCs with 100 GB of disk storage and 16 GB of memory each
  • access to manage a domain (example.dev)
  • Ubuntu Server 20.04 ISO downloaded and written to a USB stick to install the NUCs

Architecture

The diagrams below give you a bird’s-eye view of what you’re about to build.

Network

(Network architecture diagram)

The core router serves the internal network 10.0.0.0/16. This is in line with the default networks in public cloud services like AWS VPCs. There’s plenty of room to expand your cluster, and you will probably never use all the allocatable addresses here anyway. Within this range we use the following static addresses:

  • 10.0.0.1 => Gateway address on the router
  • 10.0.0.2 => Synology
  • 10.0.1.0 => Floating IP assigned to the keepalived master. This address is highly available and therefore used as the cluster endpoint for the Kubernetes API server and for HTTP(S) ingress into the cluster
  • 10.0.1.1 => node1 (this is a Virtual Machine (VM) running in Synology)
  • 10.0.1.2 => node2
  • 10.0.1.3 => node3
  • 10.0.200.0-10.0.200.255 => range reserved for internal load balancers (MetalLB)

There’s also an optional UPS supporting the core of the system: the router and the Synology. The Synology also exposes an NFS share so that the nodes can use it as central storage.

Kubernetes

(Kubernetes architecture diagram)

The diagram above merely shows that there are 3 master nodes and N worker nodes, where N is greater than or equal to zero. Each node runs an ingress controller for HA. In this setup we untaint the master nodes so that regular workloads can be scheduled on them, so treat them as worker nodes too.

Namespaces

(Namespaces diagram)

The included batteries are split across 2 namespaces:

  • sys: internal misc services needed to support apps; sort of like shared infra services
  • monitoring: everything related to monitoring

Preparations

  • Configure the router’s internal network as 10.0.0.0/16 and create the port forward rules as described in the Network architecture diagram.
  • Create a DNS record of type A: cluster-endpoint.sys.example.dev => 10.0.1.0
  • Create a wildcard DNS record of type A (or CNAME if you prefer): *.app.example.dev => YOUR_PUBLIC_IP
  • Create another wildcard DNS record of type A (or CNAME if you prefer): *.sys.example.dev => YOUR_PUBLIC_IP
  • Set up your Synology and give it the address 10.0.0.2
  • Configure Synology to allow NFS mounts from 10.0.1.0/24

(Synology NFS permission settings screenshot)

  • Create a VM on the Synology called node1 with 7 GB RAM and a 100 GB disk, and install Ubuntu Server:
    • set a manual IP of 10.0.1.1/16 (see the netplan sketch after this list)
    • set the hostname to node1
    • install OpenSSH
    • create the user ops
  • Install all your other physical/dedicated nodes the same way (obviously use 10.0.1.2/16 for node2, 10.0.1.3/16 for node3, etc…)
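On Ubuntu Server 20.04 a static IP is configured through netplan. Below is a minimal sketch for node1; the file name, the interface name ens3, and the router at 10.0.0.1 acting as DNS server are assumptions, so adjust them to your environment (find your interface with ip a):

cat > /etc/netplan/00-static.yaml <<EOF
network:
  version: 2
  ethernets:
    ens3:
      addresses:
        - 10.0.1.1/16
      gateway4: 10.0.0.1
      nameservers:
        addresses: [10.0.0.1]
EOF
netplan apply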

Kubernetes Cluster

At this point you have 3 nodes running: node1, node2 and node3. Because the first 3 nodes are master nodes, we prepare them all with keepalived and kubeadm. Log in to each node over SSH using the ops username and the password you set during installation. After you log in, switch to the root user with sudo su and enter your password again.

Keepalived

apt install -y keepalived

Create a file /etc/keepalived/keepalived.conf with the content:

vrrp_instance VI_1 {
    state MASTER
    interface ens3
    virtual_router_id 101
    priority 100
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass RANDOM_STRING_HERE
    }
    virtual_ipaddress {
        10.0.1.0
    }
}

Replace RANDOM_STRING_HERE with a strong password of your choice if you want (since this is an internal network, it is not a very big deal).

It is, however, necessary to set the correct interface name. You can find it with ip a.

After that we can wrap up with:

systemctl enable keepalived
systemctl start keepalived

We use the same keepalived.conf on all master nodes, so the active master is effectively selected at random. Feel free to adjust the priority per node if you want to influence which node is preferred.
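To see which node currently holds the floating IP, you can grep for it on the VRRP interface (ens3 is again an assumption; use your own interface name):

ip -4 addr show ens3 | grep '10.0.1.0'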

Kubernetes install

We will use the official installation guide to install Kubernetes:

Container Runtime

cat > /etc/modules-load.d/containerd.conf <<EOF
overlay
br_netfilter
EOF

modprobe overlay
modprobe br_netfilter

# Setup required sysctl params, these persist across reboots.
cat > /etc/sysctl.d/99-kubernetes-cri.conf <<EOF
net.bridge.bridge-nf-call-iptables  = 1
net.ipv4.ip_forward                 = 1
net.bridge.bridge-nf-call-ip6tables = 1
EOF

sysctl --system

curl -fsSL https://download.docker.com/linux/ubuntu/gpg | apt-key add -
add-apt-repository \
    "deb [arch=amd64] https://download.docker.com/linux/ubuntu \
    $(lsb_release -cs) \
    stable"
apt-get update && apt-get install -y containerd.io
mkdir -p /etc/containerd
containerd config default > /etc/containerd/config.toml
systemctl restart containerd

Kubeadm, kubelet, kubectl

apt-get update && apt-get install -y apt-transport-https curl
curl -s https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -
cat <<EOF | tee /etc/apt/sources.list.d/kubernetes.list
deb https://apt.kubernetes.io/ kubernetes-xenial main
EOF
apt-get update
apt-get install -y kubelet kubeadm kubectl
apt-mark hold kubelet kubeadm kubectl

NFS utils

Because we want to be able to mount NFS shares as PVCs.

apt install -y nfs-common
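As a quick sanity check (assuming the Synology export is /volume1/kubernetes, the same path we point the provisioner at later), verify from a node that the share is visible and mountable:

showmount -e 10.0.0.2
mount -t nfs 10.0.0.2:/volume1/kubernetes /mnt
umount /mnt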

node1

To install our first master node on node1, we first turn off keepalived on node2 and node3:

ssh ops@10.0.1.2 'systemctl stop keepalived'
ssh ops@10.0.1.3 'systemctl stop keepalived'

Now on node1 you can confirm it has the IP 10.0.1.0:

ip a | grep '10.0.1.0'

And confirm your DNS record is set correctly:

host cluster-endpoint.sys.example.dev
cluster-endpoint.sys.example.dev has address 10.0.1.0

After that we are ready to continue:

kubeadm init --apiserver-advertise-address=$(hostname -I | cut -d " " -f1) --control-plane-endpoint=cluster-endpoint.sys.example.dev --upload-certs

Replace the endpoint address with your own cluster endpoint domain.

After a while you will be greeted with a message similar to:

Your Kubernetes control-plane has initialized successfully!

To start using your cluster, you need to run the following as a regular user:

  mkdir -p $HOME/.kube
  sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
  sudo chown $(id -u):$(id -g) $HOME/.kube/config

You should now deploy a pod network to the cluster.
Run "kubectl apply -f [podnetwork].yaml" with one of the options listed at:
  https://kubernetes.io/docs/concepts/cluster-administration/addons/

You can now join any number of the control-plane node running the following command on each as root:

  kubeadm join cluster-endpoint.sys.example.dev:6443 --token XXXX.XXXX \
    --discovery-token-ca-cert-hash sha256:XXXX \
    --control-plane --certificate-key XXXX

Please note that the certificate-key gives access to cluster sensitive data, keep it secret!
As a safeguard, uploaded-certs will be deleted in two hours; If necessary, you can use
"kubeadm init phase upload-certs --upload-certs" to reload certs afterward.

Then you can join any number of worker nodes by running the following on each as root:

kubeadm join cluster-endpoint.sys.example.dev:6443 --token XXXX.XXXX \
    --discovery-token-ca-cert-hash sha256:XXXX

node2 and node3

To install node2 and node3, log in to each node as ops, switch to root, and then execute:

  kubeadm join cluster-endpoint.sys.example.dev:6443 --token XXXX.XXXX \
    --discovery-token-ca-cert-hash sha256:XXXX \
    --control-plane --certificate-key XXXX

(Obviously, replace the values)

Join workers later

On a master node:

kubeadm token create --print-join-command

Then copy the join command and execute it on the new worker node.

Join masters later

On a master node:

kubeadm init phase upload-certs --upload-certs
# copy certificate key
kubeadm token create --print-join-command --certificate-key $certificate_key

Then copy the join command and execute it on the new master node.

Confirm nodes

On node1 as root execute:

export KUBECONFIG=/etc/kubernetes/admin.conf
kubectl get nodes

It should output something similar to:

NAME    STATUS   ROLES    AGE     VERSION
node1   Ready    master   3d2h    v1.18.3
node2   Ready    master   3d2h    v1.18.3
node3   Ready    master   5h46m   v1.18.3

Let’s untaint the master nodes:

kubectl taint nodes --all node-role.kubernetes.io/master-
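A quick way to confirm the taint is gone on every node:

kubectl describe nodes | grep -i taints
# every node should report: Taints: <none>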

CNI (network)

If you run kubectl get pods -A, you will see that coredns is not starting up correctly:

root@node1:/home/ops# kubectl get pods -A
NAMESPACE     NAME                            READY   STATUS    RESTARTS   AGE
kube-system   coredns-66bff467f8-2bqht        0/1     Pending   0          7m15s
kube-system   coredns-66bff467f8-l7pbt        0/1     Pending   0          7m15s
....

To fix that we need to install a CNI plugin; we choose Calico:

kubectl apply -f https://docs.projectcalico.org/v3.14/manifests/calico.yaml

After a while coredns is running:

root@node1:/home/ops# kubectl get pods -A
NAMESPACE     NAME                                       READY   STATUS    RESTARTS   AGE
kube-system   calico-kube-controllers-789f6df884-b8tsg   1/1     Running   0          4m58s
kube-system   calico-node-9fgqj                          1/1     Running   0          4m59s
kube-system   coredns-66bff467f8-2bqht                   1/1     Running   0          13m
kube-system   coredns-66bff467f8-l7pbt                   1/1     Running   0          13m

Smoke test

To smoke test, we can run a one-off pod:

root@node1:/home/ops# kubectl run --rm=true -i --tty busybox --image=busybox --restart=Never -- ps
If you don't see a command prompt, try pressing enter.
Error attaching, falling back to logs: unable to upgrade connection: container busybox not found in pod busybox_default
PID   USER     TIME  COMMAND
    1 root      0:00 ps
pod "busybox" deleted

If you do not get the output of ps, something is broken.

Highly available test

So now we have 3 master nodes running in our cluster, and we can test the high availability of the API server. To do that, first bring keepalived back up on node2 and node3:

ssh ops@10.0.1.2 'systemctl start keepalived'
ssh ops@10.0.1.3 'systemctl start keepalived'

You will notice that node1 currently owns the master IP. Let’s copy the kubeconfig from node1 to your local machine (assuming you have no existing ~/.kube/config yet):

ssh ops@10.0.1.1 'sudo cat /etc/kubernetes/admin.conf' > ~/.kube/config
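If you already manage other clusters from this machine, a safer approach (a small sketch; the file name is just an example) is to keep the new kubeconfig in its own file and point KUBECONFIG at it:

ssh ops@10.0.1.1 'sudo cat /etc/kubernetes/admin.conf' > ~/.kube/config-kindie
export KUBECONFIG=~/.kube/config-kindie
kubectl get nodes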

Now you should be able to execute kubectl commands from your local machine. Do for instance:

kubectl get nodes
NAME    STATUS   ROLES    AGE     VERSION
node1   Ready    master   3d2h    v1.18.3
node2   Ready    master   3d2h    v1.18.3
node3   Ready    master   5h46m   v1.18.3

Now if you reboot node1, the master IP is automatically taken over by another node, so kubectl commands keep working while node1 is rebooting. As an exercise, find out which failover node took over the master IP.
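One rough way to observe the failover (a sketch, not a proper availability test) is to poll the API server from your local machine while node1 reboots; expect at most a few seconds of failed requests while keepalived moves the floating IP:

while true; do
  kubectl get nodes --request-timeout=2s > /dev/null 2>&1 && echo "$(date +%T) API up" || echo "$(date +%T) API down"
  sleep 2
done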

Batteries

Now that we have a Kubernetes cluster running with 3 masters and a highly available endpoint for the API server, we can continue setting up the services. From now on you can interact with the Kubernetes cluster from your local machine.

Namespace: sys

kubectl create namespace sys

MetalLB

helm repo add stable https://kubernetes-charts.storage.googleapis.com/
helm repo update

cat > metallb-config.yaml <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  namespace: sys
  name: metallb-config
data:
  config: |
    address-pools:
    - name: default
      protocol: layer2
      addresses:
      - 10.0.200.0-10.0.200.255
EOF
kubectl apply -f metallb-config.yaml
helm install metallb stable/metallb --namespace sys

See the metallb helm chart for full configuration options.
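To verify that MetalLB hands out addresses from the reserved pool, you can expose a throwaway deployment as a LoadBalancer service (the names below are arbitrary) and check that its EXTERNAL-IP falls inside 10.0.200.0-10.0.200.255:

kubectl create deployment lb-test --image=nginx
kubectl expose deployment lb-test --port=80 --type=LoadBalancer
kubectl get svc lb-test
# clean up when done
kubectl delete service,deployment lb-test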

Nginx-ingress


cat > nginx-ingress-values.yaml <<EOF
controller:
  kind: DaemonSet
  daemonset:
    useHostPort: true
    hostPorts:
      http: 30080
      https: 30443
  service:
    enabled: false
  metrics:
    enabled: true
    service:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "10254"

defaultBackend:
  image:
    repository: cinaq/default-backend
    tag: 1.2
  replicaCount: 2
EOF
helm install nginx-ingress stable/nginx-ingress --namespace sys -f nginx-ingress-values.yaml

See the nginx-ingress helm chart for full configuration options.
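Because the controllers bind hostPorts 30080/30443 on every node, the router’s port forward rules (from the preparations) should send public ports 80 and 443 to 30080 and 30443 on the floating IP 10.0.1.0. A quick internal check that the ingress answers (an unknown host should return a 404 from the default backend):

curl -i -H "Host: does-not-exist.app.example.dev" http://10.0.1.0:30080/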

Cert-manager (Letsencrypt)

helm repo add jetstack https://charts.jetstack.io
helm repo update

# updated on 2020-08-12 ref: https://github.com/jetstack/cert-manager/issues/2752
helm install cert-manager jetstack/cert-manager --namespace sys --version v0.16.1 --set installCRDs=true

cat > issuer_letsencrypt.yaml <<EOF
apiVersion: cert-manager.io/v1alpha2
kind: ClusterIssuer
metadata:
  name: letsencrypt
  namespace: sys
spec:
  acme:
    # The ACME server URL
    server: https://acme-v02.api.letsencrypt.org/directory
    # Email address used for ACME registration
    email: letsencrypt@example.dev
    # Name of a secret used to store the ACME account private key
    privateKeySecretRef:
      name: letsencrypt
    # Enable the HTTP-01 challenge provider
    solvers:
    - http01:
        ingress:
          class:  nginx
EOF
kubectl create -f issuer_letsencrypt.yaml

See the cert-manager helm chart for full configuration options.
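Once cert-manager has registered the issuer with the ACME server, it should report Ready. You can check it with:

kubectl get clusterissuer letsencrypt
# the READY column should show True once ACME registration succeeds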

NFS client provisioner

helm install nfs-storage stable/nfs-client-provisioner --namespace sys --set nfs.server=10.0.0.2 --set nfs.path=/volume1/kubernetes
kubectl patch storageclass nfs-client -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'

See the nfs-client-provisioner helm chart for full configuration options.
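A small throwaway PVC (hypothetical names, 1Mi just as an example) is an easy way to confirm that dynamic provisioning against the Synology works; it should reach the Bound state within seconds:

cat > test-pvc.yaml <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nfs-test
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: nfs-client
  resources:
    requests:
      storage: 1Mi
EOF
kubectl apply -f test-pvc.yaml
kubectl get pvc nfs-test
# clean up
kubectl delete -f test-pvc.yaml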

Namespace: monitoring

kubectl create namespace monitoring

Prometheus

cat > prometheus-values.yaml <<EOF
alertmanager:
  replicaCount: 2
pushgateway:
  replicaCount: 2
server:
  replicaCount: 2
  statefulSet:
    enabled: true
EOF
helm install prometheus stable/prometheus -n monitoring -f prometheus-values.yaml

See the prometheus helm chart for full configuration options.
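Prometheus itself is not exposed through the ingress in this setup. For a quick look at its UI (assuming the chart’s default service name prometheus-server on port 80, which matches the datasource URL we configure in Grafana later), you can port-forward it and browse http://localhost:9090:

kubectl -n monitoring port-forward svc/prometheus-server 9090:80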

Loki

helm repo add loki https://grafana.github.io/loki/charts
helm repo update

helm install loki loki/loki-stack -n monitoring

See the loki-stack helm chart for full configuration options.

Grafana

cat > grafana-values.yaml <<EOF
persistence:
  enabled: true
replicas: 2
EOF
helm install grafana stable/grafana -n monitoring -f grafana-values.yaml

After the helm install, save the Grafana admin password for later.
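The stable/grafana chart prints the admin password in its install notes; if you need it again later, it is stored in a secret named after the release (grafana here) and can be decoded with:

kubectl get secret --namespace monitoring grafana -o jsonpath="{.data.admin-password}" | base64 --decode ; echo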

See the grafana helm chart for full configuration options.

Expose Grafana via an Ingress:

cat > grafana-resources.yaml <<EOF
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: grafana-ingress
  namespace: monitoring
  annotations:
    cert-manager.io/cluster-issuer: "letsencrypt"
    nginx.ingress.kubernetes.io/proxy-body-size: 1m
    nginx.ingress.kubernetes.io/server-snippet: |
      # IP white-listing
      allow 192.168.1.0/24;
      allow YOUR_PUBLIC_IP;
      deny all;
spec:
  tls:
  - hosts:
    - grafana.sys.example.dev
    secretName: dev-grafana-sys-grafana-tls
  rules:
  - host: grafana.sys.example.dev
    http:
      paths:
      - path: /
        backend:
          serviceName: grafana
          servicePort: 80
EOF
kubectl apply -f grafana-resources.yaml

Now you should be able to visit Grafana via the public URL http://grafana.sys.example.dev. Notice that you are redirected automatically to HTTPS and that the certificate is signed by Let’s Encrypt.

Log in with the username admin and the Grafana password you saved earlier.

After logging in, configure the 2 datasources:

  • loki: http://loki:3100
  • prometheus: http://prometheus-server

Then import the dashboards:

Results

Your deployment is now complete. It should look similar to this:

$ kubectl get pods -A -o wide
NAMESPACE     NAME                                                 READY   STATUS    RESTARTS   AGE     IP                NODE    NOMINATED NODE   READINESS GATES
kube-system   calico-kube-controllers-789f6df884-g2gm4             1/1     Running   2          3d2h    192.168.166.149   node1   <none>           <none>
kube-system   calico-node-8nv5r                                    1/1     Running   1          6h27m   10.0.1.3          node3   <none>           <none>
kube-system   calico-node-srxdd                                    1/1     Running   3          3d2h    10.0.1.1          node1   <none>           <none>
kube-system   calico-node-tslz8                                    1/1     Running   6          3d2h    10.0.1.2          node2   <none>           <none>
kube-system   coredns-66bff467f8-6dsk5                             1/1     Running   1          5h24m   192.168.135.30    node3   <none>           <none>
kube-system   coredns-66bff467f8-z2b9h                             1/1     Running   2          3d2h    192.168.166.148   node1   <none>           <none>
kube-system   etcd-node1                                           1/1     Running   3          3d2h    10.0.1.1          node1   <none>           <none>
kube-system   etcd-node2                                           1/1     Running   7          3d2h    10.0.1.2          node2   <none>           <none>
kube-system   etcd-node3                                           1/1     Running   1          6h27m   10.0.1.3          node3   <none>           <none>
kube-system   kube-apiserver-node1                                 1/1     Running   8          3d2h    10.0.1.1          node1   <none>           <none>
kube-system   kube-apiserver-node2                                 1/1     Running   9          3d2h    10.0.1.2          node2   <none>           <none>
kube-system   kube-apiserver-node3                                 1/1     Running   1          6h27m   10.0.1.3          node3   <none>           <none>
kube-system   kube-controller-manager-node1                        1/1     Running   35         3d2h    10.0.1.1          node1   <none>           <none>
kube-system   kube-controller-manager-node2                        1/1     Running   35         3d2h    10.0.1.2          node2   <none>           <none>
kube-system   kube-controller-manager-node3                        1/1     Running   4          6h27m   10.0.1.3          node3   <none>           <none>
kube-system   kube-proxy-cj42b                                     1/1     Running   5          3d2h    10.0.1.2          node2   <none>           <none>
kube-system   kube-proxy-nt7zn                                     1/1     Running   2          3d2h    10.0.1.1          node1   <none>           <none>
kube-system   kube-proxy-s8vgt                                     1/1     Running   1          6h27m   10.0.1.3          node3   <none>           <none>
kube-system   kube-scheduler-node1                                 1/1     Running   30         3d2h    10.0.1.1          node1   <none>           <none>
kube-system   kube-scheduler-node2                                 1/1     Running   33         3d2h    10.0.1.2          node2   <none>           <none>
kube-system   kube-scheduler-node3                                 1/1     Running   5          6h27m   10.0.1.3          node3   <none>           <none>
monitoring    grafana-74f7c48746-9dvxf                             1/1     Running   0          3h31m   192.168.104.53    node2   <none>           <none>
monitoring    grafana-74f7c48746-txwrv                             1/1     Running   0          3h30m   192.168.135.43    node3   <none>           <none>
monitoring    loki-0                                               1/1     Running   0          4h39m   192.168.104.47    node2   <none>           <none>
monitoring    loki-promtail-785qg                                  1/1     Running   4          3d1h    192.168.104.3     node2   <none>           <none>
monitoring    loki-promtail-8fnkw                                  1/1     Running   1          3d1h    192.168.166.151   node1   <none>           <none>
monitoring    loki-promtail-8vwpf                                  1/1     Running   1          6h27m   192.168.135.37    node3   <none>           <none>
monitoring    prometheus-alertmanager-6fcfd7bb84-mvm9k             2/2     Running   2          5h11m   192.168.135.33    node3   <none>           <none>
monitoring    prometheus-alertmanager-6fcfd7bb84-ndbhd             2/2     Running   0          3h27m   192.168.104.61    node2   <none>           <none>
monitoring    prometheus-kube-state-metrics-79f5b77cb8-4kh9x       1/1     Running   1          5h24m   192.168.135.27    node3   <none>           <none>
monitoring    prometheus-node-exporter-278sb                       1/1     Running   1          6h22m   10.0.1.3          node3   <none>           <none>
monitoring    prometheus-node-exporter-czrbw                       1/1     Running   4          3d      10.0.1.2          node2   <none>           <none>
monitoring    prometheus-node-exporter-xfw7s                       1/1     Running   1          3d      10.0.1.1          node1   <none>           <none>
monitoring    prometheus-pushgateway-5d85697467-88mp5              1/1     Running   0          3h27m   192.168.104.23    node2   <none>           <none>
monitoring    prometheus-pushgateway-5d85697467-hff9t              1/1     Running   1          5h24m   192.168.135.38    node3   <none>           <none>
monitoring    prometheus-server-0                                  2/2     Running   0          3h21m   192.168.104.19    node2   <none>           <none>
monitoring    prometheus-server-1                                  2/2     Running   0          3h20m   192.168.135.44    node3   <none>           <none>
sqirly        postgresql-545d95dcb9-npnbj                          1/1     Running   0          5h12m   192.168.166.153   node1   <none>           <none>
sqirly        sqirly-5d674b8d5b-ktbsw                              1/1     Running   1          6h16m   192.168.135.41    node3   <none>           <none>
sqirly        sqirly-5d674b8d5b-mnzzv                              1/1     Running   5          5h24m   192.168.166.152   node1   <none>           <none>
sys           cert-manager-678bc78d5d-gmb86                        1/1     Running   1          5h24m   192.168.135.26    node3   <none>           <none>
sys           cert-manager-cainjector-77bc84779-bq9xx              1/1     Running   4          5h24m   192.168.135.36    node3   <none>           <none>
sys           cert-manager-webhook-5b5485577f-5wz6c                1/1     Running   1          5h24m   192.168.135.40    node3   <none>           <none>
sys           distcc-deployment-5d6fb547d7-pjhd7                   1/1     Running   1          5h24m   192.168.135.42    node3   <none>           <none>
sys           metallb-controller-9f46bdfcb-zbtsw                   1/1     Running   1          6h12m   192.168.135.39    node3   <none>           <none>
sys           metallb-speaker-4bpqd                                1/1     Running   1          3d2h    10.0.1.1          node1   <none>           <none>
sys           metallb-speaker-t2jpt                                1/1     Running   4          3d2h    10.0.1.2          node2   <none>           <none>
sys           metallb-speaker-w4q2s                                1/1     Running   1          6h27m   10.0.1.3          node3   <none>           <none>
sys           minio-6df88b9995-x8qpt                               1/1     Running   1          5h12m   192.168.135.31    node3   <none>           <none>
sys           nfs-storage-nfs-client-provisioner-8fcb6b749-nskl4   1/1     Running   1          5h12m   192.168.135.28    node3   <none>           <none>
sys           nginx-ingress-controller-jxhtn                       1/1     Running   0          3h57m   192.168.166.159   node1   <none>           <none>
sys           nginx-ingress-controller-kk6sn                       1/1     Running   0          3h57m   192.168.104.20    node2   <none>           <none>
sys           nginx-ingress-controller-s4ndr                       1/1     Running   1          3h57m   192.168.135.32    node3   <none>           <none>
sys           nginx-ingress-default-backend-5c667c8479-hn769       1/1     Running   1          5h24m   192.168.135.29    node3   <none>           <none>
sys           nginx-ingress-default-backend-5c667c8479-zhnl8       1/1     Running   0          4h53m   192.168.166.154   node1   <none>           <none>

(Grafana K8s Cluster summary dashboard screenshot)

(Grafana Nginx-Ingress Controller dashboard screenshot)

Conclusions

This setup is not truly highly available: the whole cluster depends on the Synology for data storage. You could improve this further by replacing the centralized NAS with a distributed storage solution. Besides that, the cluster is very solid and scalable. When you reboot any of the NUCs, your applications experience almost zero downtime. In case of a node outage, requests in flight on the broken node are aborted; this is also true if the broken node happens to hold the master IP, but in that case keepalived fails over automatically to another master node.