
Fault Tolerance Solutions for Multi-Cluster Kubernetes: Part 1

Blueprints for running multi-cluster Kubernetes architectures for stateless and stateful workloads. Active-Active User Traffic Routing.

Supervisor: Jack Munday

My summer at CERN OpenLab

This summer I had the opportunity to work as a summer student at CERN OpenLab, where I was part of the IT-CD team. My project focused on implementing fault tolerance solutions for multi-cluster Kubernetes deployments, specifically using Cilium Cluster Mesh and various database technologies. The goal was to ensure high availability and redundancy for user traffic and databases across multiple clusters.

I had a great summer at CERN: I learnt a lot both professionally and personally, and I had the chance to meet many interesting people. Without further ado, let’s get into it!


OpenLab Summer Students class of 2025.

Introduction

High Availability is the design of systems and services to ensure they remain operational and accessible with minimal downtime, even in the face of server failures. It is part of the broader field of Business Continuity, a set of strategies to keep business operations running during and after disruptions. The four levels of Business Continuity are described in the diagram below, ranging from Cold, with hours or even days of downtime, to Active-Active, with practically no downtime.


Levels of Business Continuity. Inspired by CERN’s BC/DR Team.

In these two articles, we will focus on the Active-Active and Active-Passive levels. Active-Active is a configuration where multiple clusters run the same application simultaneously, sharing the load and providing redundancy. Active-Passive is a configuration where one cluster is active and serving requests, while another cluster is on standby, ready to take over in case of failure. In the context of these articles, Active-Passive only provides redundancy for databases, which are continuously replicated across clusters. In the diagram above, Active-Passive covers all the levels before Active-Active. The second article, covering Active-Passive, is published on this same blog.

Active-Active Setup for Applications

Cilium is a Kubernetes CNI (Container Network Interface) plugin that provides advanced networking and security features for Kubernetes clusters. It is designed to enhance the performance, scalability, and security of containerized applications. Cilium uses eBPF (extended Berkeley Packet Filter) technology to implement networking and security policies at the kernel level, allowing for efficient and flexible network management.

Cilium Cluster Mesh is a feature that allows multiple Kubernetes clusters to be connected and managed as a single logical network. This enables seamless communication between pods across different clusters. Cluster Mesh is particularly useful for multi-cluster deployments, where applications need to span multiple clusters for high availability. This differs from a service mesh, which is a layer that handles communication between services within a single cluster; Cluster Mesh instead focuses on communication between clusters.

The benefit of Cilium Cluster Mesh is that it enables load balancing and failover across clusters: user traffic is distributed across multiple clusters, and if one cluster fails, the others continue to serve requests. Furthermore, with Cilium Cluster Mesh it is seamless to label services as global, which allows them to be discovered and accessed from any cluster in the mesh. This also makes it easy to group pods, as services with the same name and namespace in different clusters load balance between each other.

This chapter will cover the setup of Cilium Cluster Mesh for Active-Active user traffic. The setup involves configuring multiple pods in different clusters, and then load balancing the traffic across these pods. The goal is to ensure that user requests are distributed evenly across the clusters, providing redundancy and high availability even if one of the clusters fails. The diagram below illustrates the architecture used for this setup, with the API- and ML-services running in different clusters, and the user traffic being load balanced across them pairwise.


Cilium Cluster Mesh architecture.

Cilium Cluster Mesh Basic Installation

Installing Cilium and Cilium Cluster Mesh is straightforward with the Cilium CLI, and you can follow this guide by Cilium to get it installed.
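For reference, a minimal sketch of the CLI route, assuming kubectl contexts named cilium-001 and cilium-002 for the two clusters (cluster names and IDs below are illustrative):

# Sketch of the CLI-based installation; names, IDs and contexts are assumptions.
cilium install --context cilium-001 --set cluster.name=cilium-001 --set cluster.id=1
cilium install --context cilium-002 --set cluster.name=cilium-002 --set cluster.id=2
cilium clustermesh enable --context cilium-001
cilium clustermesh enable --context cilium-002
cilium clustermesh connect --context cilium-001 --destination-context cilium-002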

However, users may encounter issues with the Cilium Cluster Mesh installation via the Cilium CLI, especially if they have a larger umbrella Helm chart for all their installations. In our case, after running the cilium clustermesh connect command on top of the CLI installation, the Cilium release exceeded Helm’s release size limit of 1 MB. To overcome this, one can install Cilium Cluster Mesh manually with Helm. Let’s assume two clusters named cilium-001 and cilium-002, both with cert-manager installed. On a high level, it can be done as follows:

  1. Create the Kubernetes clusters and install Cilium in them via Helm.
# Run this against both of the Kubernetes clusters.
helm repo add cilium https://helm.cilium.io/
helm install -n kube-system cilium cilium/cilium --create-namespace --version 1.18.0
  2. Run multiple helm upgrades to register and mesh the clusters together: first install Cilium with clustermesh.useAPIServer=true, then enable clustermesh in a subsequent upgrade with the relevant configuration for all clusters. For brevity, the final configuration for cilium-002 is shown below; the configuration for cilium-001 is similar, with a different CIDR range and the certificates for cilium-002 instead.
---
# cilium-002.yaml
cilium: 
  cluster:
    name: <CILIUM-002-MASTER-NODE-NAME>  # Master node name from `kubectl get no`
    id: 002 # Cluster ID, must be unique across clusters (valid range 1-255).
  ipam:
    operator:
      clusterPoolIPv6MaskSize: 120
      clusterPoolIPv4MaskSize: 24
      clusterPoolIPv6PodCIDRList:
        - 2001:4860::0/108
      clusterPoolIPv4PodCIDRList: # Ensure each cluster in your mesh uses a different CIDR range.
        - 10.102.0.0/16
  bpf:     # Mandatory to fix issue mentioned in https://github.com/cilium/cilium/issues/20942
    masquerade: true 
  clustermesh:
    useAPIServer: true
    apiserver:
      tls:
        server:
          extraDnsNames:
            - "*.cern.ch" # If you are relying on cilium to generate your certificates.
    config:
      enabled: true
      domain: cern.ch
      clusters:
        - name: <CILIUM-001-MASTER-NODE-NAME> # Second cluster master node name from kubectl get no.
          port: 32379
          ips:
            - <CILIUM-001-MASTER-NODE-IP> # Second cluster internal IP address from kubectl get no -owide.
          tls: # Certificates can be retrieved with `kubectl get secret -n kube-system clustermesh-apiserver-remote-cert -o jsonpath='{.data}'`
            key: <APISERVER-KEY>
            cert: <APISERVER-CERT>
            caCert: <APISERVER-CA-CERT>
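
The values are then rolled out with a Helm upgrade against each cluster. A minimal sketch; note that the top-level cilium: key above suggests an umbrella chart, so with the standalone cilium/cilium chart those keys would sit at the top level of the values file instead:

# Sketch: apply the configuration above to the existing Cilium release.
# With an umbrella chart, run the upgrade against the umbrella release instead
# and keep the values nested under the "cilium:" key.
helm upgrade -n kube-system cilium cilium/cilium --version 1.18.0 -f cilium-002.yaml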

Load Balancer Setup

To enable external access to the cluster with this integration, an ingress must be deployed, which in turn automatically provisions a load balancer for the cluster. The ingress-nginx ingress controller is used for this purpose, as I encountered problems with the cluster networking when using the Cilium ingress controller (see the Troubleshooting section). Install the ingress-nginx controller with the following Helm configuration:

# cilium-002-ingress-controller.yaml
ingress-nginx:
  controller:
    nodeSelector:
      role: ingress
    service:
      enabled: true
      nodePorts:
        http: ""
        https: ""
      type: LoadBalancer
  enabled: true

Since the Helm configuration above schedules the ingress controller onto a node with the role=ingress label, we should label a node accordingly, preferably before the ingress-nginx installation:

kubectl label node <NODE-OF-CHOICE> role=ingress
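
With the node labeled, the controller can be installed or upgraded. A sketch, assuming the values above are used with the standalone ingress-nginx chart; in an umbrella chart, the top-level ingress-nginx: key stays and the upgrade is run against the umbrella release instead:

# Sketch: install ingress-nginx with the values above. With the standalone chart,
# the keys under "ingress-nginx:" move to the top level of the values file.
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm upgrade --install ingress-nginx ingress-nginx/ingress-nginx \
  -n ingress-nginx --create-namespace -f cilium-002-ingress-controller.yaml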

Next up, we should deploy the ingress, and thus the load balancer, by applying the following Ingress manifest:

# ingress-manifest.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: sample-http-ingress
  annotations:
    # This annotation added to get the setup to work
    # Read more at https://github.com/cilium/cilium/issues/25818#issuecomment-1572037533 
    # and at https://github.com/kubernetes/ingress-nginx/blob/main/docs/user-guide/nginx-configuration/annotations.md#service-upstream
    nginx.ingress.kubernetes.io/service-upstream: "true"  
spec:
  ingressClassName: nginx
  rules:
  - host:
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: api-service # Name of a global service backend. If you would deploy another service, you would need to change this name to ML-service or something else.
            port:
              number: 8080

Apply this manifest with kubectl apply -f ingress-manifest.yaml. The load balancer will be provisioned automatically, and the ingress controller will start routing traffic to the specified backend service once it exists. Apply this in both clusters. Then, configure DNS load balancing by assigning the same DNS name to the external addresses of both load balancers. This way, when clients resolve the DNS name, the DNS service distributes requests across the available load balancers. This step depends on the DNS service you are using.
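
As an illustration (the service name ingress-nginx-controller below is the chart’s default release naming and may differ in your setup), the external IPs can be read per cluster and the shared name checked for round-robin resolution:

# Read the external IP of the ingress-nginx load balancer in each cluster.
kubectl --context <CLUSTER-1-CTX> -n ingress-nginx get svc ingress-nginx-controller \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
kubectl --context <CLUSTER-2-CTX> -n ingress-nginx get svc ingress-nginx-controller \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}'

# After registering both IPs under the same DNS name, the name should resolve
# to both load balancers (round-robin).
dig +short <SHARED-DNS-NAME>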

Global Services

Automatic load balancing between clusters is achieved by defining Kubernetes ClusterIP services with identical names and namespaces and adding the annotation service.cilium.io/global: "true" to declare them global; Cilium takes care of the rest. Furthermore, since this guide uses an external ingress controller, an additional annotation is needed for the global services, namely service.cilium.io/global-sync-endpoint-slices: "true". Apply the following manifest in both clusters to create the global ClusterIP service and mock pods behind it:

# global-api-service-manifest.yaml
apiVersion: v1
kind: Service
metadata:
  # The name and namespace need to be the same across services in different clusters. This name is important as it defines the load balancing groups for Cilium.
  name: api-service
  annotations:
    # Declare the global service.
    # Read more here: https://docs.cilium.io/en/stable/network/clustermesh/services/
    service.cilium.io/global: "true"
    # Allow the service discovery with third-party ingress controllers.
    service.cilium.io/global-sync-endpoint-slices: "true"
spec:
  type: ClusterIP
  ports:
  - name: http
    protocol: TCP
    port: 8080
    targetPort: 80
  selector:
    app: api-service
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-service
spec:
  replicas: 1
  selector:
    matchLabels:
      app: api-service
  template:
    metadata:
      labels:
        app: api-service
    spec:
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80
        volumeMounts:
        - name: html
          mountPath: /usr/share/nginx/html/index.html
          subPath: index.html
      volumes:
      - name: html
        configMap:
          name: custom-index-html
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: custom-index-html
data:
  # Hello from Cluster 001 or Cluster 002, depending on the cluster.
  index.html: |
        Hello from Cluster 00x

Deploy with kubectl apply -f global-api-service-manifest.yaml, after which one can verify that the global service is working by checking that responses come from both clusters:

# Run a pod which can access the cluster services.
kubectl run curlpod --rm -it --image=busybox -- sh  
# Inside the pod's shell, query the global service. This should return "Hello from Cluster 001" or "Hello from Cluster 002" depending on which cluster the request was routed to.
wget -qO- http://api-service:8080
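
Because the service is global, consecutive requests are load balanced across the pods in both clusters, so repeating the request from inside the same pod should eventually return greetings from both clusters:

# Still inside the busybox pod: repeat the request a few times; both
# "Hello from Cluster 001" and "Hello from Cluster 002" should appear.
for i in $(seq 1 10); do wget -qO- http://api-service:8080; done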

Testing Cluster Mesh Connectivity

Now everything should be working. You can test the solution in many ways; a few methods are listed below:

  1. By using Cilium CLI (needs Cilium CLI installed):
    # To check that normal Cilium features are working.
    cilium status
    
    # Expected output
        /¯¯\
    /¯¯\__/¯¯\    Cilium:             OK
    \__/¯¯\__/    Operator:           OK
    /¯¯\__/¯¯\    Envoy DaemonSet:    OK
    \__/¯¯\__/    Hubble Relay:       OK
        \__/       ClusterMesh:        OK
    
    DaemonSet              cilium                   Desired: 2, Ready: 2/2, Available: 2/2
    DaemonSet              cilium-envoy             Desired: 2, Ready: 2/2, Available: 2/2
    Deployment             cilium-operator          Desired: 2, Ready: 2/2, Available: 2/2
    Deployment             clustermesh-apiserver    Desired: 1, Ready: 1/1, Available: 1/1
    Deployment             hubble-relay             Desired: 1, Ready: 1/1, Available: 1/1
    Deployment             hubble-ui                Desired: 1, Ready: 1/1, Available: 1/1
    Containers:            cilium                   Running: 2
                           cilium-envoy             Running: 2
                           cilium-operator          Running: 2
                           clustermesh-apiserver    Running: 1
                           hubble-relay             Running: 1
                           hubble-ui                Running: 1
    Cluster Pods:          27/27 managed by Cilium
    Helm chart version:    1.17.5
    Image versions         cilium                   quay.io/cilium/cilium:v1.17.5 2
                           cilium-envoy             quay.io/cilium/cilium-envoy:v1.32.7 2
                           cilium-operator          quay.io/cilium/operator-generic:v1.17.5 2
                           clustermesh-apiserver    quay.io/cilium/clustermesh-apiserver:v1.17.5 3
                           hubble-relay             quay.io/cilium/hubble-relay:v1.17.5 1
                           hubble-ui                quay.io/cilium/hubble-ui-backend:v0.13.2 1
                           hubble-ui                quay.io/cilium/hubble-ui:v0.13.2@sha256 1
    
    # To check if Cluster Mesh installation is working. You can run this both on cilium-001 and cilium-002 clusters, and the output should be similar in both. Example ran on cilium-001.
    cilium clustermesh status 
    
    # Expected output
    ⚠️  Service type NodePort detected! Service may fail when nodes are removed from the cluster!
    ✅ Service "clustermesh-apiserver" of type "NodePort" found
    ✅ Cluster access information is available:
      - <CILIUM-001-MASTER-IP>:32379
    ✅ Deployment clustermesh-apiserver is ready
    ℹ️  KVStoreMesh is enabled
    
    ✅ All 2 nodes are connected to all clusters [min:1 / avg:1.0 / max:1]
    ✅ All 1 KVStoreMesh replicas are connected to all clusters [min:1 / avg:1.0 / max:1]
    
    🔌 Cluster Connections:
      - <CILIUM-002-MASTER-NODE-NAME>: 2/2 configured, 2/2 connected - KVStoreMesh: 1/1 configured, 1/1 connected
    
    🔀 Global services: [ min:1 / avg:1.0 / max:1 ]
    
    # To test the Cluster Mesh connection.
    # Assumes that you have set up kubectl contexts for the clusters.
    # To test the Cluster Mesh pod connectivity.
    cilium connectivity test --context <CLUSTER-1-CTX> \
                            --destination-context <CLUSTER-2-CTX>
    
  2. Verifying that the Cilium agent sees the nodes of both clusters:
kubectl exec -it -n kube-system $(kubectl get pods -n kube-system -l k8s-app=cilium -o jsonpath='{.items[0].metadata.name}') -c cilium-agent -- cilium-dbg node list | awk '{print $1}'

Name
cilium-001-master-0/cilium-001-master-0
cilium-001-master-0/cilium-001-node-0
cilium-002-master-0/cilium-002-master-0
cilium-002-master-0/cilium-002-node-0
  3. By curling the ingress-nginx load balancer via its IP or DNS name:
    # Run either one of these two to get the IP.
    kubectl get ingress sample-http-ingress -o yaml
    openstack loadbalancer list
    # Use either IP or DNS to curl the system.
    curl <DNS-NAME> -v
    curl http://<LB-IP>:8080 -v
    
    # Expected output:
    # Hello from Cluster 001
    # Hello from Cluster 002
    

Cluster Mesh connectivity refers to Cilium’s ability to route requests to pods in both clusters. Requests are distributed across the healthy endpoints of both clusters without a fixed destination, and if the pods in the local or remote cluster break, requests are routed to the cluster that still has healthy endpoints.
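
If you would rather keep traffic in the local cluster and only spill over to the remote one when no local endpoints are healthy, Cilium also supports a global service affinity annotation; a minimal sketch of the extra annotation on the same Service (see the Cilium Cluster Mesh services documentation for the exact semantics):

# Optional sketch: prefer endpoints in the local cluster, and only route to the
# remote cluster when no healthy local endpoint exists.
metadata:
  annotations:
    service.cilium.io/global: "true"
    service.cilium.io/affinity: "local"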

The failover was tested by downscaling the replicas of the API-service in one cluster and checking that requests were routed to the other cluster. The failover worked as expected.
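
A sketch of that test, assuming kubectl contexts for both clusters and the shared DNS name from the load balancer setup:

# Simulate a failure by removing all api-service endpoints in one cluster.
kubectl --context <CLUSTER-1-CTX> scale deployment api-service --replicas=0
# Every request should now be answered by the other cluster.
curl http://<DNS-NAME>
# Restore the deployment afterwards.
kubectl --context <CLUSTER-1-CTX> scale deployment api-service --replicas=1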

Troubleshooting

  • bpf.masquerade=true in the Cilium Helm configuration is required as stated here.
  • Cilium also offers an ingress controller. I experimented with it using the following YAML:
    bpf:
      masquerade: false
    ingressController:
      enabled: true
      hostNetwork:
        enabled: true
        nodes:
          matchLabels:
            role: ingress
        sharedListenerPort: 8010
      service:
        externalTrafficPolicy: null
        type: ClusterIP
    
    However, I encountered issues with the cluster networking, and it did not work as expected: I was able to query the ingress controller from within the cluster, but when I enabled the host network and queried the node IP address, I did not get a response. The Cilium ingress controller did not require the service.cilium.io/global-sync-endpoint-slices: "true" annotation, as it is already integrated with Cilium Cluster Mesh.

Next Up

In the next article, I will cover the Active-Passive setup for databases: setting up PostgreSQL, Valkey and OpenSearch with multi-cluster data replication. The goal is to ensure that databases are continuously replicated across clusters, providing redundancy and high availability in case of cluster failures.

References