Exploring Kubernetes Service Networking
Service objects in Kubernetes are API objects used to expose Pods to other Pods and to the outside world. As the name suggests, a Service lets you treat a set of Pods as a single service that can be accessed by a common name and port, without knowing anything about the individual Pods currently running.
In this post, I'll be exploring the Service objects in Kubernetes using my own bare-metal Raspberry Pi K3s cluster, and peeking under the hood at the fairly low-level Linux networking magic that makes it all possible. We'll see how each service type we define builds on the last, with ClusterIPs as the foundation.
ClusterIP Services
A ClusterIP Service is the default Service type. It creates a virtual IP that proxies internal communication between Pods. It's easiest to think of a ClusterIP Service like a load-balancer/proxy appliance. Say we have front-end Pods that need to talk to our back-end Pods: the front-end Pods send requests to the load-balancer's virtual IP address, and the load-balancer proxies those requests to one of the back-end Pods in the pool. You might ask why we couldn't just configure our front-end Pods to talk to our back-end Pods directly by IP address. First, we could have multiple back-end Pods at any given time. Second, remember that Pods are ephemeral and their IP addresses aren't predictable; if a back-end Pod dies and a new replica is created, it will likely have a different IP address. If these were traditional servers instead of Pods, we'd put a load-balancer in front, point all of the front-end servers at a single virtual IP address on it, and let the load-balancer proxy the connections to the various back-end servers in the pool. That's exactly what a ClusterIP is: a virtual IP address (and DNS record) used inside the cluster to reach the Pods behind a Service.
To see how a ClusterIP service works in action, I created a simple NextCloud Deployment with two Pods and a matching ClusterIP Service.
# nextcloud-dev.yaml
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nextcloud-dev
  labels:
    app: nextcloud
    environment: dev
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nextcloud
      environment: dev
  template:
    metadata:
      labels:
        app: nextcloud
        environment: dev
    spec:
      containers:
      - name: nextcloud
        image: nextcloud:latest
        ports:
        - containerPort: 80
# nextcloud-dev-clusterip.yaml
---
apiVersion: v1
kind: Service
metadata:
  name: nextcloud-dev-clusterip
  labels:
    app: nextcloud
    environment: dev
spec:
  selector:
    app: nextcloud
    environment: dev
  type: ClusterIP
  ports:
  - port: 8080
    targetPort: 80
Next, I applied the Deployment and Service and confirmed everything was up.
brian@ansible-vm:~/k3s$ kubectl apply -f nextcloud-dev.yaml
deployment.apps/nextcloud-dev created
brian@ansible-vm:~/k3s$ kubectl apply -f nextcloud-dev-clusterip.yaml
service/nextcloud-dev-clusterip created
brian@ansible-vm:~/k3s$ kubectl get pods,svc -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pod/nextcloud-dev-64c9559b8c-qgjbp 1/1 Running 0 23s 10.42.1.170 k3s-worker-rpi002 <none> <none>
pod/nextcloud-dev-64c9559b8c-pc296 1/1 Running 0 23s 10.42.0.85 k3s-master-rpi001 <none> <none>
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE SELECTOR
service/kubernetes ClusterIP 10.43.0.1 <none> 443/TCP 4d <none>
service/nextcloud-dev-clusterip ClusterIP 10.43.7.203 <none> 8080/TCP 9s app=nextcloud,environment=dev
From the output above, we can see that the Pods were started on different nodes and that the ClusterIP 10.43.7.203 on port 8080 was allocated for the Service. To test accessing the Service from another Pod inside the cluster, I, of course, needed another Pod. So I started a Debian Pod interactively to get a shell and used curl to successfully download the NextCloud login page using both the ClusterIP address and the DNS name of the Service.
brian@ansible-vm:~/k3s$ kubectl run debian-bash --rm -i --tty --image debian bash
If you don't see a command prompt, try pressing enter.
root@debian-bash:/# ip addr | grep 10.42
inet 10.42.1.171/24 brd 10.42.1.255 scope global eth0
root@debian-bash:/# apt update
<...>
root@debian-bash:/# apt install curl
<...>
root@debian-bash:/# curl http://10.43.7.203:8080 > /dev/null
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 6815 100 6815 0 0 141k 0 --:--:-- --:--:-- --:--:-- 144k
root@debian-bash:/# curl http://nextcloud-dev-clusterip:8080 >> /dev/null
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 6815 100 6815 0 0 144k 0 --:--:-- --:--:-- --:--:-- 144k
Great! The login page was downloaded in both test cases. To recap what happened: I sent a request to the ClusterIP address on port 8080, and the request was proxied to one of the nextcloud-dev Pods on port 80. While this all makes sense in theory, we know there's no actual load-balancer appliance in our cluster, so how is it implemented? With a process called kube-proxy and some good ol' fashioned iptables magic, of course... No, seriously.
The kube-proxy process runs on each node and acts as the load-balancer/proxy for Kubernetes Services. In the legacy userspace mode, an iptables rule is created to forward incoming connections for Services to the userspace kube-proxy process, which then proxies them to the appropriate destinations. In iptables mode (the default), kube-proxy instead programs iptables rules so that netfilter proxies the connections directly in kernel space, without the traffic ever passing through the kube-proxy process itself.
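As an aside, kube-proxy can tell you which mode it's running in if its metrics endpoint is reachable on the node. This is just a quick, hedged check assuming the default metrics address of 127.0.0.1:10249 (K3s runs kube-proxy inside the k3s process, but the endpoint should still be served):
# Ask kube-proxy which proxy mode it's using; on this cluster it should print "iptables".
curl -s http://127.0.0.1:10249/proxyMode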
Once again, I wanted to see this in action. Since kube-proxy is running in iptables mode in my cluster, I was able to see how the proxying works by listing the iptables NAT table. I did this on k3s-worker-rpi002 as shown below (I rearranged and removed some of the output to make it easier to read).
pi@k3s-worker-rpi002:~ $ sudo iptables --table nat -L -n
Chain PREROUTING (policy ACCEPT 43 packets, 6643 bytes)
pkts bytes target prot opt in out source destination
8229 1258K KUBE-SERVICES all -- * * 0.0.0.0/0 0.0.0.0/0 /* kubernetes service portals */
<...>
Chain KUBE-SERVICES (2 references)
pkts bytes target prot opt in out source destination
<...>
0 0 KUBE-MARK-MASQ tcp -- * * !10.42.0.0/16 10.43.7.203 /* default/nextcloud-dev-clusterip: cluster IP */ tcp dpt:8080
0 0 KUBE-SVC-WNPCWQYGSYD7A5N6 tcp -- * * 0.0.0.0/0 10.43.7.203 /* default/nextcloud-dev-clusterip: cluster IP */ tcp dpt:8080
<...>
Chain KUBE-MARK-MASQ (23 references)
pkts bytes target prot opt in out source destination
0 0 MARK all -- * * 0.0.0.0/0 0.0.0.0/0 MARK or 0x4000
<...>
Chain KUBE-SVC-WNPCWQYGSYD7A5N6 (1 references)
pkts bytes target prot opt in out source destination
0 0 KUBE-SEP-REGCNUNS5RWEMTHB all -- * * 0.0.0.0/0 0.0.0.0/0 /* default/nextcloud-dev-clusterip: */ statistic mode random probability 0.50000000000
0 0 KUBE-SEP-AB7UW27KCQCMC3UG all -- * * 0.0.0.0/0 0.0.0.0/0 /* default/nextcloud-dev-clusterip: */
<...>
Chain KUBE-SEP-REGCNUNS5RWEMTHB (1 references)
pkts bytes target prot opt in out source destination
0 0 KUBE-MARK-MASQ all -- * * 10.42.0.85 0.0.0.0/0 /* default/nextcloud-dev-clusterip: */
0 0 DNAT tcp -- * * 0.0.0.0/0 0.0.0.0/0 /* default/nextcloud-dev-clusterip: */ tcp to:10.42.0.85:80
<...>
Chain KUBE-SEP-AB7UW27KCQCMC3UG (1 references)
pkts bytes target prot opt in out source destination
0 0 KUBE-MARK-MASQ all -- * * 10.42.1.170 0.0.0.0/0 /* default/nextcloud-dev-clusterip: */
0 0 DNAT tcp -- * * 0.0.0.0/0 0.0.0.0/0 /* default/nextcloud-dev-clusterip: */ tcp to:10.42.1.170:80
To simplify this even more, I walked through the effective steps the request above would have taken.
Step | Chain | Effective Matched Rule | Action |
---|---|---|---|
1. | PREROUTING | Source: ANY Destination: ANY | Jump to KUBE-SERVICES |
2. | KUBE-SERVICES | Source: Any Pod IP Destination: The ClusterIP & Port | Jump to KUBE-SVC-WNPCWQYGSYD7A5N6 |
3. | KUBE-SVC-WNPCWQYGSYD7A5N6 | Source: ANY Destination: ANY | Jump to either: KUBE-SEP-REGCNUNS5RWEMTHB or KUBE-SEP-AB7UW27KCQCMC3UG (with 50% probability) |
4. | KUBE-SEP-REGCNUNS5RWEMTHB / KUBE-SEP-AB7UW27KCQCMC3UG | Source: Not the destination Pod Destination: ANY | DNAT to the destination Pod IP on TCP port 80 |
In summary, this example shows how connections sourced from within the cluster and destined for the ClusterIP Service's address and port ultimately get destination-NATed to one of the Pods matched by our Service definition. This is implemented by the kube-proxy process on each node using iptables, and consequently netfilter, when run in iptables mode.
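To make that even more concrete, here's a minimal, hand-written sketch of the same trick. These are not the rules kube-proxy actually generates (it uses the dedicated KUBE-* chains shown above); they just express the core idea, DNAT to one of two backends chosen at random, as two standalone rules using the addresses from my cluster:
# Send roughly half of the traffic for the ClusterIP to the first Pod...
sudo iptables -t nat -A PREROUTING -p tcp -d 10.43.7.203 --dport 8080 \
  -m statistic --mode random --probability 0.5 \
  -j DNAT --to-destination 10.42.0.85:80
# ...and everything that didn't match the first rule to the second Pod.
sudo iptables -t nat -A PREROUTING -p tcp -d 10.43.7.203 --dport 8080 \
  -j DNAT --to-destination 10.42.1.170:80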
NodePort Services
While ClusterIP Services enable connections between Pods inside a cluster, NodePort Services enable connections from outside the cluster. They do this by mapping an externally reachable port on each node to a ClusterIP. Due to their simplicity, NodePorts are ideal for testing or development. However, they have a few limitations:
- The port exposed on each node must fall within a fixed range (30000–32767 by default)
- Each NodePort can be used by only one Service
- Directly accessing an application by a node's IP or DNS name isn't as resilient as the other options
To test out using a NodePort, I created this simple Service definition.
# nextcloud-dev-nodeport.yaml
---
apiVersion: v1
kind: Service
metadata:
  name: nextcloud-dev-nodeport
  labels:
    app: nextcloud
    environment: dev
spec:
  selector:
    app: nextcloud
    environment: dev
  type: NodePort
  ports:
  - port: 80
Next, I removed the ClusterIP Service I created earlier and applied the new NodePort Service.
brian@ansible-vm:~/k3s$ kubectl delete service nextcloud-dev-clusterip
service "nextcloud-dev-clusterip" deleted
brian@ansible-vm:~/k3s$ kubectl apply -f nextcloud-dev-nodeport.yaml
service/nextcloud-dev-nodeport created
brian@ansible-vm:~/k3s$ kubectl get pods,svc -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pod/nextcloud-dev-64c9559b8c-qgjbp 1/1 Running 0 3d5h 10.42.1.170 k3s-worker-rpi002 <none> <none>
pod/nextcloud-dev-64c9559b8c-pc296 1/1 Running 0 3d5h 10.42.0.85 k3s-master-rpi001 <none> <none>
pod/debian-bash 1/1 Running 0 10h 10.42.1.171 k3s-worker-rpi002 <none> <none>
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE SELECTOR
service/kubernetes ClusterIP 10.43.0.1 <none> 443/TCP 9d <none>
service/nextcloud-dev-nodeport NodePort 10.43.3.110 <none> 80:32540/TCP 54s app=nextcloud,environment=dev
After applying the NodePort Service, we can see above that it was given a ClusterIP of 10.43.3.110 and that the internal TCP port 80 was mapped to external port 32540, which is exposed on every node. With this Service, I was able to access the NextCloud login page using the IP address of any node in the cluster on TCP port 32540.
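By default, Kubernetes picks the external port at random from the allowed range (32540 in this case). If you want a predictable port, you can pin it in the Service spec yourself. Here's a small, hypothetical variation on the definition above, assuming 30080 is free and within the allowed range:
# nextcloud-dev-nodeport.yaml (hypothetical variation with a pinned node port)
---
apiVersion: v1
kind: Service
metadata:
  name: nextcloud-dev-nodeport
  labels:
    app: nextcloud
    environment: dev
spec:
  selector:
    app: nextcloud
    environment: dev
  type: NodePort
  ports:
  - port: 80          # ClusterIP port inside the cluster
    targetPort: 80    # container port on the Pods
    nodePort: 30080   # fixed external port on every node (must be within the range)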
So, a NodePort effectively just maps an external port on each node to an internal ClusterIP Service within the cluster. That's great, but requiring users to type a high port number into their browser isn't a great experience. Plus, if we point users at only one of the nodes and that node fails or is rebooted, there's an outage. To avoid these issues, we'd really need to have a load-balancer appliance in front of the cluster to balance traffic across all the nodes. Well, Kubernetes has that covered too.
LoadBalancer Services
A LoadBalancer Service in Kubernetes is exactly what I just described. It simply provisions an external load-balancer that balances traffic across all of the cluster nodes where a NodePort is created. In cloud environments, the external load balancer is provisioned by Kubernetes via the respective cloud-provider's APIs. In this way, LoadBalancer Services can be defined just like any other Service within Kubernetes, irrespective of which public cloud it's hosted in.
In bare-metal clusters like my Raspberry Pi cluster, however, there's no cloud provider that Kubernetes can call to provision a load-balancer for us. Instead, the best option is MetalLB, a load-balancer implementation for bare-metal clusters that integrates with Kubernetes. The MetalLB website explains this further:
Kubernetes does not offer an implementation of network load-balancers (Services of type LoadBalancer) for bare metal clusters. The implementations of Network LB that Kubernetes does ship with are all glue code that calls out to various IaaS platforms (GCP, AWS, Azure…). If you’re not running on a supported IaaS platform (GCP, AWS, Azure…), LoadBalancers will remain in the “pending” state indefinitely when created.
MetalLB is designed for bare-metal Kubernetes deployments and installs directly in the cluster. To install MetalLB, I followed the installation-by-manifest steps in the MetalLB documentation.
brian@ansible-vm:~/k3s$ kubectl apply -f https://raw.githubusercontent.com/metallb/metallb/v0.9.3/manifests/namespace.yaml
namespace/metallb-system created
brian@ansible-vm:~/k3s$ kubectl apply -f https://raw.githubusercontent.com/metallb/metallb/v0.9.3/manifests/metallb.yaml
podsecuritypolicy.policy/controller created
podsecuritypolicy.policy/speaker created
serviceaccount/controller created
serviceaccount/speaker created
clusterrole.rbac.authorization.k8s.io/metallb-system:controller created
clusterrole.rbac.authorization.k8s.io/metallb-system:speaker created
role.rbac.authorization.k8s.io/config-watcher created
role.rbac.authorization.k8s.io/pod-lister created
clusterrolebinding.rbac.authorization.k8s.io/metallb-system:controller created
clusterrolebinding.rbac.authorization.k8s.io/metallb-system:speaker created
rolebinding.rbac.authorization.k8s.io/config-watcher created
rolebinding.rbac.authorization.k8s.io/pod-lister created
daemonset.apps/speaker created
deployment.apps/controller created
brian@ansible-vm:~/k3s$ kubectl create secret generic -n metallb-system memberlist --from-literal=secretkey="$(openssl rand -base64 128)"
secret/memberlist created
To test out a LoadBalancer Service, I first needed to disable the service load-balancer (servicelb) that ships with K3s. To do that, I edited the systemd unit file on the master node at /etc/systemd/system/k3s.service, added the --disable servicelb option to the ExecStart directive, and restarted K3s. After restarting, I confirmed that servicelb was no longer running.
pi@k3s-master-rpi001:~ $ sudo nano /etc/systemd/system/k3s.service
pi@k3s-master-rpi001:~ $ sudo tail -3 /etc/systemd/system/k3s.service
ExecStart=/usr/local/bin/k3s \
server \
--disable servicelb
pi@k3s-master-rpi001:~ $ sudo systemctl daemon-reload
pi@k3s-master-rpi001:~ $ sudo systemctl restart k3s
pi@k3s-master-rpi001:~ $ exit
logout
Received SIGHUP or SIGTERM
Connection to k3s-master-rpi001 closed.
brian@ansible-vm:~/k3s$ kubectl get all -n kube-system
NAME READY STATUS RESTARTS AGE
pod/helm-install-traefik-dprrp 0/1 Completed 2 11d
pod/metrics-server-7566d596c8-5jhqw 1/1 Running 9 11d
pod/local-path-provisioner-6d59f47c7-w6tq5 1/1 Running 58 11d
pod/coredns-8655855d6-brz7h 1/1 Running 9 11d
pod/traefik-758cd5fc85-m7kvw 1/1 Running 10 11d
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/kube-dns ClusterIP 10.43.0.10 <none> 53/UDP,53/TCP,9153/TCP 11d
service/metrics-server ClusterIP 10.43.143.113 <none> 443/TCP 11d
service/traefik-prometheus ClusterIP 10.43.243.252 <none> 9100/TCP 11d
service/traefik LoadBalancer 10.43.212.219 192.168.2.114 80:32452/TCP,443:32320/TCP 11d
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/metrics-server 1/1 1 1 11d
deployment.apps/coredns 1/1 1 1 11d
deployment.apps/local-path-provisioner 1/1 1 1 11d
deployment.apps/traefik 1/1 1 1 11d
NAME DESIRED CURRENT READY AGE
replicaset.apps/metrics-server-7566d596c8 1 1 1 11d
replicaset.apps/coredns-8655855d6 1 1 1 11d
replicaset.apps/local-path-provisioner-6d59f47c7 1 1 1 11d
replicaset.apps/traefik-758cd5fc85 1 1 1 11d
NAME COMPLETIONS DURATION AGE
job.batch/helm-install-traefik 1/1 73s 11d
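As an aside, if you're standing up a cluster from scratch, the same flag can be passed to the K3s install script instead of editing the unit file after the fact. A hedged sketch, assuming the standard installer:
# Hypothetical install-time alternative to editing k3s.service by hand.
curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="server --disable servicelb" sh -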
Next, I deleted the NodePort Service from earlier, created a new LoadBalancer Service definition, and applied it.
brian@ansible-vm:~/k3s$ kubectl delete service nextcloud-dev-nodeport
service "nextcloud-dev-nodeport" deleted
brian@ansible-vm:~/k3s$ nano nextcloud-dev-loadbalancer.yaml
brian@ansible-vm:~/k3s$ cat nextcloud-dev-loadbalancer.yaml
# nextcloud-dev-loadbalancer.yaml
---
apiVersion: v1
kind: Service
metadata:
  name: nextcloud-dev-loadbalancer
  labels:
    app: nextcloud
    environment: dev
spec:
  selector:
    app: nextcloud
    environment: dev
  type: LoadBalancer
  ports:
  - port: 80
brian@ansible-vm:~/k3s$ kubectl apply -f nextcloud-dev-loadbalancer.yaml
service/nextcloud-dev-loadbalancer created
brian@ansible-vm:~/k3s$ kubectl get services -o wide
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE SELECTOR
kubernetes ClusterIP 10.43.0.1 <none> 443/TCP 11d <none>
nextcloud-dev-loadbalancer LoadBalancer 10.43.208.81 <pending> 80:31114/TCP 2m27s app=nextcloud,environment=dev
The LoadBalancer Service started and we can see the Cluster-IP and ports above. However, the External IP is stuck in pending. That's because I didn't define a pool of IPs for MetalLB to use. This is done with a ConfigMap as explained in the MetalLB documentation.
brian@ansible-vm:~/k3s$ nano metallb-config.yaml
brian@ansible-vm:~/k3s$ cat metallb-config.yaml
# metallb-config.yaml
---
apiVersion: v1
kind: ConfigMap
metadata:
  namespace: metallb-system
  name: config
data:
  config: |
    address-pools:
    - name: default
      protocol: layer2
      addresses:
      - 192.168.2.200-192.168.2.225
brian@ansible-vm:~/k3s$ kubectl apply -f metallb-config.yaml
configmap/config created
brian@ansible-vm:~/k3s$ kubectl get svc -o wide
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE SELECTOR
kubernetes ClusterIP 10.43.0.1 <none> 443/TCP 11d <none>
nextcloud-dev-loadbalancer LoadBalancer 10.43.208.81 192.168.2.200 80:31114/TCP 12m app=nextcloud,environment=dev
As shown above, after applying the ConfigMap, the external IP was now populated with the first IP in the defined range, 192.168.2.200. Using the external IP and port, I was now able to access the NextCloud login page without directly accessing a NodePort on a single node.
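MetalLB hands out the first free address in the pool by default. If you'd rather pin a Service to a particular address, MetalLB of this vintage honors the spec.loadBalancerIP field. A sketch, assuming 192.168.2.210 is an unused address inside the configured pool:
# Hypothetical variation: request a specific address from MetalLB's pool.
---
apiVersion: v1
kind: Service
metadata:
  name: nextcloud-dev-loadbalancer
  labels:
    app: nextcloud
    environment: dev
spec:
  selector:
    app: nextcloud
    environment: dev
  type: LoadBalancer
  loadBalancerIP: 192.168.2.210   # must fall within a MetalLB address pool
  ports:
  - port: 80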
With MetalLB configured in layer 2 mode, the way this works is a bit interesting. It's explained very well in the MetalLB documentation which states:
In layer 2 mode, one node assumes the responsibility of advertising a service to the local network. From the network’s perspective, it simply looks like that machine has multiple IP addresses assigned to its network interface.
What that effectively means is that in layer 2 mode there's no actual load-balancing of network traffic across the nodes. One node is elected the leader (the MetalLB speakers coordinate this election amongst themselves, which is what the memberlist secret created earlier is for) and it answers ARP requests for the IP addresses in the pool. All of the traffic for a service therefore lands on a single node, but if that node fails, a different node takes over as leader. So what we're getting in layer 2 mode is some much-needed redundancy, plus some additional DNAT to hide the NodePort's high port range. For a home cluster, this is a simple and effective solution that works well. In BGP mode (MetalLB's layer 3 mode), MetalLB instead relies on an external BGP router to do the load-balancing using Equal-Cost Multi-Path (ECMP) routing. Since I'm using a MikroTik router to connect the nodes in my cluster, this may be something I explore more later. For now though, layer 2 mode supports my needs well.
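If I do go down the BGP route, the configuration lives in the same ConfigMap. Here's a hedged sketch using MetalLB's v0.9 config format; the peer address, ASNs, and address pool below are made up for illustration:
# metallb-config.yaml (hypothetical BGP-mode variant)
---
apiVersion: v1
kind: ConfigMap
metadata:
  namespace: metallb-system
  name: config
data:
  config: |
    peers:
    - peer-address: 192.168.2.1   # hypothetical address of the MikroTik router
      peer-asn: 64512             # hypothetical ASN configured on the router
      my-asn: 64513               # hypothetical ASN used by the MetalLB speakers
    address-pools:
    - name: default
      protocol: bgp
      addresses:
      - 10.55.0.0/24              # hypothetical routed pool advertised via BGP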
Services Summary
To recap, we've seen how each of the Service types we've talked about builds on the last. ClusterIPs are the foundation and are used to proxy connections within the cluster; this is implemented by the kube-proxy process running on every node. NodePorts map internal ClusterIPs to external ports on each node. In this way they expose Pods to the outside world, but they offer no redundancy and can only expose a range of high ports. LoadBalancers solve the problems of NodePorts by distributing traffic across all the nodes where NodePorts are exposed. To visualize these concepts and make them more concrete, consider this diagram showing how a ClusterIP and NodePort are used when a LoadBalancer Service is created.