Kubernetes Services: Achieving optimal performance is elusive
This blog shares the experiences and experimentation results from deploying an on-premise Kubernetes service project.
The humble beginning
Due to certain management policies at work, it was decided to deploy this particular app as a cloud-native on-prem service. The DevOps folks jumped at this fantastic opportunity. All good: the app was containerized and deployed using a Deployment. We chose a modest flannel-based networking model, which did a great job. Kill one pod and another spawned automatically, and networking worked flawlessly. Meanwhile, the Dev team finished building the front-end client. Access to the service seemed easy: with four nodes deployed, just use four node-IP/node-port combinations to reach it. Nobody really bothered with introducing a real load balancer at this stage, since the consensus was that it could be easily handled in the client-side app.
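For reference, the original setup looked roughly like the sketch below. This is a minimal sketch, not our actual manifests: the app name, image, ports and replica count are placeholders.

```
# Minimal sketch of the initial setup: a Deployment plus a NodePort Service,
# so clients hit <any-node-IP>:30080 directly. Names and image are placeholders.
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-app                 # hypothetical name
spec:
  replicas: 4
  selector:
    matchLabels: { app: demo-app }
  template:
    metadata:
      labels: { app: demo-app }
    spec:
      containers:
      - name: demo-app
        image: registry.local/demo-app:latest   # placeholder image
        ports:
        - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: demo-app
spec:
  type: NodePort
  selector:
    app: demo-app
  ports:
  - port: 8080
    targetPort: 8080
    nodePort: 30080              # clients then use <node-IP>:30080
EOF
```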
The present
We soon realized what a bad design call that was. At times the node IP addresses changed for one reason or another, while at other times some nodes were heavily loaded and yielded poor performance. The front-end client was fast turning into an advanced load-balancing programming challenge. Luckily, better sense prevailed and someone suggested using a Service of type LoadBalancer, the way the hyperscalers do. MetalLB was the natural choice at the time, and we were already running kube-proxy in IPVS mode as part of our flannel setup, which suited MetalLB just fine. (For starters, kube-proxy is the duct tape of the K8s world. By default it uses iptables, which is notoriously difficult to scale; we used IPVS mode, which supposedly has better numbers.)
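The MetalLB side of things looked roughly like the sketch below. The address pool is purely illustrative; the one detail worth noting is that MetalLB's layer-2 mode requires strictARP when kube-proxy runs in IPVS mode.

```
# Sketch of the MetalLB layer-2 setup. The VIP range is illustrative only.
# With kube-proxy in IPVS mode, set mode: "ipvs" and ipvs.strictARP: true
# in the kube-proxy ConfigMap, then restart the kube-proxy pods.
kubectl -n kube-system edit configmap kube-proxy

kubectl apply -f - <<'EOF'
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: demo-pool
  namespace: metallb-system
spec:
  addresses:
  - 192.168.10.240-192.168.10.250     # illustrative VIP range
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: demo-l2
  namespace: metallb-system
spec:
  ipAddressPools:
  - demo-pool
---
apiVersion: v1
kind: Service
metadata:
  name: demo-app-lb
spec:
  type: LoadBalancer              # MetalLB assigns a VIP from demo-pool
  selector:
    app: demo-app
  ports:
  - port: 8080
    targetPort: 8080
EOF
```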
The problem
The Dev team was quite satisfied not having to deal with the load-balancing nightmare themselves. The VIP of this service was fed to a local DNS server, and clients could then use the service name to reach the cluster. All was good until some users kept complaining about sluggish performance at random intervals. When the same service ran as a monolithic app on a bare-metal server, users got satisfactory performance every time. So, something was definitely amiss.
Going under the hood
It was high time to look into the details and figure out what we might have overlooked. We set up a bare-bones K8s cluster for debugging: a single master and a single worker, with only one instance (Pod) of our application. The root cause was soon discovered: performance was always bad when the VIP and the selected LB endpoint ended up on different nodes. The VIP is assigned to a node based on the load balancer's internal logic, and that node serves all the incoming requests. The endpoint Pods are simply selected round-robin, or by other policies, as per the IPVS rules. And VIPs float between nodes depending on node/pod failovers.
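The mismatch is easy to spot with a few standard commands. A sketch, assuming the placeholder names from earlier; the point is to compare the node announcing the VIP with the node actually running the endpoint Pod:

```
# Which node is announcing the VIP? MetalLB's speaker records an
# "announcing from node ..." event on the Service.
kubectl get svc demo-app-lb -o wide
kubectl describe svc demo-app-lb | grep -i announc

# Which node is actually running the endpoint Pod?
kubectl get pods -l app=demo-app -o wide

# On the node holding the VIP, inspect what kube-proxy programmed:
sudo ipvsadm -Ln                  # IPVS virtual servers and real servers
ip addr show kube-ipvs0           # the VIP appears on kube-proxy's dummy interface
```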
A simple iperf test showed a drop of roughly 85% in throughput (4 Gbps -> ~600 Mbps) depending on how Pods were placed and selected by the LB.
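For anyone who wants to reproduce the numbers, the measurement was a plain iperf3 run against the VIP, roughly along these lines (a sketch; the iperf image and names are placeholders, not our exact commands):

```
# iperf3 server behind a LoadBalancer VIP (sketch; image/names are placeholders)
kubectl run iperf-server --image=networkstatic/iperf3 --port=5201 -- -s
kubectl expose pod iperf-server --type=LoadBalancer --port=5201

# From a client outside the cluster, test against the assigned VIP
VIP=$(kubectl get svc iperf-server -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
iperf3 -c "$VIP" -t 30
```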
Exploring alternatives
Back to the drawing board! How about replacing flannel with another CNI that has built-in service-type LB support? Flannel had served us well and its simplicity was perfect for our org; why change something through no fault of its own. Frankly, at this point we needed something to compare against in order to draw conclusions. Long story short, we decided to give LoxiLB a try. I had experimented with it before for some blogs and had a relatively positive experience, and no other option seemed viable enough anyway. We chose LoxiLB's in-cluster mode, which closely resembled our earlier setup, and we were pleasantly surprised by the results. LoxiLB is based on eBPF and seemingly bypasses some of the bottlenecks that MetalLB (via IPVS) runs into.
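For comparison, once kube-loxilb was running in-cluster, switching a Service over needed little more than a different loadBalancerClass. This is a sketch based on LoxiLB's documented conventions; the lbmode annotation value is an assumption for our setup and worth double-checking against the current docs:

```
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: demo-app-lb
  annotations:
    loxilb.io/lbmode: "fullnat"          # assumption: the NAT mode we tested; verify in LoxiLB docs
spec:
  loadBalancerClass: loxilb.io/loxilb    # hand this Service to kube-loxilb instead of MetalLB
  type: LoadBalancer
  selector:
    app: demo-app
  ports:
  - port: 8080
    targetPort: 8080
EOF
```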
There were some surprising finds here. First, worst-case throughput jumped from ~600 Mbps with MetalLB to ~3 Gbps. Second, even in the optimal scenario there was a gain of about 1 Gbps compared to MetalLB. Lastly, a performance drop was still seen depending on how Pods were placed and selected, but an overall improvement of ~70% in system performance was achieved.
Conclusion
MetalLB is an awesome project, but it only announces the VIPs; under the hood the actual load balancing is left to kube-proxy, which in our setup used the Linux kernel's IPVS. IPVS was developed as an alternative to iptables for load-balancing use cases. But eBPF is a game-changer for sure.
This post is not about declaring LoxiLB the winner, but about how to troubleshoot and pick the right tools for one's particular use case(s). Transitioning to a cloud-native architecture does not automatically guarantee great performance and availability; sound system-design principles are a must right from the beginning. Please visit this repo and follow the instructions to recreate this experimental test setup.