A hyperconverged minilab cluster with Ceph and K3s - part 1
My current pet project, in addition to the rest of my homelab, is a small rack to explore the, really, absurdly complicated world of ‘hyperconverged.’ It’s a silly marketing buzzword that simply means ‘getting a single piece of hardware to run as many things as possible, without being an insecure and flaky monolith.’
Well, I’ve got plenty of experience with insecure and flaky monoliths, so as the old saying goes, ‘how hard can it be?’
A little backstory
The funny thing here is that I started this as two separate projects. On the one hand, I wanted to learn about Ceph. For those who haven’t encountered it yet, Ceph is a clustered storage system based on commodity x86-64 hardware, and it is both extremely high performance and extremely redundant. Where a RAID is designed to tolerate the loss of a HDD, Ceph is designed to tolerate the loss of an entire server. With enough storage devices, Ceph is capable of maxing out high-throughput (40Gbps and above) network links. One of the largest Ceph clusters in existence supports CERN - one of my previous jobs was working for a GridPP Tier 1 provider, where the data gathered from the LHC would be shipped to us and stored first on tape, then on a 1,000-machine Ceph cluster that, when I left a few years ago, was approaching 80PB of disk space. Any two servers could fall over simultaneously and the systems reading from the cluster wouldn’t notice. When you’re into Petascale, moving away from monolithic machines makes an awful lot of sense, so Ceph has been on my to-learn list for a long time.
The second project was Kubernetes. Now, I’m not the biggest fan of containers - whilst they are an absolute boon for development as it means you can spin up any conceivable dev environment and have it remain completely unchanged even while the OS around it evolves, I dislike the trend of releasing applications exclusively as containers. I totally understand why - for the same reason as development, it gives the maintainer full control of all libraries and runtimes. But that’s a double-edged sword - containers are designed to be static. Libraries have security holes and need to be updated. Docker traditionally ran not just the application in the container as root but the container engine itself as root, so container escapes had the potential to take over the host; this has changed a lot with rootless Docker, but the fact that this was the standard for years does speak to the design. And finally, with everything being released as containers, the developer makes certain assumptions about your environment - you have local storage on the container runner, for example. Woe betide you if you want to use NFS shares - getting UIDs and GIDs to line up is an exercise in futility. Props to those projects that give you a Development instruction page that doesn’t depend on Docker - I run a couple of such projects in Dev mode in a VM, outside a container, and for the most part it works. But others, I’ve found them absolutely impossible to run outside a container for myriad, mind-bending reasons (I’m looking at you, AWX!).
But that’s a sidetrack. Anyway, love or loathe them, containers have advantages. Kubernetes, of course, was created by someone who looked at Docker and thought, ‘hey, this isn’t nearly complicated enough!’ The layers of abstraction required make my head spin - I run out of fingers trying to count the layers between the application and its storage!! There are many projects out there that try to simplify pure Kubernetes while distilling the fundamentals, which on their own are quite intriguing - a self-maintaining, self-healing application cluster that can scale applications as needed for the workload, with many different storage options and automatic internal routing so it really doesn’t matter where any container is running. Loss of a processing node is as inconsequential as it is in Ceph - the workload is brought up on a different node. I liked the idea but the absurd complexity was too much, until I found K3s - for those who don’t get the name, Kubernetes is commonly abbreviated K8s, which means ‘K-eight letters-s’. That took me a while to figure out too, and is the same as ‘Internationalisation’ being abbreviated as ‘i18n’ - the number means ‘too many frikkin’ letters!’ So anyway, K3s is cut-down K8s - the 8 is cut in half to a 3!
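For anyone who wants the abbreviation rule spelled out, here is a tiny Python sketch. The function name numeronym is made up for illustration and is not part of any library; it just keeps the first and last letters and counts what sits between them.

```python
def numeronym(word: str) -> str:
    """Abbreviate a word as first letter + number of letters in between + last letter."""
    if len(word) < 4:
        # Too short to be worth abbreviating, so hand it back unchanged.
        return word
    return f"{word[0]}{len(word) - 2}{word[-1]}"


print(numeronym("kubernetes"))            # k8s
print(numeronym("internationalisation"))  # i18n
```

Note that K3s does not follow this rule - it is a play on the K8s name rather than a numeronym of any real word.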
When I started looking at K3s, I could have run a bunch of VMs, but I didn’t really want to - K3s seems to have its greatest advantage ‘on the metal’ because it makes the best use of hardware. I also didn’t really want to add yet another abstraction layer to an already considerable call stack! No matter how well the engine handles these layers, my mental model doesn’t! Now, as Raspberry Pis began to increase in price around 2020, I started collecting Dell Wyse 3040s. These are neat little thin clients which are actually full x86 PCs underneath, about the size of a deck of cards. They run on 5V and can be powered by USB. They have gigabit ethernet and USB 3.0, and can output 2 displays. They’re passively cooled. What’s not to love? Well, they do only have 2GB of DDR3 and 8GB of eMMC, which is a bit limiting, but they draw just 1-2W at idle, which is honestly better than current Pis. Power consciousness is unavoidable in a homelab, after all.
At the time, I was making use of iSCSI for my lab’s primary storage. So I thought, why not use that as the storage backend? Unfortunately, my desire to not use e.g. NFS because it would require some kind of static server (which would defeat the point of a cluster) led me down a path of GFS2. If I thought K8s was hard, GFS2 redefined for me what ‘difficult’ is - GFS2 is a clustered filesystem, which means it can be mounted on multiple hosts simultaneously. Doesn’t sound difficult, right? Except that disk filesystems are specifically not designed to do this - the host assumes it has the ‘canonical’ view of a local filesystem and nothing else can change its view of what state the filesystem is in. Local filesystems like ext4 don’t have a mechanism to refresh the local view - they have to be unmounted and remounted for that. Sure, multiple systems can read from a shared block device no problem, but when one of them starts writing to it, there’s no way to tell the others that something just changed. GFS2 handles this with a bunch of additional daemons, but I never got it to work properly; by the time I realised just how much was involved in making a clustered filesystem work, I had so many supporting daemons running that there honestly wasn’t much RAM left for actually doing anything with it.
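The stale-view problem is easier to see in a toy model. The Python sketch below is purely illustrative: SharedDisk and NaiveHost are invented names, and this is not how ext4 or GFS2 are actually implemented. It only mimics a host that reads the metadata once at mount time and trusts its own cache from then on.

```python
class SharedDisk:
    """A block device visible to several hosts, e.g. a shared iSCSI LUN."""

    def __init__(self):
        self.file_table = ["a.txt"]


class NaiveHost:
    """Mimics a local filesystem: reads metadata at mount, then trusts its cache."""

    def __init__(self, disk):
        self.disk = disk
        self.cache = []

    def mount(self):
        # The only time this host looks at what is really on the disk.
        self.cache = list(self.disk.file_table)

    def create(self, name):
        # Update the local view, then write the whole view back to disk.
        self.cache.append(name)
        self.disk.file_table = list(self.cache)

    def listdir(self):
        return list(self.cache)


def demo():
    disk = SharedDisk()
    host_a, host_b = NaiveHost(disk), NaiveHost(disk)
    host_a.mount()
    host_b.mount()

    host_a.create("b.txt")
    stale_view = host_b.listdir()  # host B never hears about b.txt

    host_b.create("c.txt")         # host B writes back its stale view...
    return stale_view, list(disk.file_table)  # ...and b.txt is gone from disk


stale_view, on_disk = demo()
print(stale_view)  # ['a.txt']
print(on_disk)     # ['a.txt', 'c.txt']
```

The second write does more than miss an update - it silently throws away the other host's file. Avoiding that is the job of the locking and coordination daemons a clustered filesystem has to run.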
So that got shelved and the whole idea stalled because I didn’t have a storage backend for it.
And I am too embarrassed to admit how long it took me to join the dots here.
Ceph + K3s
On paper, a perfect match. Ceph is built on RADOS, a distributed object store in which any change written on one node is replicated to the others; RBD (RADOS Block Device) is the block storage layer that sits on top of it. Under the hood, Ceph uses LVM to prepare the disks it manages, but the replication is Ceph’s own work rather than that of an external tool like DRBD, and the upshot is that any storage volume is available from any node. On top of this, you can either use Ceph’s native storage driver on *nix systems, or add overlays that emulate more traditional network storage like NFS and SMB (the latter still a WIP) so the client does not need to be aware of Ceph, or even use S3-compatible object storage via its integrated gateway. It’s pretty flexible.
Ceph maintains a native storage driver for Kubernetes, too. The driver adheres to the Container Storage Interface (CSI) specification so storage can be automatically allocated, mounted and destroyed by the cluster. There are two forms available - RBD and CephFS. RBD is the simplest, though it may be more limited depending on your use case - it will create a fixed-size block storage device with its own filesystem (e.g. ext4) and give that to the container. CephFS adds an abstraction layer allowing containers to simply write files directly to it, but it comes at a complexity cost on the cluster side; for CephFS to function in a cluster that only deals in block devices, you need to add Metadata Servers (MDS). These are additional daemons that store the filesystem state. A disadvantage is that if all MDS daemons go offline, CephFS is dead and unusable, and if my understanding is correct, if you lose the daemons completely, you lose all data in CephFS. For this reason, I went with RBD storage for my container storage pool.
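To give a feel for what the RBD option looks like from the Kubernetes side, here is a minimal Python sketch that builds a StorageClass manifest for the ceph-csi RBD driver and prints it as JSON. The provisioner name and parameter keys follow the ceph-csi examples as far as I know them, so check them against the version you deploy. Every value passed in at the bottom (class name, cluster ID, pool, secret name, namespace) is a placeholder and not taken from this cluster.

```python
import json


def rbd_storage_class(name, cluster_id, pool, secret_name, secret_namespace):
    """Build a StorageClass manifest for the ceph-csi RBD driver as a plain dict."""
    return {
        "apiVersion": "storage.k8s.io/v1",
        "kind": "StorageClass",
        "metadata": {"name": name},
        "provisioner": "rbd.csi.ceph.com",
        "parameters": {
            "clusterID": cluster_id,   # normally the Ceph cluster fsid
            "pool": pool,              # the RBD pool volumes are carved from
            "imageFeatures": "layering",
            # Filesystem created on each fixed-size block device.
            "csi.storage.k8s.io/fstype": "ext4",
            "csi.storage.k8s.io/provisioner-secret-name": secret_name,
            "csi.storage.k8s.io/provisioner-secret-namespace": secret_namespace,
            "csi.storage.k8s.io/node-stage-secret-name": secret_name,
            "csi.storage.k8s.io/node-stage-secret-namespace": secret_namespace,
        },
        "reclaimPolicy": "Delete",
    }


# Placeholder values only - substitute your own.
manifest = rbd_storage_class(
    name="ceph-rbd",
    cluster_id="<your-ceph-fsid>",
    pool="k3s-pool",
    secret_name="csi-rbd-secret",
    secret_namespace="default",
)
print(json.dumps(manifest, indent=2))
```

kubectl accepts JSON as well as YAML, so the output can be piped into kubectl apply -f - once the CSI driver and its secret are in place.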
While digging into this, I discovered something almost by accident that made this whole setup much more practical. As the 3040s only have 8GB of onboard eMMC storage (literally enough for their intended use of booting a tiny Linux environment to connect onwards to a real machine), that really isn’t a lot for a full OS. Debian can be squeezed into it, maybe 3 or 4 GB with a comfortable amount of tools available to you, but that leaves a problem. Containers still need to be stored on disk somewhere. And for an effective cluster, what’s left over after installing the K3s engine really isn’t enough space for more than 1 or 2. Well, remember how I make use of iSCSI in my lab setups? Turns out Ceph does this as well. And it’s actually brilliant at it.
Ceph’s iSCSI implementation is surprisingly easy to use - far more intuitive than SCST on plain Devuan, that’s for sure!! You create a Replicated pool, initialise it for RBD use, set up the gateway daemons (very little config needed) and then everything is configured through a curses interface. Unlike my previous experiences with iSCSI which made LUNs available to all connected clients, the Ceph setup specifically ties LUNs to iSCSI initiators, each of which has its own CHAP authentication. Therefore, whenever one initiator (client) connects to a target (storage server), it only ever sees whatever storage block it’s been assigned. And the icing on the cake here? iSCSI LUNs are fully replicated between Ceph nodes just like regular Ceph storage, and it does native multipathing without any additional fuss. So if you configure an iSCSI gateway on each of the Ceph storage nodes, you can connect a client to any one of them to access its assigned storage, make whatever changes you like, fail a storage node and, on paper, the client will fail over to a different storage node with no loss of data.
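The per-initiator mapping and the failover behaviour can be sketched as a toy model in Python. This is not Ceph's gateway code or API: the class names, IQNs, credentials and image names are all invented for illustration, and real CHAP is a challenge-response exchange rather than the plain comparison used here.

```python
from dataclasses import dataclass, field


@dataclass
class Initiator:
    iqn: str
    chap_user: str
    chap_secret: str
    luns: list = field(default_factory=list)


class Gateway:
    """One iSCSI gateway; all gateways share the same client-to-LUN mapping."""

    def __init__(self, name, clients):
        self.name = name
        self.clients = clients
        self.online = True

    def login(self, iqn, user, secret):
        if not self.online:
            raise ConnectionError(f"{self.name} is down")
        client = self.clients.get(iqn)
        # Simplified check; real CHAP never sends the secret itself.
        if client is None or (user, secret) != (client.chap_user, client.chap_secret):
            raise PermissionError("CHAP authentication failed")
        # An initiator only ever sees the LUNs assigned to it.
        return list(client.luns)


def login_any(gateways, iqn, user, secret):
    """Try each gateway in turn, roughly what multipathing does when a path dies."""
    for gw in gateways:
        try:
            return gw.name, gw.login(iqn, user, secret)
        except ConnectionError:
            continue
    raise ConnectionError("no gateway reachable")


clients = {
    "iqn.2024-01.lab.example:worker1": Initiator(
        "iqn.2024-01.lab.example:worker1", "worker1", "secret-one", ["k3s-pool/worker1"]),
    "iqn.2024-01.lab.example:worker2": Initiator(
        "iqn.2024-01.lab.example:worker2", "worker2", "secret-two", ["k3s-pool/worker2"]),
}
gateways = [Gateway("gw1", clients), Gateway("gw2", clients)]

gateways[0].online = False  # fail the first storage node
print(login_any(gateways, "iqn.2024-01.lab.example:worker1", "worker1", "secret-one"))
# ('gw2', ['k3s-pool/worker1'])
```

Because every gateway answers from the same mapping and the same replicated storage, it makes no difference which one a client ends up talking to.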
Initially I figured the best approach would be to install the OS onto the eMMC as before, then create 50GB iSCSI LUNs per worker and mount that at /var/lib/rancher (the K3s working location). And while this did work, I realised when Debian 13 was released that there wasn’t enough eMMC space to actually upgrade the OS - there literally wasn’t enough free space left to download the upgrade packages, even with K3s on a separate disk. So the first thought was to create another LUN to give me some additional space to do the OS upgrade, but I also remembered that eMMCs are more or less SD cards - they don’t tolerate extensive writes and will wear out in short order, which isn’t ideal for a workload that’s going to be logging extensively.
So then came the lightbulb moment - Debian can boot from iSCSI, right? What if I put all this together and boot the workers directly from the Ceph cluster? Thus turning them into little worker bees that have minimal local state and can be swapped out as needed. All the important state is on the Ceph cluster, which handles replication and failover, and I can expand it if I need to.
And as you can probably tell because I’m having to write a blog about it, it was a huge effort, but it is possible and against the odds it does work!
Up next - hardware and design.