Project Sunshine, Part 3: The Software
September 27, 2025
Well, I built the bloody thing. After my new hard drives arrived, I installed them and realised that they cut off the air intake on my upside-down PSU. So I ordered a different power extension cable that bends the other way, and reinstalled the PSU the right way up. To do this I had to take apart basically the entire thing. I swapped the orientation of my CPU cooler fan, and rearranged some of the cables to be a bit neater, too. I’m super happy with it.
Now I can really think about the software I want to run on this thing. Before going into the specifics, I wanted to start out with my use case. What am I really here for? There are lots of things that I ultimately would like to host myself, but to start out with, I have identified some low-hanging fruit that will allow me to lay a foundation:
- Nextcloud: A cloud storage application that is free for self-hosting. There are apps for every platform I care about, and this should allow me to replace Google Drive and iCloud Drive with my own infrastructure.
- Jellyfin: The whole reason I bought an Intel CPU. I want to be able to store and serve my movie collection to my Apple TV without having to deal with streaming services. I also have a local music collection that is rather large, and Jellyfin will happily serve that, too.
- macOS backups: It seems the correct way to support remote Time Machine backups these days is SMB, so very well: an SMB share for the three macOS devices in the household.
- CI: I have been running my own Forgejo instance on my VPS for some time, and I’d like to have a few things build on it. That server is too wimpy to be useful for that, but this one? I think we could be onto a winner.
Summary
This post is long and dense, so I’m back up at the top now to outline the decisions I made, and where I am currently with this setup. Just in case you want to skip to the good parts:
- I am running Fedora Server 42
- I am booting from my NVMe SSD, which is formatted (after the EFI system partition and friends) with an LVM volume group that manages a few logical volumes for my root, home, and var
- I am running two 12TB HDDs as a mirrored VDev in a ZFS pool. No other VDevs for optimisation.
- I am planning to use the ZFS pool as data storage, and I am running my services and database servers from the NVMe drive.
- I’m very glad I’ve got 64GB of RAM for ARC.
- This post does not discuss backup plans. I am at pains to point out that redundancy is not a backup, though it does narrow the spectrum of scenarios in which you need to restore from one. I am actively developing an off-site backup plan, but I’m not quite ready with it yet. Hopefully nothing breaks in the next two months or so.
Bibliography
This post is not a primary source. I read and watched a lot of things to plan this, and it would be extremely remiss of me not to point you to the countless genuine experts who helped me make my decisions. I truly cannot thank them enough for taking the time to share wisdom on forums, Reddit, YouTube, blogs, and more. Below I’ve attached some of the articles and other materials that I found particularly useful while researching all of this. All of it affected this post, some in more diffuse ways than others. In every case, learning about this stuff has been an absolute joy, in huge part thanks to the efforts of the community.
Outside of any of the below sections, I have consulted the book Operating Systems: Three Easy Pieces. I originally bought this to teach me some stuff about locking, but it has earned its stripes here too, particularly chapter 42.
(Official) Docs
- The entire System Administration page on OpenZFS’ official wiki is worth reading. It covers common workloads, a design overview, the works. Great reference, and not too long.
- ArchWiki is a fantastic resource in general, and this is no exception. The articles on Partitions, File Systems, and RAID were particularly useful for me.
- I spent a lot of time reading a lot of man pages, especially at first. In particular, filesystems(5) and lsblk(8) pointed me in the right direction.
- The RHEL documentation is often directly relevant for Fedora, especially in cases like this. The entire chapter on LVM was super helpful.
YouTube Videos
- Level1Techs has a lot of really great stuff, especially for novices. This intro to ZFS gave me some historical context (which I always find helpful), and really oriented me. Later on, I watched a video they made about Optane that I found super helpful.
- From there, this (90 minute) conference talk by George Wilson and Matt Ahrens from Delphix gave me some firm background. It’s from 2018, but ZFS hasn’t changed that much in the intervening seven years.
- 45Drives have a huge number of excellent videos about ZFS. Their Tuesday Tech Tips series is full of surprisingly in-depth videos. Here are a couple I particularly enjoyed: ZFS Read & Write Caching, and Accelerating ZFS Workloads with NVMe Storage. 45Drives also spent a good 40 minutes designing a ZFS pool for a mixed workload. The scale is not relevant to me, but the aims definitely were.
- There’s a lot of old knowledge that has been cargo-culted around ZFS for a long time. This video from SpaceRex really helped clear up the value of L2ARC for me.
- Perhaps the best thing that SpaceRex video gave me was a link to a truly excellent talk at SDC 2019: Best Practices for OpenZFS L2ARC in the Era of NVMe.
- Lawrence Systems have also produced a number of good videos on ZFS. The introductory ZFS is a COW is very good, as is the much more recent conversation with Alan Jude: Is ZFS Ready for NVMe Speed?
Reddit and Forum Threads
The one thread I want to call out in particular is What are the bad things about zfs?. I think anyone considering ZFS should give this a thorough read.
Beyond that, the Level1Techs and Lawrence Systems forums have been very valuable resources for specific questions. I haven’t yet had a question that hasn’t already been asked.
Operating System
I decided on Linux pretty early. While I’m certainly BSD-curious, I just migrated to Linux on my desktop, and I like the idea of consistency. Furthermore, a familiar stack while I’m navigating a huge number of utterly new things is a welcome respite. Of course, deciding on Linux is like deciding on “Asian food” for dinner. There is no such thing, and the categorisation is so broad as to be almost useless. So let’s narrow it down a little. I come from the Red Hat tradition of Linux, in that I have typically run Fedora on clients and CentOS on servers. I know the tooling on these very well. I even quite enjoy SELinux. Having said that, I was not necessarily married to that world, either. I had a few options in mind:
- Fedora: Familiar, and matches my desktop. Frequent updates (modern tools, but the annoyance of upgrading), well-maintained and well-supported.
- Rocky Linux: Equally familiar. Longer term support, so less annoying and less modern, which is annoying in a different way.
- Debian: Major new version just came out, so everything is pretty up to date. Community-driven, which is a huge tick.
- Arch: A genuine consideration, since I plan to move to Arch on my desktop, and I enjoy consistency. I am also embracing control, which Arch grants to some extent.
I ruled out Arch fairly early on. I don’t want to be fiddling with the OS that much. I’m prepared to tolerate the at-least-annual updates of Fedora, because I know that they are relatively straightforward. But the level of tinkering required for Arch, and the ongoing management, is difficult for me to justify for a server. I’d also like to unify the platform between my VPS and my home server, and while DigitalOcean does support custom VM images, that seems like a whole other kind of palaver that I don’t really want.
The next to go were Debian and Rocky. Where Arch is a bit much, these are a bit too conservative: I already ran into frustration setting up a modern Rocky Linux server a few weeks ago. I really value having modern versions of things in my package manager. I’m not running a massive fleet here; it’s fine for me to be a little less conservative about what I install.
That left Fedora. It’s not exactly a ringing endorsement to simply be the last one standing, but I do genuinely enjoy this distribution. I like that everything in it is promptly updated, and I like the combo of firewalld and SELinux. I will also be leaning heavily on both Podman and Ansible, both of which are backed by Red Hat, and are hence excellently supported on Fedora.
While the currently live version of Fedora is 42, the Fedora 43 beta was released only a few days before I wrote this. I gave pretty serious consideration to starting there and saving myself that first update. I eventually decided not to. I felt that it was better to start out on 42 and have the experience of an upgrade early in this system’s life, mostly to decide if I want to keep doing it.
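For what it’s worth, the upgrade itself should be short. This is a sketch of the documented Fedora offline-upgrade flow at the time of writing, not something I have run on this machine yet (the release number assumes I’m jumping to 43):

```
# Bring the current release fully up to date first
sudo dnf upgrade --refresh

# Download the next release's packages, then reboot into the offline upgrade
sudo dnf system-upgrade download --releasever=43
sudo dnf system-upgrade reboot
```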
Storage Configuration
I’m going to talk in a lot of detail about how I’ve chosen to arrange my storage. This is because there is a lot that goes into these decisions, and I find it really fascinating. I’ve learned a stunning amount about all this just in the last week or so, and I’m very excited to share it. I’m going to start out discussing the theory of how drives are arranged and organised, before running through how these are managed on a Linux system through the command line. Then I’ll talk about the file systems available to me, and why I made the choices I did.
Block Devices, Partitions, and Volumes
When I installed Linux for the first time ever, I had to partition my MacBook’s hard drive. That meant shrinking my HFS+ partition and pre-allocating space for an ext3 partition, as well as a swap partition. I was walked through this process by a very patient friend, and I did not understand what I was doing. Cut me some slack, I was but fourteen summers old. Things have changed a little since then, and for the better. That old 160GB HDD — and every storage device I’ve interacted with since — is exposed to Linux as a block device.
Interlude: Device Files
Unix loves the idea of presenting interfaces to things as files. Processes? Files. Streams? Files. Files? Believe it or not, also files. Devices are no exception to this rule. A device file is a special file that programmes can interact with through the standard read and write syscalls, mediated by a device driver. A device can be any number of things, like a printer, or a serial device, or a storage device. In Unix systems, device files live under /dev.
Devices are varied, and the things they do are equally varied; however, they typically fall into a collection of categories. The operating system would like to be able to control as many classes of device as possible, and likewise devices would like to remain as OS-neutral as possible. To achieve this, device files were developed as a means of abstraction. There is a piece of software in the operating system that knows the details of how a device works. We call such a piece of software a driver. The other end of that driver — at least in Unix systems — is a device file. This exposes a standardised file-like interface to the operating system. This interface comes in one or both of two flavours: block, or character. The exact distinction has something to do with how they are read from and written to, but this has been enough of a diversion for someone who, someday, would quite like to set up their server.
In Linux specifically, storage devices, be they USB, SATA, NVMe, SAS, or anything else, are exposed as block devices, which is a raw way for the OS to interact with the device itself. To make this usable to a human, that device must be mounted. Mounting opens the device, interacts with it over a few standardised methods to work out the file system on the device, and then exposes that file system at a mount point, which is a directory. Different device types have different names under the /dev namespace. For example, my SATA drives will be /dev/sda and /dev/sdb, while my boot drive is /dev/nvme0n1.
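As a concrete example, this is how I tend to poke at those device files from a shell. The device names are the ones on my machine and will differ on yours:

```
# List every block device the kernel knows about, with sizes and mount points
lsblk

# A more targeted view: name, size, type (disk/part), file system, and mount point
lsblk -o NAME,SIZE,TYPE,FSTYPE,MOUNTPOINT /dev/sda /dev/sdb /dev/nvme0n1
```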
Partitions
Getting back on topic, block devices can be further abstracted into partitions. A partition is a way for a single block device to expose multiple sub-block devices. On a SATA drive, these are numbered like /dev/sda1, /dev/sda2, and so on. A partitioned drive allows you to run multiple file systems on that drive. This is essential for boot drives, because the boot process involves mounting increasingly sophisticated file systems until the operating system can be fully initialised. On secondary storage drives, partitioning is optional. As far as I can tell, you can just go ahead and format the entire storage device.
Partitions are pretty low level. To partition a block device, one must write a partition table at the start of it, which specifies regions that (oddly enough) partition the device. They are fixed in place. Their low-level nature is actually a feature: they can be read by UEFI firmware and BIOS. In fact, they are so low-level that you must choose the type of partition table (GPT or MBR) based on whether your motherboard uses UEFI or BIOS. This very accessibility can cause ergonomic issues when trying to actually use them for storage management, though. If you’re anything like me, you can imagine higher-level abstractions over these basic entities. If you’re anything like me, you’ve just motivated volumes!
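Before moving up a level of abstraction, here is roughly what writing a partition table looks like from the command line. I didn’t need to do this by hand (the Fedora installer handled my boot drive), so treat it as a sketch with a placeholder device name:

```
# Show whatever partition table is already there (if any)
sudo parted /dev/sdX print

# Write a fresh GPT partition table, then one partition spanning the whole disk
sudo parted --script /dev/sdX mklabel gpt
sudo parted --script /dev/sdX mkpart data xfs 1MiB 100%
```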
Volumes
A volume manager is a tool that presents an abstraction over physical storage, and then re-presents block interfaces that can be formatted and mounted like drives or partitions. Typically, one feeds a volume manager physical volumes — which are either the storage devices themselves or a partition thereon — and exports volumes: a block interface that is mediated through the volume manager itself. Over partitions, volumes have several benefits. Chief among them is perhaps the ability to easily grow and shrink volumes. But it is also possible to have volumes share all of the available space on the physical media and use it as required, as opposed to having to segment that space ahead of time. Beyond that, volumes may also support using their physical devices in different ways to facilitate either higher throughput or redundancy.
The volume manager that I have on my server is called LVM2, which stands for Logical Volume Manager (2). It came by default with Fedora Server, and is a very common default on servers and desktops alike. Out of the box, I was given a single volume group across (almost) my whole NVMe drive, and one logical volume was given to the OS for its root mount.
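As a sketch of how the pieces fit together (the volume group, logical volume, and partition names here are placeholders, not what the Fedora installer actually called mine):

```
# Mark a partition as an LVM physical volume
sudo pvcreate /dev/nvme0n1p3

# Group one or more physical volumes into a volume group
sudo vgcreate vg_system /dev/nvme0n1p3

# Carve out logical volumes; these can be grown later with lvextend
sudo lvcreate -L 50G -n root vg_system
sudo lvcreate -L 20G -n var vg_system

# Inspect the result
sudo pvs && sudo vgs && sudo lvs
```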
RAID
“Redundant Array of Independent Disks” (or RAID) is the name given to basically any technology that combines a collection of storage volumes into a single block device. This is done to improve performance, or to decrease the likelihood that one needs to restore from a backup. I am not going to go into detail on the different RAID levels and what they mean here, but as far as I can tell, all RAID (and RAID-like) implementations do a combination of three things:
- Striping: When you write to a RAID device, it may split that write into multiple smaller ones and write them simultaneously to multiple physical devices, thus increasing your throughput. Likewise, when reading data, it must be read from multiple stripes as well.
- Mirroring: When a device is mirrored, it means a write to that device must be replayed to multiple devices. As opposed to striping, this offers redundancy, but does not increase write throughputs, although it does allow data to be read faster.
- Parity: When writing to a RAID device with parity, it will also write parity information somewhere. This allows restoration of data in the event of a physical failure without requiring the 1:1 storage cost of mirroring, and is typically mixed into larger arrays. The parity blocks combine data within the array in various ways, such that the data itself can be rebuilt from the parity along with only some of the data. In general, the more storage space you give to parity, the more sophisticated the scheme, and the more individual device failures you can survive. (There’s a toy sketch of the idea just after this list.)
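To make the parity idea concrete, here is a toy sketch in shell arithmetic. Real arrays operate on whole blocks and use cleverer schemes than a single XOR, but the principle of rebuilding a lost member from the parity plus the survivors is the same:

```
# Two "data" blocks and their XOR parity (toy numbers, one byte each)
d1=$(( 0xA5 )); d2=$(( 0x3C ))
parity=$(( d1 ^ d2 ))

# Pretend the disk holding d1 died: rebuild it from the parity and the survivor
rebuilt=$(( parity ^ d2 ))
printf 'parity=%#x rebuilt_d1=%#x\n' "$parity" "$rebuilt"   # rebuilt_d1 == 0xa5
```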
In ye olde times, RAID was implemented on custom cards that slotted into servers. This is because parity computations were expensive enough — and the CPUs of the day slow enough — that it made sense to offload that calculation to a dedicated device. Today, those penalties are a lot less severe, and there are many, many software RAID implementations.
The classic implementation on Linux is mdadm, which constructs RAID arrays and exposes devices at /dev/md*. LVM2 also offers a RAID implementation, although under the hood it leans on the same kernel MD code that backs mdadm. In addition to the operating system and volume managers, more modern file systems also offer RAID. The file systems that do this also typically offer volume management, and that’s where this whole thing starts to get a lot muddier, from a conceptual point of view.
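For completeness, this is roughly what the classic route looks like. I’m not using mdadm on this build, so it’s a sketch with placeholder device names rather than anything I’ve run on the server:

```
# Build a two-disk RAID 1 (mirror) array exposed at /dev/md0
sudo mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdX /dev/sdY

# Watch the initial sync and check the array's health
cat /proc/mdstat
sudo mdadm --detail /dev/md0
```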
File Systems
A file system sits on top of a block storage device and exposes standardised APIs to the operating system to allow interacting with it in the way the OS expects. In Linux, the mount command is used to make a block storage device accessible at a certain location (or mount point). The file system will write a little bit of data in a standard place on disk to allow the device to be identified as belonging to that file system. After that, all bets are off, because the file system has direct access to read and write whatever it likes from the device.
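That identifying little bit of data is exactly what blkid reads. A quick sketch, with a placeholder partition:

```
# Ask what file system (and UUID) a block device carries
sudo blkid /dev/sdX1

# Make it accessible at a mount point, then detach it again
sudo mkdir -p /mnt/scratch
sudo mount /dev/sdX1 /mnt/scratch
sudo umount /mnt/scratch
```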
There are many file systems, and Linux supports more of them than most! For this story, I will be really focusing on three:
- XFS: Been around forever, often a default choice on server distributions, including RHEL and Fedora. Old school and relatively basic, but that means it’s outlived basically all of its issues. RHEL say it’s stable for multi-petabyte workloads, which is more than I can say for any software I’ve ever built.
- Btrfs: A newer, more modern file system that acts as a volume manager and RAID implementation. It’s the default on Fedora Workstation, but not on Server. It’s widely regarded as stable enough now; that has not always been true, and I’d avoid it for real production workloads, but I’d have no issues with it for my use case. Fun fact: Btrfs stands for the B-Tree File System. That would stand it in good stead with me, because I love B-trees, but it turns out basically all file systems use B-trees.
- ZFS: The last word in file systems. If there’s a modern file system out there, it has likely taken inspiration from this one. It was developed by Sun Microsystems, who open-sourced it before they were scavenged by Oracle. Today, there is OpenZFS, and a separate proprietary Oracle product. There is also licensing drama around ZFS on Linux, which I understand is part of why Btrfs exists.
After a lot of reading, these were my three options. This is a lot of options considering that I came into this basically certain I was going to use Btrfs for everything. The true lesson to take away from my entire experience here is that I had my mind turned around on this issue several times over the week or so that it’s taken me to pull this post together. I cannot recommend highly enough the act of learning: reading articles, papers, docs; watching videos, tutorials, lectures, conference talks. It is impossible to know going into it what will and will not be relevant. Study widely, it will pay dividends.
A full treatment of the internals of every file system I mentioned above should probably be out of scope of this post, so I’ll just say that I am using XFS on my NVMe drive for the OS, and have configured my HDDs as a ZFS pool.
Why XFS?
The primary reason I selected XFS is that ZFS is not a true part of the Linux kernel. Its out-of-tree implementation can lag a little behind new kernel releases, and mounting a root file system from ZFS is just a bit more hassle than I’m ready for. Defaults are powerful, and I trust XFS to do what I need it to do. Having said that, I would prefer a more sophisticated file system with some more advanced features. Btrfs and ZFS both offer that, and I’ll come back to why I’m using XFS for anything at all later on, in the “Why not ZFS?” section.
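Formatting with XFS is refreshingly boring, which is sort of the point. A sketch, reusing the hypothetical volume group from the LVM example earlier and a made-up logical volume called scratch:

```
# Put XFS on a logical volume and mount it
sudo mkfs.xfs /dev/vg_system/scratch
sudo mkdir -p /srv/scratch
sudo mount /dev/vg_system/scratch /srv/scratch

# XFS can be grown (never shrunk) while mounted, e.g. after extending the LV
sudo lvextend -L +10G /dev/vg_system/scratch
sudo xfs_growfs /srv/scratch
```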
Interlude: Journaling vs Copy-on-Write
Writing to disk has a serious problem: it touches the real world, and the real world sucks. It’s all nice to sit in our theoretical world where computers work, but in the real world they only barely work. And it is important for a file system to not commit nonsense to disk in the event of a sudden loss of power. How should a file system do this?
It turns out that file systems — like databases — often use a write-ahead log to track the integrity of data in the case of a crash. When a write to disk is executed, the data and its metadata must be written to the journal beforehand, and only once safely recorded there can they be written to disk proper. If power is lost part-way through, the log can be replayed on the next mount and data is not lost. The journal has defined start and end blocks, and is written strictly in order so that partial writes to the journal can be detected and ignored. In that case, the write simply failed.
This raises a fairly significant issue: do we not have to write data twice? Yes, actually. And unlike a database, this is not a log-structured system where the log is in fact the record: the log is only a contingency, and the data still has to be flushed to its final location as well. This problem is severe enough that most journaling file systems I know of don’t write the data twice. Instead, they just write the metadata to the journal. That is all XFS ever does, and the file systems I know of that can write data to the journal have options to turn it off.
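Since XFS only ever journals metadata, ext4 is the clearer illustration of that knob: it exposes the choice as a mount option. A sketch with a placeholder device, the two mounts shown as alternatives rather than something you’d run back to back:

```
# ext4's default: journal metadata only, write file data in "ordered" mode
sudo mount -o data=ordered /dev/sdX1 /mnt/scratch

# The belt-and-braces alternative: file data goes through the journal too
sudo mount -o data=journal /dev/sdX1 /mnt/scratch
```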
Copy-on-Write is an alternative way to protect data from sudden and catastrophic failures. Under such a scheme, data is never overwritten and so cannot be corrupted. That is to say that every write that takes place is a write of new data. Once that write has been committed, a pointer can be atomically updated and the “update” is complete, and the old address can be cleaned up if nothing else is pointing to it.
One issue that faces our naïve copy-on-write implementation here is performance, especially over time. By forcing all writes to take place on new data, we are inviting terrible fragmentation issues, especially on HDDs where the cost of random reads and writes is so much steeper. There are a few solutions to this:
- By far the easiest option is to not support HDDs! This is a choice Apple made with APFS, which is a CoW file system. They later added support for HDDs by including a defragmentation tool.
- Another option is to batch writes in a faster medium, like main memory. This allows our file system to buffer writes, then batch them to make them more sequential.
As an aside-on-the-aside here, the block-based nature of copy-on-write file systems also makes it easier to support transparent compression and the convenient creation of snapshots. For a snapshot, the old state of the tree simply needs to be preserved, and any blocks it points to must not be freed.
Introducing ZFS
ZFS is a copy-on-write file system, designed to safely store large amounts of data in ways that do not really require one to trust the hardware on which it is running. ZFS is awesome. Here are a few highlights, with a quick command-line sketch after the list:
- ZFS uses integrated checksums to validate the integrity of each block.
- Where possible, ZFS will use those checksums to restore data when it detects errors.
- ZFS builds in a volume manager and RAID implementations.
- ZFS transparently compresses data.
- ZFS can encrypt data.
- ZFS has built-in support for replication.
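Here is a quick sketch of what a few of those look like day to day. The pool name (tank), dataset name, and backup host are all placeholders:

```
# Transparent compression and encryption are per-dataset properties
sudo zfs set compression=lz4 tank/media
sudo zfs get compression,compressratio tank/media

# Checksums are verified on every read; a scrub walks and verifies the whole pool
sudo zpool scrub tank
sudo zpool status tank

# Snapshots are cheap, and can be replicated elsewhere with send/receive
sudo zfs snapshot tank/media@2025-09-27
sudo zfs send tank/media@2025-09-27 | ssh backup-host zfs receive backuppool/media
```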
ZFS Architectural Concepts
ZFS exposes its data as a storage pool called a ZPool. This is what you mount, read from, write to, et cetera. Inside a ZPool are one or more virtual devices — notional storage devices — called VDevs. Each VDev is made up of a number of actual physical devices. Redundancy is configured at the VDev level. So we might have six HDDs in a RAID pattern exposed as a single VDev. ZFS will stripe reads and writes across all of the ZPool’s data VDevs with no redundancy at that level, so if you want redundancy, it is very important to configure it within each VDev.
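My own pool is about as simple as this gets: a single mirrored VDev made of the two HDDs. A sketch (the pool name is a placeholder, and in practice you’d want the stable /dev/disk/by-id/ names rather than sda and sdb, which can change between boots):

```
# Create a pool with one mirrored VDev across the two 12TB drives
sudo zpool create tank mirror /dev/sda /dev/sdb

# Show the pool layout, VDev by VDev
sudo zpool status tank
```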
I snuck a new term in up there without defining it: I specified “data VDevs”. When you add a new VDev to your ZPool, you can configure it to be used for a number of special purposes. Beyond storing data, these other roles are all about increasing performance. This is often necessary because ZFS is slow by design, owing to its thorough nature. To compensate, ZFS has a number of optimisations, some of which are optional and exposed through VDevs.
Beyond your storage architecture and multiplexing data access through stripes, the primary optimisation layer is an in-memory cache called the ARC, which stands for Adaptive Replacement Cache. This can use a lot of memory, but should only use memory that would otherwise go unused on the system, and even then only up to half of it by default. It is a fairly straightforward cache of blocks that have been read from disk. Operating systems have this built in; it’s called the page cache. ZFS has its own because the file system knows its own implementation better, and so can provide a more specialised solution. In particular, ARC maintains a list of most-recently-used (MRU) blocks. This is very standard for caches of all kinds, but such an implementation is vulnerable to scans larger than the entire cache — in a database context, think about updating a value on every row in a table. To combat this, ARC also maintains a most-frequently-used (MFU) list, and to be evicted from the ARC, a block has to fall off both lists. The memory within ARC is also used to batch writes to disk, which improves throughput and response times.
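On Linux, the ARC’s ceiling is exposed as a kernel module parameter, so it’s easy to inspect or cap. A sketch (the 16GiB figure is just an example, not something I’ve actually set):

```
# Current ARC size and its configured maximum, in bytes
grep -E '^(size|c_max)' /proc/spl/kstat/zfs/arcstats

# Cap the ARC at 16GiB at module load...
echo 'options zfs zfs_arc_max=17179869184' | sudo tee /etc/modprobe.d/zfs.conf

# ...or change it live (takes effect immediately, lost on reboot)
echo 17179869184 | sudo tee /sys/module/zfs/parameters/zfs_arc_max
```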
However, ARC is not always enough, and that’s where non-data VDevs come in. In particular, a VDev may be configured as one of the following (there’s a command sketch for adding each of them after the list):
- L2ARC: This is Level 2 ARC, and it’s basically just more ARC. In normal running, if there is an L2ARC VDev present, the ARC is constantly feeding the entries at the tail of its lists into the L2ARC. That way, if a block gets evicted from the ARC, it might still be caught in the L2. It seems that L2ARC is a little simpler than the ARC, in that it is a straightforward ring buffer: new entries bump the oldest from the cache.
- SLOG: The write path I described above is what happens, but a write is only acknowledged straight away when it is tagged as asynchronous. For so-called synchronous writes, the data must be persisted to a stable storage medium before it can be acknowledged as saved. Writes are still batched, and making the client application wait up to five seconds for the write buffer to be flushed is not really an option. So ZFS writes this data to a persistent journal called the ZIL (ZFS Intent Log). By default, this lives in the data array. That can be slow, and it gunks up other activity on the disks, especially HDDs, because they can only do one thing at a time. A SLOG is a separate log device for the ZIL, allowing that stream to be written much more quickly, and hence dramatically speeding up synchronous writes.
- Special VDev: When a Special VDev is present, metadata (and optionally very small files) are written there instead of to the main data array. This can have a huge impact on high metadata workloads, like list-heavy workflows.
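For reference, this is roughly what bolting each of these onto an existing pool looks like. I haven’t run any of it; the pool and device names are placeholders:

```
# L2ARC: a single fast device is fine, since losing it loses nothing but cache
sudo zpool add tank cache /dev/nvme1n1

# SLOG: mirror it, because an unreplayed ZIL is data you promised was saved
sudo zpool add tank log mirror /dev/sdc /dev/sdd

# Special VDev: must be redundant, because losing it loses the whole pool
sudo zpool add tank special mirror /dev/sde /dev/sdf
```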
These are all very cool, and each of them scratches a very particular nerd itch for me. However, they all also carry their own issues or complexities. These trade-offs have led me to integrate none of them into my home server right now. I am going to keep an eye on performance, and see what I can implement in the future to help alleviate issues that I do end up hitting.
There are a couple of issues that affect all three of these optimisations. First of all, each of them depends on having a storage device (or devices) considerably faster than HDDs. In 2025, that means SSDs: either SATA or NVMe. However, NAND flash has a literal fatal flaw: it burns out when you write to it a lot. A lot of these workloads are write-heavy, and I am wary of burning out any drives I might use for these purposes. In the enterprise, there are specialised devices that don’t burn out, but I don’t feasibly have access to Optane or NVDIMMs at home. Of the three, the SLOG is the one that gives me the most pause here.
There are also some specific things to consider. For instance, L2ARC is useless if you don’t have data that is routinely falling out of ARC. I am anticipating very low workloads, and I have a lot of RAM considering my needs. ZFS includes excellent telemetry about ARC hits and misses that should allow me to diagnose issues that L2ARC might solve.
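That telemetry is easy to get at: arcstat and arc_summary both ship with OpenZFS. A sketch of what I plan to keep an eye on:

```
# One line of ARC hit/miss statistics every 5 seconds
arcstat 5

# A full breakdown of ARC (and L2ARC, if present) behaviour
arc_summary | less
```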
Special VDevs can be a huge accelerant for a number of workloads, especially for my SMB share and Nextcloud deployment. However, the metadata is the key to all the rest of the data in the storage pool! If that is lost or goes bad, the data on my HDDs is meaningless. Therefore, it becomes a potential point of failure for the entire ZPool. As a reminder, my server has two HDDs and one NVMe SSD. I’m not comfortable offloading a crucial component of my storage to only a single device.
In the meantime, my current idea is to buy a couple of smaller SATA SSDs for my SLOG (these are very inexpensive), and a couple of larger ones for my Special VDev, arranged in some redundant way. This comes with a few snags, though. My motherboard only has four SATA ports, so I would need to buy a host bus adapter to add more. Adding four SSDs would also take up two of my drive enclosures, limiting me to only four HDDs. I think this would be okay, but it is worth noting. These SSDs would also suffer from the “fatal flaw” I mentioned above, though I’m less worried about wearing out separate devices that are not used for booting, especially when they are as cheap as they are.
Why not ZFS?
At this moment, I am deploying a mirrored VDev for my HDDs, and nothing else. Because I am lacking any other optimisation layers, there are some workloads that I am reluctant to put on ZFS right now.
Chief among these is my database. I am planning to deploy a single PostgreSQL instance to store data for all of my services. I like PostgreSQL, and I don’t particularly care for running databases in containers, so this suits me. However, I want it to be performant and not give me headaches. Like any good database, PostgreSQL demands synchronous writes, so on ZFS today my only option would be to have it completely dependent on the speed of my spinning drives. Given that I’ve got an NVMe drive just sitting there, that feels like a horrible waste. As such, I am keeping XFS on that drive and running my database workloads there, with frequent backups to my ZPool.
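The backup half of that plan is pleasantly boring. A sketch, assuming a database called nextcloud and a ZFS dataset mounted at /tank/backups (both names hypothetical):

```
# Dump one database in PostgreSQL's compressed custom format onto the ZPool
sudo -u postgres pg_dump --format=custom --file=/tank/backups/nextcloud-$(date +%F).dump nextcloud

# Restoring later is the mirror image
sudo -u postgres pg_restore --dbname=nextcloud /tank/backups/nextcloud-2025-09-27.dump
```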
And that’s about it. If you read this entire thing, I hope you both enjoyed yourself and learned something. If not, I’m sorry! Although this post was very long, you have nobody to blame except yourself for reading it all. Thank you very much, and I’ll see you next time, when I’ll probably be writing about databases.