Project Sunshine, Part 5: Backing It Up
2026-05-22
I want to open this post with what I decided to do. The narration, embellishment, and context for these decisions come at the end.
My home server (Orca) runs a boot volume formatted as XFS, and sports two hard drives in a mirrored ZFS pair. All of my ZFS datasets are encrypted with encryption keys stored on the root volume, and the root volume is encrypted via LUKS. When it boots up I have to enter a passphrase for that root volume, and the passphrases for the ZFS datasets are stored in files on the boot volume. I have a PiKVM for entering the passphrase (and lots of other things).
My primary remote backup is a BeeLink ME Mini PC (Oyster) that lives with my parents in the UK. This runs FreeBSD on its onboard eMMC, and I have a ZFS storage pool there comprising three 4TB NVMe devices in a RAIDZ1 arrangement. Oyster does not really store any data of its own, but rather acts as a receive-target for Orca’s ZFS storage pool. Replication from Orca to Oyster is managed through the zrepl utility. It is configured to run hourly, and to keep 24 hourly backups and 30 daily backups.
In addition to the continuous online remote backup, I also have a drive dock that holds a HDD connected to Orca over USB. Every two weeks, my ZFS datasets are replicated to that drive, and it is taken to a friend’s house over sneakernet, where I swap it for its counterpart, which gets plugged in when I am back home. I had initially wanted to use tape for this (because it’s what tape is for and there is an undeniable “cool” factor there for me), but honestly if you’ve seen the prices of LTO tape drives, you’ll understand why I didn’t do that. They’re not aimed at home users.
That accounts for ZFS replication. An operating assumption for the rest of this is that once something is on Orca’s spinning rust, it’s safe. It’s internally redundant, and it’s replicated to two different places, one of which is on a different continent.
However, as I mentioned above, ZFS is not the whole story of Orca, let alone my entire self-hosted universe. While my process so far doesn’t solve that problem, it does reduce it to the problem of getting data into my main ZFS pool. And my solution to that problem is Restic. I have Restic configured to run hourly, and to back up a collection of target directories onto the Zpool.
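A minimal sketch of what an hourly Restic job could invoke, written as a small Python wrapper that only builds the command line. The repository path, password file, and target directories are hypothetical placeholders rather than the real layout on Orca; the flags themselves (`--repo`, `--password-file`, `backup`, `--tag`) are standard Restic options.

```python
import shlex


def restic_backup_command(repo, password_file, targets, tag="hourly"):
    """Build the argv for a single `restic backup` run."""
    if not targets:
        raise ValueError("at least one target directory is required")
    return [
        "restic",
        "--repo", repo,
        "--password-file", password_file,
        "backup",
        "--tag", tag,
        *targets,
    ]


# Hypothetical paths, for illustration only.
cmd = restic_backup_command(
    repo="/tank/backups/restic",
    password_file="/root/.restic-password",
    targets=["/etc", "/var/lib/example-app"],
)
print(shlex.join(cmd))
```

Something like a systemd timer or cron entry would then run the resulting command every hour. Because the repository sits on the ZFS pool, whatever Restic writes is picked up by the ZFS replication described above.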
It’s not as simple as 3-2-1
You can’t get very far in self-hosting without hearing about how your backups should always use the 3-2-1 method. The rule goes like this:
- (3) There should be at least three copies of the data,
- (2) stored on two different types of storage media,
- (1) and one copy should be kept offsite, in a remote location.
If you do this, you’re probably in good shape. However, I am a little concerned that I see this mindlessly parroted by people instead of doing actual analysis. When you introduce an (admittedly snappy) heuristic as a proxy for genuine analysis, you risk introducing complacency, which can have nasty second-order effects.
Ultimately, what counts as good backup advice depends heavily on your requirements, and your tolerance for issues. Hardware failure is a rare enough event for me that I am okay with downtime and losing a Saturday to fix it, but I’m not okay with data loss. There are environments (like my job) where the trade-offs here are very different!
The backups industry has noticed and tried to fix some of these issues. If you search for “3-2-1 backup” today, you’ll find articles espousing the benefits of 3-2-1-1, or 3-2-1-1-0. And look. At some point mnemonics cease being useful and start being distractions, and when most permutations of small numbers are valid backup doctrines, I think we’re past that point. Data security is a complex topic, and you can’t take shortcuts here. Serious people have to think seriously about it all the time, and that work deserves more than a sentence.
Analyse Failure Modes
I don’t want to be all bad news and curmudgeonly here, so I do actually have a guiding principle that I think can be used to derive a good backups practice. It’s quite simple:
What failure modes do I want to be resilient to, and what failure modes can I tolerate?
That’s it, basically. There are other concerns that typically involve budget, but broadly speaking everything else falls out of this question. Some failure modes that we should all probably consider:
Ransomware
This is a big one that 3-2-1 fails to account for, and the major improvement that its descendants try to solve. If all of your data stores are accessible online, there is a theoretical risk of access to it being removed and ransomed back to you. While you can and should make reasonable effort to secure your infrastructure, the systematic solution to ransomware attacks is to store one complete copy of your data completely offline.
This carries complexity. If it’s offline, it cannot be realistically kept up to date in the same way as your other backups. Retrieval will necessarily be more expensive and/or annoying. But that’s okay for a failsafe, and it is vital. If you are storing anything anywhere near the internet, I don’t think ransomware protection is optional.
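To make the gap concrete, here is a small sketch that models a set of backup copies and checks them against both the 3-2-1 rule and the separate question of whether any copy is offline. The `Copy` type, its field names, and both example inventories are invented for illustration and do not describe any real setup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    name: str
    medium: str    # e.g. "hdd", "nvme", "tape"
    offsite: bool  # stored somewhere other than the primary location
    online: bool   # reachable over a network or permanently attached


def is_3_2_1(copies):
    """Three copies, two kinds of media, at least one copy offsite."""
    return (
        len(copies) >= 3
        and len({c.medium for c in copies}) >= 2
        and any(c.offsite for c in copies)
    )


def has_offline_copy(copies):
    """True if at least one copy cannot be reached by an online attacker."""
    return any(not c.online for c in copies)


# Hypothetical inventories.
all_online = [
    Copy("primary", "hdd", offsite=False, online=True),
    Copy("remote replica", "nvme", offsite=True, online=True),
    Copy("cloud bucket", "object-store", offsite=True, online=True),
]
with_offline = all_online[:2] + [
    Copy("rotated drive", "hdd", offsite=True, online=False),
]

print(is_3_2_1(all_online), has_offline_copy(all_online))      # True False
print(is_3_2_1(with_offline), has_offline_copy(with_offline))  # True True
```

Both inventories pass 3-2-1, but only the second would survive an attacker who can reach everything that is online.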
Physical Emergency
This covers your house fires, tornadoes, floods, as well as incidental damage caused by children, pets, enemies, and suchlike. The aim for me is that anything in my fleet can be completely replaced just by spending an affordable amount of money. If my home server were to spontaneously atomise, that would suck, but I would have lost nothing barring the hardware. With enough time and money, I can recreate the exact same state from before the disaster.
There are limits here. If a disaster affected my apartment in downtown Toronto, and simultaneously affected my parents’ house in the East Midlands of England, and simultaneously my friends’ apartment in a different part of Toronto, then my data is in trouble. But guess what. If that happens I am almost certainly dealing with a set of circumstances that is dramatically affecting the rest of my life, to the extent that I am probably less concerned with my music library, or even my photos.
Any backup strategy is optimising around probabilities and lifetimes, and you’ll find similar trade-offs in any engineering discipline. I am protecting against extremely bad luck. I am not protecting against worse than that. By the time I’m risking data loss, several people I care about are likely imperilled. That’s my risk tolerance.
The Human Touch
They say not to ascribe to malice what can be adequately explained by stupidity. A system of any kind is not a system if it does not plan for and handle people in the system making mistakes. We are all making small mistakes all the time! We are deeply imperfect, completely non-deterministic, and fallible at all levels. Does the system survive me typing the wrong commands? Does it survive me accidentally deleting something important? Does it survive me not doing something that might be necessary?
I recommend thinking deeply about this. Generally speaking you want to automate as much as you can, including the monitoring of the health of any backups, and have retention policies that keep at least some history. The exact shape of that is absolutely open, but a sensible default is hourly for 24 hours, daily for a month or so, and (perhaps) monthly beyond that.
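A retention policy of that shape can be sketched in a few lines. The function below is an illustrative approximation and not how zrepl or Restic implement pruning: for each granularity it keeps the newest snapshot in each distinct hour, day, or month until a quota is used up. The quota of 12 monthly snapshots and the example timestamps are assumptions made for the demo.

```python
from datetime import datetime, timedelta


def select_keepers(snapshots, hourly=24, daily=30, monthly=12):
    """Return the set of snapshot timestamps to keep.

    For each granularity, walk the snapshots newest-first and keep the
    newest snapshot in each distinct bucket until the quota is reached.
    """
    rules = [
        (hourly, "%Y-%m-%d %H"),
        (daily, "%Y-%m-%d"),
        (monthly, "%Y-%m"),
    ]
    ordered = sorted(snapshots, reverse=True)
    keep = set()
    for quota, bucket_format in rules:
        seen = set()
        for ts in ordered:
            bucket = ts.strftime(bucket_format)
            if bucket in seen:
                continue  # a newer snapshot already represents this bucket
            if len(seen) >= quota:
                break     # quota for this granularity is exhausted
            seen.add(bucket)
            keep.add(ts)
    return keep


# Sixty days of hourly snapshots ending at an arbitrary example time.
newest = datetime(2026, 5, 22, 12, 0)
snapshots = [newest - timedelta(hours=h) for h in range(60 * 24)]
keepers = select_keepers(snapshots)
print(len(snapshots), "snapshots,", len(keepers), "kept")
```

Anything not in the returned set is a candidate for deletion. The point of layering the rules is that recent mistakes can be undone at fine granularity while older history stays cheap to hold on to.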
Hardware Failure
HDDs die. SSDs die. All components die, sometimes spectacularly. Almost all components give warning signs before they fail. The really nasty part of drive failure in particular is that it is easy to believe that running a redundant setup protects against component failure. It does, but there is some serious nuance here. Suppose you have a mirrored pair of drives, and one of them croaks. You can add a new drive to the array, but that means it has to be rebuilt from the other drive in the mirror. This involves reading everything on that drive, which is a lot of work! Rebuilding an array is very demanding for the other drives in it, which increases the odds of one failure finding another.
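As a back-of-the-envelope illustration of why a full read of the surviving drive matters, the sketch below estimates the chance of hitting at least one unrecoverable read error during a rebuild. Everything here is an assumption made for illustration: the per-bit error rate is a figure of the order found on drive datasheets and not a measurement, errors are treated as independent, and the model ignores mechanical wear and says nothing about how a filesystem reacts to such an error.

```python
import math


def p_read_error_during_rebuild(bytes_to_read, ure_per_bit):
    """Probability of at least one unrecoverable read error while reading
    `bytes_to_read` bytes, assuming independent errors at a fixed per-bit rate."""
    bits = bytes_to_read * 8
    # 1 - (1 - p)^bits, computed via log1p/expm1 to stay accurate for tiny p.
    return -math.expm1(bits * math.log1p(-ure_per_bit))


TB = 10**12
for size_tb in (1, 4, 12):
    p = p_read_error_during_rebuild(size_tb * TB, ure_per_bit=1e-15)
    print(f"{size_tb:>2} TB read at 1e-15 errors/bit: {p:.2%}")
```

The absolute numbers depend entirely on the assumed rate, but the trend is what matters: the more data a rebuild has to read from a single surviving drive, the more likely it is to trip over something.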
Beyond plain redundancy, I also automatically track the health of all of my drives, using metrics from both ZFS and SMART. This is not a perfect solution. Redundancy beyond a two-way mirror (a three-way mirror, say) could roughly halve the rebuilding workload on each surviving drive, for instance. But for me, with my budget and space constraints, as well as my ability to spend a frustrating weekend restoring from a backup if a rebuild fails, this is the appropriate compromise.
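One simple building block for this kind of monitoring is `zpool status -x`, which prints the single line "all pools are healthy" when nothing needs attention. The sketch below wraps that check; the function names are made up, the degraded sample is an abbreviated stand-in for real output, and how the result reaches a dashboard or alert is left open. It is not a description of the metrics pipeline actually in use.

```python
import subprocess

HEALTHY_MESSAGE = "all pools are healthy"


def pools_look_healthy(status_output):
    """Interpret the text printed by `zpool status -x`."""
    return status_output.strip() == HEALTHY_MESSAGE


def check_pools(run=subprocess.run):
    """Run `zpool status -x` and return (healthy, raw_output).

    `run` is injectable so the logic can be exercised without ZFS installed.
    """
    result = run(
        ["zpool", "status", "-x"],
        capture_output=True,
        text=True,
        check=False,
    )
    return pools_look_healthy(result.stdout), result.stdout


# Exercising the parser on sample text (the degraded sample is abbreviated).
print(pools_look_healthy("all pools are healthy\n"))
print(pools_look_healthy("  pool: tank\n state: DEGRADED\n"))
```

A scheduled job could call `check_pools()` and send a notification whenever the first element of the result is false. SMART data would need a similar wrapper around `smartctl`.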
An ounce of prevention is worth a pound of cure! I chose good components. That means well-reviewed SSDs, and hard drives advertised for NAS usage. I talked a lot more about component selection in part two of this series.
Bit Rot
Beyond catastrophic failure, things can be corrupted, sometimes without you knowing. There is a dizzying array of weird stuff that happens rarely, but often enough that over a number of years, running 24/7, they become a guarantee owing to the law of large numbers.
ZFS is an example of a filesystem that builds in checksums of its blocks to find these corruptions, and can even automatically recover from some of them. This was a factor in my choice of filesystem. For memory, ECC RAM uses error-correcting codes to detect and correct bit errors. With memory prices being what they are right now, the average home user cannot really afford to be picky here. Even in the halcyon days of mid-2025, I couldn’t stomach the additional cost and inventory complexity of components with ECC support given my tolerance for cost, power consumption, and noise.
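The idea behind checksummed, self-healing reads can be shown with a toy model. This is not how ZFS is implemented (it has its own checksum algorithms and on-disk layout); SHA-256 from the standard library is used purely for convenience, and the two-element list stands in for the two sides of a mirror.

```python
import hashlib


def checksum(block: bytes) -> str:
    return hashlib.sha256(block).hexdigest()


def read_with_self_heal(copies, expected):
    """Return a copy whose checksum matches, repairing bad copies in place.

    `copies` is a mutable list holding the same block as stored on each
    side of a mirror; `expected` is the checksum recorded at write time.
    """
    good = next((c for c in copies if checksum(c) == expected), None)
    if good is None:
        raise IOError("every copy failed verification; restore from backup")
    for i, c in enumerate(copies):
        if checksum(c) != expected:
            copies[i] = good  # overwrite the corrupted copy with verified data
    return good


block = b"family photos, 2019"
expected = checksum(block)                # recorded separately from the data
mirror = [b"family photos, 2O19", block]  # the first side has silently rotted

data = read_with_self_heal(mirror, expected)
print(data == block, mirror[0] == block)
```

Without the stored checksum, a reader has no way of telling which side of the mirror is right. With it, the bad side can be detected and repaired from the good one, and a total failure is at least reported loudly.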
Multiple backups help here, but any good data-level backup will faithfully reproduce errors that are already present in the source. That’s why I have logical backups of my primary database cluster, and of anything else where that makes sense, before delegating to lower-level ZFS-based backup beyond that.
Theft
Storing your backups in another location means that someone else will likely have access to it at some point, which is probably not a thing you want. When it comes to protecting data, we generally reach for encryption, and then we reach for other forms of protection, like physical security. Encryption is not a simple answer so much as it’s a fractal problem. You can encrypt your data, but storage of the credentials for that encryption becomes the same problem in microcosm.
If you’re going to encrypt your backups (which I strongly advise), you need a story for secret storage. Luckily, protecting and safely storing a credential is a much smaller problem than protecting and safely storing terabytes of data.
Encryption Credential Loss
I rely heavily on encryption. My ZFS pool uses encrypted datasets that are mirrored bit for bit to remote devices, but without the keys. This renders them completely opaque to anyone without these keys, which are stored on my home server’s boot drive. This means that were my home server to be atomised (as above), I would be unable to access any of my data! To that end, these encryption keys are also stored in my password storage, as well as in a file in my infra repository, protected with ansible-vault.
Service Provider Changes
I see this come up a lot more lately about AI providers, where the frontier labs change their models or their pricing and devastate whatever new startups are essentially just front-ends to those APIs, but it is relevant in all areas where you depend on a small number of providers. Backblaze might change their policies, or S3’s prices might increase with rising hardware costs. Beyond that, depending on any third party provider requires that you trust them and their processes, for now and on an ongoing basis.
I try to minimise such dependencies. That’s really the entire raison d’être of Project Sunshine. The dependencies I do have (notably Tailscale) do not pertain to data storage and backups, and that is by design. Where I do rely on third parties, I try to guarantee a few things:
- An alternative exists. By far the most important value in this list. If there is exactly one provider of a service on which you depend, some kind of problem is only a matter of time. The remaining question is how big a problem it will be.
- I’m not on the frontier. I do not want to be a use case that is on the edge of basically anything. I don’t want to be the largest user of some new filesystem, I don’t want to be a customer who is pushing a service to the edge of its usage limits. I want my workloads to be child’s play for the tools I select.
- I understand the business relationship. Lots of tech companies don’t make money, or don’t make money in obvious ways, or make money in obvious ways that I find objectionable. When procuring things like this, I like to know that the provider is a going concern, and I like my relationship to be simple. Typically this means that I can obviously see how me paying for the service they provide is profitable, and that my relationship with them is not otherwise exploited for additional revenue.
A Backup Is Not a Backup Until It’s Been Tested
This is the most crucial advice that I can give. You can send data wherever you like, but until you’ve validated the return path, it’s worth only slightly more than no backup at all. Organisations are generally required to run a raft of disaster recovery tests at least annually that perform this exact function. If the infrastructure you manage is on behalf of an organisation, then you already do this. If it’s a hobby project, then I recommend you do it too.
Restoring from a backup is a task that increases in complexity as the size of the restoration increases. If I accidentally delete something in a file, it’s a relatively straightforward process to pull it back out of the Git repository, or perhaps Time Machine, or Restic. I put it back, and I’m done. At the other end of the spectrum, if I need to recover from a ransomware takeover, I’m wiping everything, building servers back from scratch, and mounting my remote backup to pull data back, a manual and tortuous process. But you have to run those processes.
I had the good fortune to perform a disaster recovery myself while building out this backup process. I accidentally wiped my boot device while trying to update something. This was foolish, but I refer you to my section above on “The Human Touch”. However, because of my backup processes, I didn’t have to be careful. It was sufficient to reinstall Fedora, run my Ansible playbooks, and run a script that restored the Restic backups to the right places. This entire exercise took me about 90 minutes, and I was left with a much stronger understanding of my local backups. If I had my time again, I would not do this, nor do I recommend it. But recovering from a real disaster is akin to making lemonade when I (in this case) hand myself lemons.
For my remote backups, I have mounted each volume on the mini-PC in the UK, and verified that they are in fact visible. Beyond that, any recovery from that node probably involves me flying over to pick it up, so I am not concerned about testing my ability to send/recv back over the network. The offline backups I validate periodically by mounting and inspecting.
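Mounting and inspecting can be made a little more rigorous by comparing a restored tree against the live one. The sketch below is one way to do that, hashing every file under two directories and reporting the differences. The function names and the example paths in the closing comment are hypothetical, and on a busy system some files will legitimately have changed since the backup was taken.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks so large files fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        while chunk := handle.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def tree_digests(root):
    """Map each file's path (relative to root) to the digest of its contents."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): file_digest(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def compare_trees(original, restored):
    """Return (missing, unexpected, changed) lists of relative paths."""
    a, b = tree_digests(original), tree_digests(restored)
    missing = sorted(a.keys() - b.keys())
    unexpected = sorted(b.keys() - a.keys())
    changed = sorted(p for p in a.keys() & b.keys() if a[p] != b[p])
    return missing, unexpected, changed


# Example call with hypothetical paths:
# missing, unexpected, changed = compare_trees("/srv/data", "/mnt/restore-test/data")
```

Three empty lists mean the restored copy matches file for file. Anything else is a prompt to find out why before the backup is needed in anger.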
The key point isn’t avoiding disaster, even though we should all be doing our best to do so. The key thing is preparing for it, because it will come by eventually. Maybe because of someone nefarious, but more likely because none of us are to be trusted at a Linux command line.