dpzmick.com

Homelab Act 3: NAS, ZFS, and NFS

For the last 172 days (according to uptime on one of my routers), I've been setting up a small homelab. Homelabs are kind of cool, and the setup has been interesting. I'll be writing a few posts explaining the steps I took.

In this homelab post, I'll be detailing how I converted the R720 I bought on eBay into a NAS server on my local network. This was an expensive project; I'm not sure it was worth the time or energy. I'll add more discussion on the usability of something like this in the 4th part of this series.

Why?

I recently had an old SSD fail. I had no idea it was going bad until I tried to open a handful of old photos and found that they were lost. Some of the SMART counters for the drive had apparently ticked up, but I had never set up anything automated to monitor them, so it was missed.

I wanted a safe place to store some important files. I didn't want to use a cloud service for this, since I've been trying to scale back my cloud dependence and vendor lock-in problem. For local disk, I would certainly need some form of RAID array, and I needed reliable backups.

Currently, the root of /nas contains:

  • audio samples
  • facebook/google data exports
  • iso files (windows iso, recent Arch iso, etc)
  • some ripped CDs that are not on streaming services
  • RAW photos
  • old programming projects zipped up
  • a bunch of papers as pdfs in a folder, unorganized
  • a bunch of old school work
  • org mode files

Currently this is only using about 500 gigs of space.

Most of this used to live in either a spinning rust drive on my desktop, or my home directory (but not really synced other than occasional rsyncs). These files were all backed up by CrashPlan. Recently CrashPlan's restores got super slow, so this was an opportunity to try something new.

This post will discuss setting up ZFS, mounting it as a NAS, backups, and my experience actually trying to use this system.

Setup and Planning

Since I was only using about 500 gigs of space, I wasn't going to need some super high density system. I decided to instead try and build something that would be 1) low power, 2) pretty "fast", 3) "fun".

Since this is all going into my only server, I'd also need some space for all of the other services I wanted to toy with, so I came up with this:

Data Category | Space                    | Speed    | Redundancy
Personal Data | 2TB (should last years)  | fast-ish | needs to be highly redundant
server drive  | 500GB (more than enough) | fast     | doesn't need to be redundant
OS drive      | 500GB                    | fast     | doesn't need to be redundant

The server has 16 2.5 inch SAS/SATA drive bays connected to a "SAS backplane." The backplane is connected to a Dell-Rebranded LSI hardware RAID controller card. The RAID controller card is connected to the CPU via PCIe (in a Dell-proprietary form factor slot).

The good thing about all of this is that the backplane and RAID controller are well integrated into Dell's remote management tools. The bad thing is, of course, that many of these parts are proprietary, but more on that later. The RAID controller uses 8 PCIe gen 3 lanes.

The backplane on the server holds 16 drives across 8 SAS ports. Each SAS port can do 6 Gbit/s, so we can do (6*8)/8 = 6 GB/s on the RAID controller + backplane, in theory. This is well matched with the PCIe bandwidth, which is theoretically around 7.9 GB/s.
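
The bandwidth comparison above can be worked through with a little shell arithmetic. This is only a rough sketch: the lane counts and raw rates come from the text, while the line-encoding overheads (8b/10b for 6 Gbit/s SAS, 128b/130b for PCIe gen 3) are details added here and are not part of the original estimate.

```sh
# Rough link-bandwidth arithmetic for the backplane and the PCIe slot.
# Usage: gbytes_per_sec LANES GBIT_PER_LANE PAYLOAD_BITS TOTAL_BITS
# PAYLOAD_BITS/TOTAL_BITS is the line-encoding efficiency (pass 1 1 to ignore it).
gbytes_per_sec() {
    awk -v n="$1" -v rate="$2" -v p="$3" -v t="$4" \
        'BEGIN { printf "%.1f\n", n * rate * p / t / 8 }'
}

echo "SAS, raw line rate:  $(gbytes_per_sec 8 6 1 1) GB/s"
echo "SAS, after 8b/10b:   $(gbytes_per_sec 8 6 8 10) GB/s"
echo "PCIe gen 3 x8:       $(gbytes_per_sec 8 8 128 130) GB/s"
```

With encoding accounted for, the SAS side comes out closer to 4.8 GB/s, which still sits comfortably under the PCIe limit.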

Given the speed constraints, the redundancy desire, and the low space requirements, I decided to try and build this entire thing with SSDs. SAS SSDs designed for servers aren't cheap, so I decided to look at crappy SATA consumer SSDs.

Apparently, most RAID systems don't really like expanding the number of disks in the array. I decided to price out filling up the system with disks.

I ended up with 4 disks from each of 4 different vendors (to reduce the risk of all of them failing at the same time):

  • 4x ADATA SU635 480GB 3D-NAND SATA 2.5 inch Internal SSD
  • 4x SanDisk SSD PLUS 480GB Internal SSD - SATA III 6 Gb/s
  • 4x Kingston 480GB A400 Sata3 2.5 Internal SSD
  • 4x PNY CS900 480GB 2.5” SATA III Internal Solid State Drive

From Amazon, this ended up costing like $800, which is, uh, not very cheap. I also had to get some disk enclosures on eBay to install these disks into the server.

Filesystem

Next thing up, I needed to pick a filesystem/RAID scheme to run on these drives.

Hardware RAID

I bought the "upgraded" RAID controller when purchasing the server, since I wanted to keep my options open. After thinking harder about it, hardware RAID doesn't really seem that interesting to me. Hardware RAID might be a win if I didn't have tons of RAM to spare, or if I was very CPU constrained. Since neither of those is the case, it seems wise to use the powerful CPUs and large amount of RAM available on the server to do fs checksumming and for caching purposes.

ZFS

Awesome ZFS features:

  • Great reliability features (checksumming in metadata)
  • Snapshots
  • Very flexible RAID configurations
  • Theoretically easy to set up and try out
  • Fantastic monitoring tools (check out the influxdb integrations)
  • Great reviews online
  • Great arch-wiki documentation
  • Apparently no RAID write hole problems despite not having a dedicated write-flush backup battery like HW would have

Linux software RAID (mdadm and LVM)

Looks great, but not as featureful as ZFS. I figured that if I tried ZFS out and it didn't work, it would be easy to switch.

BTRFS was eliminated early as it seems to still be fairly immature.

Installation

Physical Install

Getting these drives into the server was easy. Just screw them into the enclosures:

disk_in_enclosure.jpg

Then pop them into the front mounting slots:

all_installed.jpg

Configure RAID

Next up was configuring the RAID controller to get out of the way. I wanted the RAID controller to just pass the disks through to the operating system. It also seemed important to make sure that I could access the S.M.A.R.T. status of the devices.

Surprise surprise, the upgraded RAID controller I purchased is not able to do this! Apparently, the lower end model is, but only if you flash the thing with some special alternative firmware that breaks all of the fancy Dell integration.

Regardless, I booted the machine with some of the drives installed to see what would happen. The Dell controller was not happy with the consumer drives. It marked a number of them as degraded, and thought that the Kingston drives were SAS drives (maybe they actually are? never figured this out). Fortunately, it seemed like all of the drives were working.

Downgrade controller

After a very very large amount of time spent googling around, I found some references that said that, if you get the downgraded Dell H310 mini controller, it is possible to flash the controller to an alternative LSI "IT mode" firmware. The IT mode firmware is supposed to allow you to just pass the disks through to the OS.

There's a guy on eBay who will sell you one of these pre-flashed. Search for "Dell H310 mini monolithic K09CJ with LSI 9211-8i P20 IT Mode" then just buy one from him.

I of course didn't go down this path. Instead, I found some PDF file on archive.org that contained some instructions for flashing the controller. Since references to this file seem to all go stale, I'm mirroring it here, although I keep redoing my blog so this link will probably go stale too.

I booted an Arch Linux iso through the remote management interface and configured everything from Arch. To follow these instructions, you'll have to find the LSI firmware files. Since LSI has been acquired like 30 times, it's not entirely clear where to go. To find these file names, I had to read a bunch of other blog posts with a bunch of references to the files you need to download.

You're looking for:

  • 9211_8i_Package_P20_IR_IT_FW_BIOS_for_MSDOS_Windows.zip. After unzipping, you'll find Firmware/HBA_9211_8i_IT/Firmware/HBA_9211_8i_IT.bin
  • UEFI_BSD_P20.zip. After unzipping, you'll find uefi_bsd_rel/Signed/x64sas2.rom

There's a note in the PDF that says:

Should you want to boot off a drive attached to the H310MM, you will also have to flash the appropriate bootrom (mptsas2.rom for BIOS, x64sas2.rom for UEFI).

This is a very true statement and you'll be stuck scratching your head for a long time if you miss it. Make sure to also flash the EFI firmware to the device.

Disk inventory

Since the Dell firmware integration is all broken with the new firmware, I needed to be able to keep track of which drive was which without being able to easily toggle the chassis LEDs. I booted an Arch ISO and started dd-ing zeros to each disk through /dev/disk/by-id/, then recording the serial numbers of the disks whose activity LEDs lit up. For some reason, the activity LEDs won't light up on the ADATA disks, so I just popped those in and out and watched the kernel logs.

All of the serial numbers and slot assignments are saved in a safe place.
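
As a rough illustration of the inventory step, the helper below lists the whole-disk entries under /dev/disk/by-id/ next to the kernel device each one points at. The function name is made up for this sketch, and it only lists disks; it does not write anything to them. The ata-* entries usually embed the model and serial number, which is what makes them handy for this.

```sh
# List whole-disk entries under a by-id directory together with the kernel
# device each one currently points at. Partition links (*-partN) are skipped.
# The directory argument defaults to /dev/disk/by-id.
list_disks_by_id() {
    dir="${1:-/dev/disk/by-id}"
    for link in "$dir"/*; do
        [ -L "$link" ] || continue            # only symlinks are interesting
        case "$link" in *-part[0-9]*) continue ;; esac
        printf '%s -> %s\n' "${link##*/}" "$(readlink -f "$link")"
    done
}

list_disks_by_id
```

Pairing each printed name with the physical slot where an LED blinked (or where a disk was pulled) gives the serial-to-slot table.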

Install Operating System

From the Arch iso, I partitioned the disk in the 0th slot, and installed Arch using the standard install guide.

rootdelay

The OS install went smoothly so I thought I was done with this ordeal. Wrong!

Linux consistently failed to boot. I'd get through a GRUB screen, load initrd, then consistently fail to find the root partition. The root partition was on the same drive as GRUB, so this doesn't really make sense.

Apparently, when booting, the EFI system initializes the controller to get the bootloader, Linux init ram disk, etc. But then, when the initrd starts, something in Linux's drivers causes the SAS controller to reinit. The controller takes a long time to initialize, so Linux will have a hard time finding its boot disk.

Adding rootdelay=600 to my kernel command line got me past this problem; now Linux waits up to 600 seconds (10 minutes) for the root partition to show up before failing to boot.
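
With GRUB, one way to make the parameter stick is to add it in /etc/default/grub. The fragment below is a minimal sketch assuming a fairly stock Arch GRUB setup; only rootdelay=600 comes from this post, and the other parameters are placeholders for whatever is already on that line.

```sh
# /etc/default/grub (fragment). Only rootdelay=600 comes from this post;
# "loglevel=3 quiet" stands in for whatever is already on the line.
GRUB_CMDLINE_LINUX_DEFAULT="loglevel=3 quiet rootdelay=600"

# Afterwards, regenerate the GRUB config as root:
#   grub-mkconfig -o /boot/grub/grub.cfg
```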

Configure ZFS

Just follow the instructions on the Arch Wiki. I installed the DKMS version of ZFS so that I would be able to pacman -Syu and have pacman attempt to rebuild ZFS with the latest kernel.

I set up two zpools. One for my personal files named nas and another for server stuff named server. These are mounted, creatively, at /nas and /server.

nas

For the nas zpool, I'm using 12 disks with data striped across two RAIDZ2 vdevs (groups of disks inside the one pool). In other words, each of the RAIDZ2 groups can lose two disks without failing. All of my data is striped across these two groups. I have no idea if this is a good idea.
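
A sketch of what that layout looks like on the command line: the helper below prints (it does not run) a zpool create command for a pool striped across two RAIDZ2 groups. The even 6+6 split, the helper name, and the placeholder disk names are all assumptions; the post only says 12 disks across two RAIDZ2 groups.

```sh
# Print (do not run) a zpool create command for a pool striped across two
# RAIDZ2 groups. Usage: nas_create_cmd POOL DISK1 ... DISK12
# Assumes a 6+6 split: the first six disks form one group, the rest the other.
nas_create_cmd() {
    pool="$1"; shift
    [ "$#" -eq 12 ] || { echo "expected 12 disks, got $#" >&2; return 1; }
    cmd="zpool create $pool raidz2"
    i=0
    for disk in "$@"; do
        [ "$i" -eq 6 ] && cmd="$cmd raidz2"   # start the second group
        cmd="$cmd $disk"
        i=$((i + 1))
    done
    echo "$cmd"
}

# Placeholder names; real ones would come from /dev/disk/by-id/.
nas_create_cmd nas d01 d02 d03 d04 d05 d06 d07 d08 d09 d10 d11 d12
```

Under that 6+6 assumption, each group holds four disks' worth of data, so the pool offers roughly 2 x 4 x 480 GB, or about 3.8 TB, before ZFS overhead.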

From a performance perspective, check out this post:

For performance on random IOPS, each RAID-Z group has approximately the performance of a single disk in the group.

So, the performance isn't going to be fantastic on the nas array. I'll pretty much only be aggregating across the two stripes, so, assuming read/write of 500 MB/s on a standard SATA SSD, I should expect read/write speeds around a gig a second for the pool. Fortunately, that's exactly what I'm getting.

server

The server array is just a single raidz1 array with 3 disks in it. This array isn't that interesting and I haven't tried to push it very hard yet.

Contiguous reads/writes run at ~400-500 MB/s, as expected, since this is about the rate I'd expect from a single drive in the array.

Performance

For the nas array:

# copy 5 GiB file of random bytes from /tmp (ramdisk), to the ZFS array
$ dd if=/tmp/test of=test bs=2M
2560+0 records in
2560+0 records out
5368709120 bytes (5.4 GB, 5.0 GiB) copied, 5.51472 s, 974 MB/s

# on the server, read the file we just copied to nowhere (immediately after writing)
$ dd if=test of=/dev/null bs=2M
2560+0 records in
2560+0 records out
5368709120 bytes (5.4 GB, 5.0 GiB) copied, 2.96748 s, 1.8 GB/s

# same thing again
$ dd if=test of=/dev/null bs=2M
2560+0 records in
2560+0 records out
5368709120 bytes (5.4 GB, 5.0 GiB) copied, 2.2822 s, 2.4 GB/s

# drop page cache and zfs arc cache, reread same file
$ dd if=test of=/dev/null bs=2M
2560+0 records in
2560+0 records out
5368709120 bytes (5.4 GB, 5.0 GiB) copied, 4.96745 s, 1.1 GB/s
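
The transcript above doesn't show how the test file was created or how the caches were dropped. The helpers below are one plausible way to do both, not necessarily what was done here; the function names and the demo path are made up for this sketch.

```sh
# Create a file of random bytes. Usage: make_test_file PATH SIZE_MIB
make_test_file() {
    dd if=/dev/urandom of="$1" bs=1M count="$2" 2>/dev/null
}

# Ask the kernel to drop clean page cache, dentries and inodes (needs root).
# The ZFS ARC is separate from the page cache; exporting and re-importing
# the pool (zpool export / zpool import) is one way to be sure it is empty.
drop_caches() {
    sync
    echo 3 > /proc/sys/vm/drop_caches
}

# Small demo file; the benchmarks above used 5 GiB (SIZE_MIB=5120).
make_test_file "${TMPDIR:-/tmp}/ddtest.bin" 4
```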

Sidebar: NVMe

My desktop has a single $300 NVMe drive in it. Compare:

# copy 5 GiB file of random bytes to NVMe
$ dd if=/tmp/test of=test bs=4M
1280+0 records in
1280+0 records out
5368709120 bytes (5.4 GB, 5.0 GiB) copied, 4.76216 s, 1.1 GB/s

# copy to nowhere (pagecache)
$ dd if=test of=/dev/null bs=4M
1280+0 records in
1280+0 records out
5368709120 bytes (5.4 GB, 5.0 GiB) copied, 0.468432 s, 11.5 GB/s

# drop caches and try again
$ dd if=test of=/dev/null bs=4M
1280+0 records in
1280+0 records out
5368709120 bytes (5.4 GB, 5.0 GiB) copied, 1.76705 s, 3.0 GB/s

One NVMe/PCIe drive is destroying this expensive array.

In theory, if I striped across all of these SSDs I'd be able to get competitive, but I have bigger unresolved performance issues with NFS and I already have valuable data on this array, so I have not tried this yet.

NFS

Basic NFS is easy to set up with ZFS. You can simply install the NFS server packages, then tell ZFS to export the mount point.
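
Roughly, the two halves look like the sketch below. By default it only prints the commands so they can be reviewed first. The helper names, the r720.lan hostname, and the /mnt/nas mount point are placeholders, and it assumes the NFS server is already installed and running (nfs-utils and nfs-server.service on Arch).

```sh
# Print the commands by default; set DRY_RUN=0 to actually execute them (as root).
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then echo "+ $*"; else "$@"; fi
}

# On the server: let ZFS export the dataset over NFS.
share_over_nfs() {
    run zfs set sharenfs=on "$1"
}

# On a client: mount the export. Usage: mount_nfs HOST EXPORT MOUNTPOINT
mount_nfs() {
    run mount -t nfs "$1:$2" "$3"
}

share_over_nfs nas
mount_nfs r720.lan /nas /mnt/nas
```

Setting sharenfs=on uses default export options; it can also take an option string to restrict which clients may connect.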

NFS performance

Unfortunately, NFS over my 10 GbE network doesn't perform as well as you'd hope.

From an NFS mount over 10 GbE (default mount options, few seem to make a difference but I have more to learn here):

# copy a 5 GiB file of random bytes from /tmp (ramdisk), to the NFS mount
# From switch stats: NFS isn't saturating the link for some reason.
$ dd if=/tmp/test of=test bs=1M
5120+0 records in
5120+0 records out
5368709120 bytes (5.4 GB, 5.0 GiB) copied, 10.2211 s, 525 MB/s

# read the file we just copied to nowhere (immediately after writing)
# again, the switch maxed out at 4gbps during this transfer..
#        but mostly was nowhere close to the limit
$ dd if=test of=/dev/null bs=1M
5120+0 records in
5120+0 records out
5368709120 bytes (5.4 GB, 5.0 GiB) copied, 16.3145 s, 329 MB/s

# same thing again
# better, this time I'm hitting the page cache on my RYZEN box
$ dd if=test of=/dev/null bs=1M
5120+0 records in
5120+0 records out
5368709120 bytes (5.4 GB, 5.0 GiB) copied, 0.614145 s, 8.7 GB/s

# drop page cache, reread same file
# again, same deal
$ dd if=test of=/dev/null bs=1M
5120+0 records in
5120+0 records out
5368709120 bytes (5.4 GB, 5.0 GiB) copied, 9.64609 s, 557 MB/s

As of this time, I haven't attempted to figure out why these rates are so poor. Trivial network tests with iperf3 and some custom code indicate that my NIC drivers and switch are all working properly, so there must be something I need to tune somewhere in the NFS layer.
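
To put the dd numbers above in terms of the link, here is a small conversion from MB/s to Gbit/s and to a share of a 10 Gbit/s link. It ignores Ethernet, IP, TCP, and NFS framing overhead, so it slightly understates what is actually on the wire; the helper name is made up for this sketch.

```sh
# Convert a dd rate in MB/s into Gbit/s and a percentage of a 10 Gbit/s link.
link_utilization() {
    awk -v mbps="$1" 'BEGIN {
        gbit = mbps * 8 / 1000
        printf "%d MB/s = %.1f Gbit/s (%.0f%% of 10 GbE)\n", mbps, gbit, gbit * 10
    }'
}

link_utilization 525   # NFS write
link_utilization 329   # first NFS read
link_utilization 557   # NFS read after dropping the page cache
```

Even the best of these uses less than half of the link, which lines up with the roughly 4 Gbit/s peak seen on the switch.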

I can trivially saturate gigabit with these rates, which means I'm also trivially saturating the uplink through my VPN as well. Since I'm currently spending more of my time connected to the VPN from remote places (with less than gigabit bw), optimizing the NFS has not been a priority.

Using the system

NFS works as well as I'd expect it to, but I'll discuss this and a few other details in a future post.

I've copied a bunch of files onto the nas mount from my laptop and desktop, both locally and remotely. ZFS has been rock solid and the DKMS builder has rebuilt the modules successfully so far during kernel upgrades.

A ZFS scrub detected one checksum error, but fixed it by itself. All disks report that they are healthy. Cosmic rays?
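
Since the failed SSD that started all of this went unnoticed for lack of monitoring, a tiny health check is worth sketching. The helper below reads the output of zpool status -x on stdin and succeeds only when it reports that everything is fine. The function name and the alert command are placeholders, and this is a minimal sketch rather than a replacement for proper alerting.

```sh
# Read `zpool status -x` output on stdin; succeed only if every pool is healthy.
pools_healthy() {
    status="$(cat)"
    [ "$status" = "all pools are healthy" ]
}

# Typical use from cron or a systemd timer (the alert command is a placeholder):
#   zpool status -x | pools_healthy || echo "ZFS needs attention"
# A scrub itself is started with:
#   zpool scrub nas
echo "all pools are healthy" | pools_healthy && echo "ok"
```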

The biggest win by far is having my orgmode files available on all of my computers without using some third party to do syncing.

Overall, I'm reasonably happy with this setup, although I'm wondering if I should have just set up some sort of FUSE mount of B2 and moved on with life. Getting this to work was a lot of work, and the number of things that need to not break is large. The local network performance doesn't help me at all when I'm remote, which is most of the time.
