Massive home server upgrades

Now I’m playing with power

New home server internals showing the hardware

My home server has undergone a massive upgrade. From the leftover scraps of my first desktop to an e-waste Xeon, I’m officially entering the open ocean of home labbing. I’ll be talking about the upgrade itself, an almost-disaster, and what I plan to use the new hardware for.

Story

I repurposed my first desktop into a home server a few years ago. It’s been through a lot: it started as a gaming PC, then became a Samba file server and a Docker test bed. Every now and then, I get an itch to upgrade it, and when that happens, I search websites for used hardware. Usually the search fails because I can’t find anything good enough for cheap. But this time, I struck gold.

There was a seller on Kijiji selling almost a whole machine, with these specs:

Component    Name
-----------  ------------------------------
CPU          Xeon Platinum 8124M (18C/36T)
Motherboard  ASRock EPC621D8A
RAM          2666 MT/s ECC DDR4 (6 x 16 GB)
Cooler       Noctua NH-U14S DX-3647

Here’s my old server for comparison:

Component    Name
-----------  ----------------------------------
CPU          Core i5-6500 (4C/4T)
Motherboard  MSI B150 Gaming M3
RAM          2400 MT/s DDR4 (2 x 4 GB)
Cooler       Stock Intel downdraft cooler (lol)

This is a huge upgrade: from 4 cores to 18, and from 8 GB of memory to 96 GB of ECC memory, with a cooler from a great brand on top. And how much did the seller want for all this? $670 CAD. That’s a bargain; it’s probably the best deal I’ve ever seen for electronics, ever. The cheapest I’ve seen the processor go for on eBay is $300, the motherboard is hard to find and would cost at least a few hundred, and 96 GB of ECC RAM would cost at least $200 on its own.

Another cool fact? I didn’t have to spend money on a case or power supply; I reused them from my old home server. Keeping everything in an ATX mid-tower was important, because I don’t want a rack server (yet): they’re loud. So effectively, I got most of a home lab for less than the price of an RTX 4070. I don’t have a GPU yet, and adding one will require a PSU upgrade, but even with those two parts I doubt I’ll go over $1.2K. That’s pretty good value, I think, for powerful server/workstation-grade equipment.

Fun fact about the processor: the 8124M is an unofficial SKU made for Amazon, so it was never directly available to purchase. The most comparable official SKU that I could find from the same generation is the Xeon Gold 6154, and that had an MSRP of about $3.5K USD. The reason why the Platinum 8124M is so cheap? Probably because Amazon was liquidating them as e-waste. That’s why this server is partially rescued e-waste.

Almost horror story

Hint: It involves LGA pins.

LGA pins
Intel CPUs (and now AMD CPUs too) fit into a socket containing thousands of tiny pins, collectively known as a Land Grid Array (LGA). These pins must contact the underside of the CPU for everything to work. They are very fragile; you can deform them just by dropping a screw on them.

LGA pins are also really hard to unbend, because they’re delicate and surrounded by hundreds of other pins. To protect them, the socket has a plastic cover that shields the pins when a CPU isn’t inserted. This cover is so important that manufacturers will refuse RMAs if it’s not attached.

It turns out the seller hadn’t covered the CPU socket with its protective cover. When I pulled the motherboard out of its packaging and saw the exposed pins, I almost had a heart attack. If I had grabbed the wrong part of the motherboard, or if the packaging had pushed against the socket, I could have ruined the whole board. The motherboard has an LGA3647 socket, which means 3647 tiny pins. It’s also a server component, so it’s expensive. And because it’s second-hand, there is no warranty or RMA.

I spent at least 10 minutes going over the socket with a camera to check for bent pins. They looked fine, but only actually using the board would tell the truth. The thought that I might have screwed up a $700 purchase was really killing me. The system powered on, which already felt kind of miraculous. However, I found two big problems that I thought meant socket damage:

  1. The system would restart if I pushed the cooler (not supposed to happen)
  2. The BIOS was reporting an idle CPU temperature of 104 °C

The first problem is extremely bizarre. I had never seen it on any of the ~10 systems I’ve worked with, so I immediately suspected damaged pins on the motherboard. The second problem usually indicates a badly mounted cooler, but an overheating processor would also make the cooler itself super hot, which wasn’t the case. Because of this, I thought this problem was also caused by damaged pins.

Luckily, I was completely wrong.

The real cause was a loosely mounted cooler. I didn’t tighten the mounting screws enough, and that created just enough space for the CPU to move slightly. That’s not supposed to happen, and likely caused the problems I saw. Once the cooler was properly tightened, these problems went away.

Then a new problem showed up: unrecognized RAM. Only 3 of my 6 RAM sticks were detected. When I tried debugging with only 2 or 4 sticks, only 1 and 2 were detected respectively. Rarely, more sticks than usual would show up on one boot, then be gone on the next. Occasionally, the high temperature error would return. I suspected damaged pins again, or maybe that the seller had lied about something.

Again though, the problem was a loose cooler. Even though the cooler’s instructions said not to apply too much force, I had to tighten the mounting screws quite hard for the problem to disappear. In hindsight this makes sense: the memory controller lives on the CPU die, so every DIMM channel runs through the socket pins, and weak mounting pressure means flaky pin contact. Once the cooler was tightly screwed in, all 96 GB were detected, there were no CPU sensor errors, and the cooler wasn’t budging at all. Finally, I could move on to installing an OS.

What to do with this much power

IPMI rocks

Before I move on to talking about actually using the server, I need to talk about IPMI (Intelligent Platform Management Interface).

I can’t get over how great this is. IPMI is a combination of hardware and software that runs independently of the main computer. It allows full remote control, covering the OS and GUI, the BIOS, and power, and it also aggregates important sensor data. It’s like remote desktop, but at a much lower level than the OS. You can turn the machine on and off or get into the BIOS from anywhere, with no monitor or keyboard plugged directly into the computer.
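As a sketch of what this looks like in practice, here is roughly how I can poke at the BMC over the network with ipmitool (the address and credentials below are placeholders):

```
# Check and control power state from any machine on the LAN
ipmitool -I lanplus -H 10.0.0.20 -U admin -P changeme chassis power status
ipmitool -I lanplus -H 10.0.0.20 -U admin -P changeme chassis power on

# Read the same sensor data the BIOS sees (temperatures, fans, voltages)
ipmitool -I lanplus -H 10.0.0.20 -U admin -P changeme sensor list

# Dump the system event log, handy for post-mortems
ipmitool -I lanplus -H 10.0.0.20 -U admin -P changeme sel list
```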

It really sucks that product segmentation artificially keeps IPMI exclusive to servers. Sure, PiKVM exists, but it would be nice if consumer-grade motherboards came with IPMI.

Gentoo binhosting

This is the primary reason why I wanted a powerful server. In a nutshell:

I like Gentoo; I’ve talked about why I run it before. The problem is that I have weak computers I’d also like to install Gentoo on, but compiling on them would be a pain. Binhosting is the solution.

“Binhosting” is short for “binary hosting”: one powerful machine compiles everything once, and weaker machines download and use those binaries instead of compiling locally. The benefits of binhosting are:

  * Weak machines get Gentoo without hours of local compiling
  * Each package is compiled only once, no matter how many machines use it
  * Clients keep Gentoo’s USE-flag flexibility, as long as the host builds with settings they accept
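Here’s a minimal sketch of the setup, assuming Portage’s default package paths; the hostname is a placeholder for wherever the binhost ends up living:

```
# On the binhost: /etc/portage/make.conf
# Build a binary package alongside every normal emerge.
FEATURES="buildpkg"
# Packages land in /var/cache/binpkgs by default; serve that
# directory with any web server.

# On each client: /etc/portage/binrepos.conf
[gaia-binhost]
sync-uri = http://gaia.local/binpkgs
```

Clients then install prebuilt packages with something like emerge --ask --getbinpkg @world. The catch is that the host has to build with USE flags and CHOST/CFLAGS the clients can accept; otherwise Portage quietly falls back to compiling from source.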

I had wanted to try binhosting for a while, but I didn’t have a computer to do the hosting. For binhosting to make sense, I needed a CPU as fast as, or ideally faster than, my desktop CPU, a 13600K, and I didn’t want to turn my desktop itself into the binhost. Finding one for a low enough price was tough; it turns out the 13600K is hard to beat in value for productivity workloads.

But with an 18-core CPU, I can finally try binhosting. The pilot test will be done on an old MacBook, and if everything goes well, I might make my desktop rely on binhosting too. To set expectations, I “benchmarked” by compiling some big packages on both the 8124M and the 13600K.
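I won’t pretend the methodology was rigorous; a timing run looked something like this (qlop comes from portage-utils, and the package names follow the table below):

```
# Build one package in isolation and time it
time emerge --oneshot sys-devel/llvm

# Or read merge times back out of Portage's logs afterwards
qlop -t sys-devel/llvm
```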

Here are the exact package versions and compile times (mm:ss, lower is better), so that I can sound a little credible.

Package               13600K  8124M
--------------------  ------  -----
llvm-16.0.6           16:50   11:19
rust-1.69.0           11:42   16:51
gcc-12.3.1_p20230526  14:29   28:17

As you can see, I was extremely loose about controlling variables. I’m not a hardware reviewer; this is just an indicator of what I should expect from the 8124M as a binhost. The 13600K is much faster when compiling Rust, about 2x faster for GCC, and significantly slower for LLVM. I think I know the reason for this.

While compiling, I had htop running to see how loaded the CPU got. LLVM keeps the processor pinned at 100% the whole time, while GCC and Rust sometimes have only a few threads, or even just one, active. That is probably what makes the 13600K faster on those two: it clocks about 1.5 GHz higher than the 8124M and has much higher IPC, so it destroys the 8124M in bursty or single-threaded stretches. The 8124M wins in multi-core workloads, but only when it’s completely loaded. And since I’m compiling on a tmpfs, the 13600K’s faster DDR5 might be giving it another big boost.
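For context, the relevant knobs on my side look something like this (the tmpfs size is illustrative):

```
# /etc/portage/make.conf on the binhost: use every thread
MAKEOPTS="-j36"

# /etc/fstab: build in RAM so disk I/O isn't the bottleneck
tmpfs  /var/tmp/portage  tmpfs  size=64G,uid=portage,gid=portage,mode=775,noatime  0 0
```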

As a side note, the 13600K is also much more power efficient than the 8124M. I run the 13600K in an ITX case, so it’s power-limited to 110 W max, while the 8124M has a TDP of 240 W and I’ve left it at stock. From the LLVM numbers, the 8124M was about 50% faster (11:19 vs. 16:50) for almost 120% more power (240 W vs. 110 W). Makes sense: the 8124M is built on an older node with a much older architecture. It’s cool to see the staggering progress since the Intel stagnation era.

Overall, I’m happy with the 8124M’s performance. It might be slower than the 13600K in some cases, but it has more than enough muscle to be an effective binhost. Plus, the 8124M supports lots of ECC RAM (up to 1.5 TB!) and server-grade features in general. If I’m ever unhappy with the performance, I can go hunting for a newer CPU. In a few years, hopefully the halo W-3275M will be sold off as e-waste too; that beast has 28 cores and boosts 500 MHz higher than the 8124M.

ZFS

The great RAM support on the 8124M will let me run ZFS pools. In case you get confused by some terms:

ZFS
The Z File System, often described as the ultimate file system, bar none. It’s in an iffy spot due to Oracle and licensing, and is said to benefit from ECC RAM.

ECC RAM
Error Correction Code RAM. RAM (like any storage device) can return errors in the data it holds; ECC RAM detects and corrects these, which is important for projects where reliability is paramount. For example, you absolutely do not want bit errors messing up a skyscraper simulation.

I currently store data on my old server in RAID 1 BTRFS pools. They hold my legal Linux ISO collection and backups, so they’re important. I like BTRFS and use it almost everywhere, but I’ve always heard the legends of ZFS. I wanted to try it, yet never felt bothered enough to switch. With 96 GB of ECC RAM, I can now try the mighty file system without worry.

ZFS is a better file system than BTRFS. I’ll learn some fun things by experimenting with it, like how to scale up to massive data-hoarding pools. Maybe I’ll end up with a >100 TB ZFS pool hoarding… something.
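When I get around to it, the starting point will probably be a simple mirror, the ZFS analogue of my current RAID 1 BTRFS setup (device paths are placeholders):

```
# Create a two-disk mirrored pool named "tank"
zpool create -o ashift=12 tank mirror \
    /dev/disk/by-id/ata-DISK1 /dev/disk/by-id/ata-DISK2

# Transparent compression is nearly free and usually worth enabling
zfs set compression=lz4 tank

# Periodic scrubs catch silent corruption before it spreads
zpool scrub tank
zpool status tank
```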

What’s next?

In general, having a server platform gives me a higher hardware “cap”. The motherboard has more PCIe slots, the CPU supports more features and a ridiculous amount of RAM, and I now have 4 NICs.

There’s so much I don’t yet know I could do, which is exciting. I can run all the game servers and services I want without a problem. I’m going to mess with Docker again. I want to try connecting the new server to a VPS for file sharing. I also want to expand my networking knowledge: get a switch, replace my stock ISP router, and run VLANs. Maybe I’ll devise a self-hosted IoT network, or finally allow a smart TV onto a controlled network. There’s so much to learn and do.

I plan on getting a workstation GPU to run neural network models like Stable Diffusion. This is a whole separate problem that will take time to resolve. Not to mention the money for a GPU; these things are not cheap.

For now, I’m starting simple. I just got Gentoo installed with a custom kernel, and I’m still learning how to leverage IPMI. A good place to start is giving the new server a name. I’m changing things up and going with Gaia, instead of naming it after a spacecraft. Gaia symbolizes home, for a home lab.

Say hello to the world, Gaia.