Windows boot hell

Thanks Microsoft, very cool

Microsoft making my Gentoo install recovery process hellish

This is a story about how merely installing Windows caused my Linux installation to mess up, and how I finally caught the problem after half a day of tinkering around.

TLDR: Windows likes to mess up other OS drives. Always install it first in isolation, then continue with any other low-level operations.

System layout

I dual boot Windows and Gentoo (a Linux distro), each on separate drives. Windows was on a 2 TB drive, and Gentoo was on a 1 TB drive. The Windows drive was filling up quickly and SSD prices were cheap, so I bought a 4 TB drive for that OS. Gentoo would then use the previous 2 TB drive.

I boot in UEFI mode. My bootloader is rEFInd. Yes, I’m weird, I don’t use GRUB like a normal person. Graphically, the two drives looked like this.

Gentoo (1 TB):
    /dev/nvme0p1 (boot partition)
    /dev/nvme0p2 (root partition)
Windows (2 TB):
    /dev/nvme1p1 (boot partition)
    /dev/nvme1p2 (reserved partition)
    /dev/nvme1p3 (Windows partition)
    /dev/nvme1p4 (recovery partition)

These aren’t the real partition names; /dev/nvme0p1 is simplified from /dev/nvme0n1p1 because that’s ugly. The goal was to move Gentoo to /dev/nvme1 and move Windows to a new drive, /dev/nvme2, with the partition layout above.
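For anyone mapping the shortened names in this post back to what lsblk actually prints, here is a small sketch. It assumes each drive exposes a single NVMe namespace numbered 1 (the n1 part), which is typical for consumer SSDs but not guaranteed. real_nvme_name is a made-up helper name.

```shell
#!/bin/sh
# Hypothetical helper: expand the shortened names used in this post
# (/dev/nvme0p1) into the usual kernel names (/dev/nvme0n1p1).
# Assumption: the drive has a single namespace, numbered 1.
real_nvme_name() {
    printf '%s\n' "$1" |
        sed -E 's#^/dev/nvme([0-9]+)p([0-9]+)$#/dev/nvme\1n1p\2#'
}

# Anything that does not match the shortened pattern is printed unchanged.
real_nvme_name /dev/nvme0p2
```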

Gentoo migration

Migrating the Gentoo install was extremely easy. btrfs has a great command that migrates an entire drive very quickly. The whole migration took literally a minute to complete once I figured out the process.

  1. Partition the new drive with the same layout as /dev/nvme0
  2. Use mkfs.fat on the new boot partition. Don’t format the new root partition; btrfs will take care of that
  3. Make sure the old root file system is mounted somewhere. btrfs replace works on a mounted file system, and the new, unformatted root partition must stay unmounted
  4. Run (assumes the old root partition has btrfs device ID 1; give either the device ID or the source device, not both):
    btrfs replace start 1 /dev/nvme2p2 /path_to_mounted_root
    
  5. Once done, run:
    btrfs filesystem resize 1:max /path_to_mounted_root
    
  6. Mount the new boot partition and install the kernel and bootloader there
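The formatting, replace and resize steps can be sketched as a dry run that only prints the commands, so nothing touches a disk until they have been double-checked. The device names and the mount point are placeholders. The sketch relies on btrfs replace start taking either a device ID or a source device, followed by the target device and the mount point of the file system being migrated.

```shell
#!/bin/sh
# Dry run: print the commands instead of executing them.
# All names below are placeholders for illustration.
OLD_ROOT_DEVID=1          # btrfs device ID of the old root (see: btrfs filesystem show)
NEW_BOOT=/dev/nvme2n1p1   # new boot partition (assumed name)
NEW_ROOT=/dev/nvme2n1p2   # new, still unformatted root partition (assumed name)
ROOT_MNT=/                # where the existing btrfs root is mounted

plan_migration() {
    # FAT32 file system for the new boot partition
    echo "mkfs.fat -F 32 $NEW_BOOT"
    # Move the data to the new partition; -B keeps the command in the foreground
    echo "btrfs replace start -B $OLD_ROOT_DEVID $NEW_ROOT $ROOT_MNT"
    # Confirm that the replace finished
    echo "btrfs replace status $ROOT_MNT"
    # Grow the file system to fill the larger partition
    echo "btrfs filesystem resize $OLD_ROOT_DEVID:max $ROOT_MNT"
}

plan_migration
```

The device ID is reused in the resize step because the new partition takes over the ID of the one it replaced.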

And that’s it. I turned off the computer, disconnected the old NVMe, and turned the computer on. Everything worked as if nothing had changed. Amazing. Now time to reinstall Windows.

The trouble begins - Windows installation

The best way to migrate Windows to a new drive is to install it on the new drive. Hard to believe that the only alternative is drive cloning software, but I didn’t want to use that. I got a USB with the Windows installer, installed the OS, and booted into it without a problem. Then I rebooted the computer to boot into Gentoo, because I needed to copy some files into the new Windows install.

Note: the new Gentoo drive was still connected to the system at this time. When I want to move files from Linux into Windows, I mount the Windows partition (/dev/nvme1p3), then copy files there. Windows makes 4 partitions when using UEFI boot, as shown before. However, I found out that Windows only made 2 partitions during installation this time. That was weird, because it meant Windows was booting in BIOS mode, which I didn’t want. I reinstalled Windows to check if it was a system error, but I got two partitions again.
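A quicker check than counting partitions is to look at the partition table type of the Windows disk from Linux. Windows boots in UEFI mode only from a GPT disk and in legacy BIOS mode only from an MBR disk, which lsblk reports as dos. A minimal sketch follows; windows_boot_style is a made-up helper and /dev/nvme1n1 is an assumed device name.

```shell
#!/bin/sh
# Map a partition table type (as printed by: lsblk -dno PTTYPE <disk>)
# to the firmware mode Windows needs to boot from that disk.
windows_boot_style() {
    case "$1" in
        gpt) echo "UEFI" ;;
        dos) echo "legacy BIOS" ;;
        *)   echo "unknown" ;;
    esac
}

# Example with a hard-coded value. On a real system one might run:
#   windows_boot_style "$(lsblk -dno PTTYPE /dev/nvme1n1)"
windows_boot_style gpt
```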

Eventually I figured out that the Gentoo drive was somehow interfering with installation. When reinstalling Windows for the third time and without any other drives attached, Windows finally made 4 partitions. At last, I had both OSes migrated to their new drives. I decided to boot into Gentoo to take care of some stuff, but something was wrong.

In BIOS, I had previously set the Gentoo drive as the default boot option. But the system booted straight into Windows when I switched it on. This wasn’t supposed to happen, so I booted into BIOS to see what was wrong. The Gentoo boot entry was gone.
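The same thing can be checked from a Linux live environment, since efibootmgr lists the firmware’s boot entries. Below is a sketch that scans efibootmgr-style output for an entry. The sample text is made up for illustration and is not output from this machine, and has_boot_entry is a hypothetical helper.

```shell
#!/bin/sh
# Succeed if efibootmgr-style output on stdin contains a boot entry
# whose description mentions the given text.
has_boot_entry() {
    grep -Eq "^Boot[0-9A-Fa-f]{4}[* ] .*$1"
}

# Made-up sample output, for illustration only.
sample='BootCurrent: 0000
BootOrder: 0000
Boot0000* Windows Boot Manager'

if printf '%s\n' "$sample" | has_boot_entry "rEFInd"; then
    echo "rEFInd entry present"
else
    echo "rEFInd entry missing"
fi
```

On a real system the input would come from running efibootmgr itself, for example: efibootmgr | has_boot_entry rEFInd.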

Kernel panic

Gentoo not being listed as a boot option meant that Windows somehow messed up my Gentoo drive. Great, so Windows did something weird to a drive that it wasn’t told to touch. Solving this should have been simple: boot a Linux live environment, chroot into the drive that needs fixing, and reinstall the bootloader. After doing that, I tried booting into Gentoo again but something new happened: the kernel panicked with the classic “unable to mount root fs on unknown block” error.
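For reference, the usual chroot repair sequence looks roughly like the dry run below, which prints the commands rather than running them. The device names, the /mnt/gentoo mount point, and the boot partition being mounted at /boot are all assumptions, not details taken from this machine.

```shell
#!/bin/sh
# Dry run: print a typical chroot-repair sequence from a live environment.
# Device names and paths are assumptions for illustration.
ROOT_PART=/dev/nvme0n1p2
BOOT_PART=/dev/nvme0n1p1
TARGET=/mnt/gentoo

plan_chroot_repair() {
    echo "mount $ROOT_PART $TARGET"
    echo "mount $BOOT_PART $TARGET/boot"
    # Make the kernel's virtual file systems visible inside the chroot
    echo "mount --types proc /proc $TARGET/proc"
    echo "mount --rbind /sys $TARGET/sys"
    echo "mount --rbind /dev $TARGET/dev"
    echo "chroot $TARGET /bin/bash"
    # Inside the chroot, the bootloader is reinstalled (e.g. refind-install).
}

plan_chroot_repair
```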

This error basically means the kernel couldn’t find the Linux root file system. There are multiple causes, such as pointing the kernel at the wrong partition or not having proper file system support built into the kernel. Debugging this took a lot of time. I tried compiling newer kernels (which I knew had everything I needed already compiled), adjusting boot parameters, and reinstalling the boot partition multiple times. Nothing worked. I was stumped at this point, because I had tried fixing all the common mistakes people make when customizing Gentoo.

Something seriously went wrong

Many hours of trying to get a bootable system were wasted. For the longest time, I tried tweaking kernel parameters. If the kernel couldn’t find the root partition, I had to force it to find the correct partition. The most commonly recommended way to declare the root partition is root=PARTUUID=.... The PARTUUID is a partition identifier that doesn’t change when drives are reordered between boots. Think of it like the address or PII of a partition.
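As a sketch of what that parameter looks like in practice, the helpers below build the root= string and a refind_linux.conf-style line (a quoted menu label followed by quoted kernel options). The PARTUUID is a placeholder, both function names are made up, and the rEFInd config actually used here is not shown in this post.

```shell
#!/bin/sh
# Build a root= kernel parameter from a PARTUUID.
# On a real system the value could come from:
#   blkid -s PARTUUID -o value /dev/nvme0n1p2
root_param() {
    printf 'root=PARTUUID=%s\n' "$1"
}

# Build one refind_linux.conf-style line: "label" "kernel options".
refind_line() {
    printf '"%s" "%s rw"\n' "$1" "$(root_param "$2")"
}

# Placeholder PARTUUID, not a real one.
refind_line "Boot Gentoo" "01234567-89ab-cdef-0123-456789abcdef"
```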

I tried setting the root partition as a parameter, but that led to even weirder errors: init not found, devtmpfs failed to mount. I tried manually specifying the init binary, but that failed. In a desperate attempt, I tried to use GRUB as the bootloader. With GRUB, I didn’t see a kernel panic message. Just when I thought things were finally over, I got hit by another surprise: the system was stuck on loading linux-6.4.3.... This likely means one or both of the following:

  1. A basic frame buffer is not configured
  2. GPU firmware couldn’t be loaded

So now I had a kernel configuration problem, which was weird. The last time I made big changes to the kernel was over a year ago, and it had been working ever since. Exhausted, I opened my kernel’s config and was shocked to find that every kernel customization was reset. btrfs support? Not enabled. AMDGPU firmware? Unloaded. Intel microcode loading? Gone. Even vfat support was gone, which means booting was impossible from the start.
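A check like the following would have caught this hours earlier. It lists options that are not built in (=y) in a kernel .config, which matters on a system without an initramfs because modules (=m) can’t be loaded before the root file system is mounted. The option names are real kernel symbols, but the sample config is made up and missing_builtin is a hypothetical helper.

```shell
#!/bin/sh
# Print every listed option that is NOT built in (=y) in a kernel .config.
# Options set to =m or "is not set" both count as missing.
missing_builtin() {
    config=$1
    shift
    for opt in "$@"; do
        grep -qx "CONFIG_${opt}=y" "$config" || echo "$opt"
    done
}

# Made-up sample config for illustration.
sample=$(mktemp)
printf '%s\n' 'CONFIG_VFAT_FS=y' 'CONFIG_BTRFS_FS=m' \
    '# CONFIG_DEVTMPFS is not set' > "$sample"

missing_builtin "$sample" BTRFS_FS VFAT_FS DEVTMPFS BLK_DEV_NVME
rm -f "$sample"
```

Against a real tree this would be pointed at something like /usr/src/linux/.config.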

I think I messed this up while upgrading the kernel from 6.0.0 to 6.4.3. This was a last-ditch attempt to get my system working, but I must have forgotten to carry over my old config, or entered the wrong command. I spent about 30 minutes reconfiguring kernel options using the Gentoo wiki. Then I compiled the kernel, reinstalled GRUB, and finally got a working system after about 9 hours of debugging. I immediately went to bed after.
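Carrying a config over between kernel versions usually comes down to copying .config into the new source tree and letting make olddefconfig fill in defaults for any new options. Here is a cautious sketch that refuses to overwrite an existing config. carry_over_config is a made-up helper, and the demo uses temporary directories rather than real kernel trees.

```shell
#!/bin/sh
# Copy the old kernel's .config into the new source tree, refusing to
# overwrite a config that is already there.
carry_over_config() {
    old_dir=$1
    new_dir=$2
    [ -f "$old_dir/.config" ] || { echo "no .config in $old_dir" >&2; return 1; }
    [ -e "$new_dir/.config" ] && { echo "$new_dir/.config already exists" >&2; return 1; }
    cp "$old_dir/.config" "$new_dir/.config"
    echo "copied; next run: make -C $new_dir olddefconfig"
}

# Demo with throwaway directories standing in for the kernel trees.
old=$(mktemp -d)
new=$(mktemp -d)
echo 'CONFIG_BTRFS_FS=y' > "$old/.config"
carry_over_config "$old" "$new"
rm -rf "$old" "$new"
```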

After waking up, I tried to get rEFInd working again in hopes of restoring my original system. I nuked the boot partition, copied in the kernel again, and installed rEFInd. And everything worked. I was stumped at this point. No amount of kernel parameters or rEFInd configuration did anything the previous night, so why was it working all of a sudden?

I checked the rEFInd config file to see what was going on. The kernel parameter specifying the root device was root=/dev/nvme2p2. That’s the Gentoo root partition, but using the unreliable /dev/xyz name; the name that can change upon a new boot.

Every forum post I read recommended setting root=PARTUUID=... for the root device. Makes sense, because that identifier won’t change, unlike /dev/nvme2p2. My root drive is currently /dev/nvme0n1p2, but it might change to /dev/nvme1n1p2 tomorrow. If the device name changes, the kernel will panic again because it now can’t find the root partition. The reliable PARTUUID parameter just doesn’t work for me. Maybe it’s because I don’t use an initramfs. Maybe it has something to do with my customized kernel.
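To see which kind of name a given boot is actually using, the root= argument can be pulled out of the kernel command line and classified. This is a minimal sketch with made-up helper names. The classification reflects that a kernel without an initramfs resolves PARTUUID= and PARTLABEL= itself, while plain /dev names depend on the order in which the drives are enumerated.

```shell
#!/bin/sh
# Extract the root= argument from a kernel command line string.
root_arg() {
    for word in $1; do
        case "$word" in
            root=*) echo "${word#root=}" ;;
        esac
    done
}

# Classify how the root device is named.
root_naming() {
    case "$(root_arg "$1")" in
        PARTUUID=*|PARTLABEL=*) echo "stable identifier" ;;
        /dev/*)                 echo "enumeration-dependent device name" ;;
        "")                     echo "no root= given" ;;
        *)                      echo "other" ;;
    esac
}

# Example string; on a running system: root_naming "$(cat /proc/cmdline)"
root_naming "root=/dev/nvme0n1p2 rw"
```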

Whatever it was, I don’t want to think about it; just typing this out makes me exhausted. By the time rEFInd was working, it had been over 24 hours since I got Gentoo restored.

I don’t know what the root cause was, I don’t need to, and I kind of don’t want to. Until my current computer either completely fails or is burnt down by a fire, I’m never going to mess with bootloaders and Windows installations again. I doubt I would tolerate an overnight session of kernel debugging when I’m a little older. If only I had installed Windows first and then migrated Gentoo, I wouldn’t have had any problems. Lesson learned, Windows.