Kiwi & OpenAlex failure while indexing the 2026 Import

From Sea of Fate

Introduction

The failure while indexing the OpenAlex files was one of the first network-related problems with the Home Lab. Because the "I/O Storm" and the resulting network blackout were complex, it is useful to document the sequence of events.

The Initial Problem

First Symptoms

The first sign was that some SSH sessions to tayberry, the VM hosting Docker and the OpenAlex container, closed unexpectedly, while others became unresponsive and had to be force-closed. The management interface for Kiwi would not open in a web browser, and finally the Proxmox console displayed only a blinking cursor. With all management of Kiwi unresponsive, we shut it down with the R E I S U B sequence rather than pulling the plug, so as not to corrupt the ZFS drives. Using the physical keyboard attached to Kiwi, we held down Alt & PrintScreen (SysRq) and typed R E I S U B (R: take the keyboard out of raw mode, E: end all tasks with SIGTERM, I: kill the remaining tasks with SIGKILL, S: sync disks, U: remount filesystems read-only, B: reboot). After the reboot the machine showed the "insert a bootable drive" screen, probably because the BIOS boot order had been scrambled; a cold start fixed it and Kiwi booted to the normal login.
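Whether every key in the sequence is honoured depends on the kernel.sysrq setting. The sketch below checks a kernel.sysrq value against the bits that R, E, I, S, U and B need, using the bit values from the kernel's SysRq documentation. The helper name reisub_allowed is made up for this page, and the check is an illustration, not something that was run during the incident.

```shell
#!/bin/sh
# Sketch: decide whether a kernel.sysrq value allows every step of R-E-I-S-U-B.
# Bit values per the kernel SysRq documentation:
#   4 = keyboard control (R), 64 = signalling processes (E, I),
#   16 = sync (S), 32 = remount read-only (U), 128 = reboot/poweroff (B).
# reisub_allowed is a hypothetical helper name.
reisub_allowed() {
    mask=$1
    # A value of exactly 1 enables every SysRq function.
    if [ "$mask" -eq 1 ]; then
        echo yes
        return 0
    fi
    needed=$((4 + 64 + 16 + 32 + 128))
    if [ $((mask & needed)) -eq "$needed" ]; then
        echo yes
    else
        echo no
    fi
}

# Report for the running host, if the setting is readable.
if [ -r /proc/sys/kernel/sysrq ]; then
    echo "REISUB fully allowed: $(reisub_allowed "$(cat /proc/sys/kernel/sysrq)")"
fi
```

If the answer is "no", some keys in the sequence are ignored by the kernel; the sync, remount and reboot steps can still work on their own when bits 16, 32 and 128 are set.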

Root Cause: "I/O Storm" & ARC Exhaustion

The root cause was that a ZFS scrub was initiated on the 14TB mechanical drives while Tayberry was performing heavy random-write indexing into a 1.9TB OpenSearch index. The ZFS ARC (cache) was unconstrained and expanded to consume nearly all host RAM. This caused a "Kernel Stall" on Kiwi, leading to a hard lockup of the Proxmox management interface and all guest VMs (in practice, just the one VM).
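To watch for this condition before it becomes a lockup, the current ARC size can be compared with its ceiling. The following is a minimal sketch that reads the ZFS on Linux arcstats file; the helper name arc_field is made up for this page, and the output wording is only an example.

```shell
#!/bin/sh
# Sketch: report current ARC size against its ceiling, in whole GiB.
# arcstats rows have three columns: name, type, data (sizes are in bytes).
# arc_field is a hypothetical helper name.
arc_field() {
    # $1 = field name (e.g. size, c_max), $2 = path to an arcstats file
    awk -v f="$1" '$1 == f { print $3 }' "$2"
}

ARCSTATS=/proc/spl/kstat/zfs/arcstats
if [ -r "$ARCSTATS" ]; then
    size=$(arc_field size "$ARCSTATS")
    cmax=$(arc_field c_max "$ARCSTATS")
    echo "ARC in use: $((size / 1073741824)) GiB of max $((cmax / 1073741824)) GiB"
fi
# Whether a scrub is currently running is shown on the "scan:" line of: zpool status
```

Checking both before starting a large indexing job would show whether a scrub and an unconstrained ARC are about to overlap again.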

The Solution to the Initial Problem

The first step was the hard reset: the R E I S U B forced reboot of the Kiwi host, followed by a cold boot that reset the boot option back to the SSD. To stop the problem from happening again, we needed to limit the ZFS ARC to no more than 16GB of RAM so that the host OS and VM RAM remain protected during high I/O events. We will most likely cut the ZFS ARC down to 8 or 10GB, as the system has only 64GB in total. More information can be found on the ZFS Commands pages. The quick way to set the maximum ARC size is to write the value directly into the module parameters with an echo command, but it will only be in effect until a reboot:

echo 17179869184 > /sys/module/zfs/parameters/zfs_arc_max
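The value is given in bytes, and 17179869184 is 16 x 1024 x 1024 x 1024. As a small sketch, the byte counts for the 8 and 10GB options mentioned above can be worked out the same way; gib_to_bytes is a helper name made up for this page.

```shell
#!/bin/sh
# Sketch: convert a size in GiB to the byte count zfs_arc_max expects.
# gib_to_bytes is a hypothetical helper name.
gib_to_bytes() {
    echo $(( $1 * 1024 * 1024 * 1024 ))
}

gib_to_bytes 16   # 17179869184
gib_to_bytes 10   # 10737418240
gib_to_bytes 8    # 8589934592

# To apply one of these at runtime (as root; commented out so the sketch is safe to run):
#   gib_to_bytes 8 > /sys/module/zfs/parameters/zfs_arc_max
```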

To survive a reboot, the value would have to be written to /etc/modprobe.d/zfs.conf and the initramfs would need to be updated. See ARC: Understanding the RAM usage.
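A minimal sketch of the persistent setting follows, assuming a Debian-based Proxmox host where update-initramfs is available. The helper name arc_modprobe_line is made up for this page, and the commands that change the system are left commented out.

```shell
#!/bin/sh
# Sketch: build the modprobe options line for a given ARC ceiling in bytes.
# arc_modprobe_line is a hypothetical helper name.
arc_modprobe_line() {
    echo "options zfs zfs_arc_max=$1"
}

# Print the line for the 16 GiB limit used above.
arc_modprobe_line 17179869184

# To apply on the host (as root; commented out so the sketch is safe to run).
# Note that ">" overwrites any options already in the file.
#   arc_modprobe_line 17179869184 > /etc/modprobe.d/zfs.conf
#   update-initramfs -u -k all
```

After the next reboot, the value in /sys/module/zfs/parameters/zfs_arc_max should match the one written to the file.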


Secondary Failure: The Network Blackout