Kiwi & OpenAlex failure while indexing the 2026 Import


Introduction

The failure while indexing the OpenAlex files was one of the first network-related problems with the Home Lab. Because the "I/O Storm" and the resulting network blackout were complex, it is useful to document the sequence of events.

The Initial Problem

First Symptoms

The first sign was that some of the SSH sessions to Tayberry, the VM hosting Docker and the OpenAlex container, closed unexpectedly, while others became unresponsive and had to be forced closed. The management interface for Kiwi would not open in a web browser, and finally the Proxmox console displayed only a blinking cursor.

With all management of Kiwi unresponsive, we shut the host down with the REISUB sequence rather than pulling the plug, so as not to corrupt the ZFS drives. Using the physical keyboard attached to Kiwi, we held down Alt and PrintScreen (SysRq) and typed R E I S U B: R takes the keyboard out of raw mode, E sends SIGTERM to all processes, I sends SIGKILL to all processes, S syncs the disks, U remounts the filesystems read-only, and B reboots. When the machine rebooted it showed the "insert a bootable drive" screen, probably because the BIOS boot order had become scrambled. A cold start fixed it and the machine booted to the normal login.
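The magic SysRq keys only act if the running kernel permits them. As a minimal sketch, the helper below tests a kernel.sysrq value against the bitmask described in the Linux kernel's SysRq documentation (1 enables every function, 0 disables all, otherwise each function group has its own bit). The name sysrq_allows is an illustrative helper, not part of any existing tool.

```shell
#!/bin/sh
# Bit values from the kernel's SysRq documentation:
#   4 = keyboard control (R), 16 = sync (S), 32 = remount read-only (U),
#   64 = signalling processes (E and I), 128 = reboot/poweroff (B)

# sysrq_allows MASK BIT: succeed if the sysrq setting MASK permits BIT.
sysrq_allows() {
    mask=$1
    bit=$2
    [ "$mask" -eq 1 ] && return 0      # 1 means every function is enabled
    [ $(( mask & bit )) -ne 0 ]
}

# Report which REISUB steps the running kernel would accept.
if [ -r /proc/sys/kernel/sysrq ]; then
    current=$(cat /proc/sys/kernel/sysrq)
    for pair in R:4 E:64 I:64 S:16 U:32 B:128; do
        key=${pair%%:*}
        bit=${pair##*:}
        if sysrq_allows "$current" "$bit"; then
            echo "$key allowed"
        else
            echo "$key blocked"
        fi
    done
fi
```

A value of 438, for example, permits R, S, U and B but not the E and I steps, so it is worth checking the setting on Kiwi before the sequence is needed again.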

Root Cause: "I/O Storm" & ARC Exhaustion

The root cause was that a ZFS scrub started on the 14TB mechanical drives while Tayberry was performing heavy random-write indexing into a 1.9TB OpenSearch index. The ZFS ARC (cache) was unconstrained and expanded to consume nearly all host RAM. This caused a "Kernel Stall" on Kiwi, leading to a hard lockup of the Proxmox management interface and all guest VMs (in practice there was only one VM).
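Because the stall came from a scrub overlapping with heavy indexing, a pre-flight check before starting a large indexing run could look for an active scrub. The sketch below only inspects the text printed by zpool status for the "scrub in progress" marker; the function name scrub_in_progress and the pool name tank are placeholders, not names taken from Kiwi.

```shell
#!/bin/sh
# scrub_in_progress: read `zpool status` text on stdin and succeed
# if it reports a running scrub.
scrub_in_progress() {
    grep -q 'scrub in progress'
}

# Example use on the host (pool name "tank" is a placeholder):
#   if zpool status tank | scrub_in_progress; then
#       zpool scrub -p tank    # pause the scrub; `zpool scrub tank` resumes it
#   fi
```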

The Solution to the Initial Problem

First, the REISUB sequence forced a reboot of the Kiwi host, and a subsequent cold boot reset the boot option back to the SSD. To stop the problem from happening again we needed to limit the ZFS ARC to no more than 16GB of RAM, so that the host OS and VM RAM remain protected during high I/O events. We will most likely cut the ZFS ARC down to 8 or 10GB, as the system has only 64GB in total. More information can be found on the ZFS Commands pages. The quick way to set the maximum ARC size is to write the value directly into the module parameters with echo, although this only stays in effect until the next reboot:

echo 17179869184 > /sys/module/zfs/parameters/zfs_arc_max

To make the limit permanent, the value has to be written to /etc/modprobe.d/zfs.conf and the initramfs has to be updated. See ARC: Understanding the RAM usage.
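The byte values used on this page are whole GiB multiples, which is easy to get wrong by hand. As a rough sketch, the helpers below compute the value and read back a counter from the ARC statistics file; gib_to_bytes and arc_stat are illustrative names, and the statistics path is the usual OpenZFS location on Linux, assumed here rather than checked on Kiwi.

```shell
#!/bin/sh
# gib_to_bytes N: print N GiB expressed in bytes.
gib_to_bytes() {
    echo $(( $1 * 1024 * 1024 * 1024 ))
}

# arc_stat NAME [FILE]: print one counter from the ARC statistics file.
# The file has "name type data" columns; the value is the third column.
arc_stat() {
    awk -v key="$1" '$1 == key { print $3 }' "${2:-/proc/spl/kstat/zfs/arcstats}"
}

gib_to_bytes 16    # 17179869184, the value used in the echo command
gib_to_bytes 8     # 8589934592, the planned later limit

# On the host, as root, the limit could be applied and inspected like this:
#   gib_to_bytes 16 > /sys/module/zfs/parameters/zfs_arc_max
#   arc_stat c_max   # configured ceiling in bytes
#   arc_stat size    # current ARC size in bytes
```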

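  • When OpenAlex has finished indexing we will cut the ARC down to 8GB by editing the zfs_arc_max line in /etc/modprobe.d/zfs.conf:

nano /etc/modprobe.d/zfs.conf

Change the zfs_arc_max line to

options zfs zfs_arc_max=8589934592

Save and close the file, then update the initramfs, apply the new value to the running system and check it: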

update-initramfs -u
echo 8589934592 > /sys/module/zfs/parameters/zfs_arc_max
cat /sys/module/zfs/parameters/zfs_arc_max
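The manual edit above could also be scripted so that it is repeatable. The following is a minimal sketch, assuming the file holds at most one uncommented "options zfs" line that mentions zfs_arc_max; set_arc_max_line is a hypothetical helper name.

```shell
#!/bin/sh
# set_arc_max_line FILE BYTES: set zfs_arc_max=BYTES in FILE, replacing an
# existing value on an "options zfs" line or appending a new line.
set_arc_max_line() {
    file=$1
    bytes=$2
    if grep -q '^options zfs .*zfs_arc_max=' "$file" 2>/dev/null; then
        tmp=$(mktemp)
        # Rewrite only the number; other options on the line are kept.
        sed "/^options zfs /s/zfs_arc_max=[0-9]*/zfs_arc_max=$bytes/" "$file" > "$tmp"
        cat "$tmp" > "$file"
        rm -f "$tmp"
    else
        echo "options zfs zfs_arc_max=$bytes" >> "$file"
    fi
}

# Example use on the host, as root:
#   set_arc_max_line /etc/modprobe.d/zfs.conf 8589934592 && update-initramfs -u
```

Running it twice with different values leaves a single line holding the latest value, which avoids duplicate entries building up in the file.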

Secondary Failure: The Network Blackout

On reboot Tayberry could not ping any host outside Kiwi, but Kiwi could ping other hosts. Both share the same physical cable and vmbr0, so this was not a physical problem. The only difference between them is that Kiwi is untagged and Tayberry is tagged to VLAN 100. Tayberry had an IP address, its NIC was up, and the settings looked correct: vmbr0 was VLAN aware and Tayberry was assigned to the correct VLAN. However, the VLAN-aware bridge (vmbr0) on Kiwi had not automatically restored the tagging for the trunk port (nic0). On Kiwi we ran:

bridge vlan show

The output did not show VLAN100, as it should have done given the software-defined network and Tayberry's NIC being assigned to VLAN100 (no tap200i0 devices were present). To restore VLAN100 we used the command

bridge vlan add dev nic0 vid 100

This then allowed Tayberry to ping the other VMs within VLAN100 on Pear. VLAN100 should have been included in the VLAN IDs in the GUI definition of vmbr0, and there should have been a corresponding line in the file /etc/network/interfaces. To find out why it had not been added properly we checked the file with

nano /etc/network/interfaces

We noticed that the bridge-vids line was missing, so we added it underneath the "vlan aware" line:

bridge-vids 2-4094

and then reloaded the network configuration with

ifreload -a
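After a reload or reboot, one way to confirm that the trunk port still carries VLAN 100 is to scan the bridge vlan show output for that ID. This is a sketch only: it assumes the default, uncompressed listing in which every VLAN ID is printed individually, and port_has_vlan is an illustrative helper name.

```shell
#!/bin/sh
# port_has_vlan VID: read `bridge vlan show dev PORT` text on stdin and
# succeed if VID appears as a whole field on any line.
port_has_vlan() {
    awk -v vid="$1" '
        { for (i = 1; i <= NF; i++) if ($i == vid) found = 1 }
        END { exit found ? 0 : 1 }
    '
}

# Example use on Kiwi:
#   if bridge vlan show dev nic0 | port_has_vlan 100; then
#       echo "nic0 carries VLAN 100"
#   else
#       echo "VLAN 100 missing on nic0"
#   fi
```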

Assigning a VLAN to a VM in the Proxmox GUI only tags the virtual interface; if the physical bridge port (nic0) is not configured as a trunk (via bridge-vids), the bridge drops the tagged packets before they reach the physical wire. With these changes in place Tayberry was able to ping any IP address on VLAN100. The next problem was that Tayberry could not reach the DNS server or any other VLAN. On Tayberry, the command

ip route

showed that the default route had been dropped. Tayberry is a Debian server build with old-style networking, so the interface is defined in /etc/network/interfaces. The easiest way to restore the gateway is the command:

sudo ip route add default via 192.168.100.1 dev eth0

With this done we could finally ping Blackberry or any other host on Pear. To make sure the default route survives a reboot, check the file

/etc/network/interfaces

Under iface eth0 inet static, ensure there is a line: gateway 192.168.100.1
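Both checks in this section, the live default route and the persistent gateway line, can be scripted. The sketch below assumes the ifupdown file format described above; has_default_route and iface_gateway are hypothetical helper names.

```shell
#!/bin/sh
# has_default_route: read `ip route` text on stdin; succeed if a default
# route is present.
has_default_route() {
    grep -q '^default '
}

# iface_gateway IFACE [FILE]: print the gateway configured in IFACE's stanza
# of an ifupdown-style interfaces file (prints nothing if none is set).
iface_gateway() {
    awk -v ifc="$1" '
        $1 == "iface" { in_stanza = ($2 == ifc) }   # a new stanza starts
        in_stanza && $1 == "gateway" { print $2 }
    ' "${2:-/etc/network/interfaces}"
}

# Example use on Tayberry:
#   ip route | has_default_route || echo "default route missing"
#   iface_gateway eth0    # expected to print 192.168.100.1
```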

The Third Problem: Service Recovery & OpenSearch Integrity

When the ZFS HDDs locked up, there was a risk of data corruption or "Stale File Handles" on the direct-mounted 5TB drive (/dev/sda) following the hard reset. However, because that drive was formatted with a journaling file system (or ZFS) and we performed a Sync (S) during the REISUB, the OpenSearch data remained consistent and the cluster returned to GREEN immediately.

The Solution

  • Health Check: Verified cluster status as GREEN using the OpenSearch Health API.
  • Resume Indexing: Modified the indexing script to jump to file 1274 (to account for the ~341M records already committed) and used ionice -c 3 to prevent the indexer from hogging disk bandwidth again.
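  • Note: Resuming at file 1274 may result in an 'Overlap Zone' where the indexer processes records already in the database. OpenSearch handles this via versioning, provided the indexer writes each record under the same document ID as before: existing documents are overwritten with a new version, so the _count will not increase until the indexer reaches truly new data.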
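The health check and the resume step above could be sketched roughly as follows. The indexing script itself is not shown on this page, so the URL, the data directory, the script name index_file.sh and the helper names cluster_status and files_from are all placeholders; the sketch also assumes that the files are processed in sorted order and that 1274 is a 1-based position in that order.

```shell
#!/bin/sh
# cluster_status: read a _cluster/health JSON response on stdin and print
# the value of its "status" field (green, yellow or red).
cluster_status() {
    sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([a-z]*\)".*/\1/p'
}

# files_from N: read a file list on stdin and print it starting at the
# N-th entry (1-based), skipping everything before it.
files_from() {
    tail -n "+$1"
}

# Example use on Tayberry (URL, directory and script name are placeholders):
#   curl -s http://localhost:9200/_cluster/health | cluster_status
#   ls /data/openalex/*.gz | sort | files_from 1274 |
#       while read -r f; do ionice -c 3 ./index_file.sh "$f"; done
```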


📝 Final Documentation Table

Failure Layer | Symptoms                    | Root Cause                   | Permanent Fix
Hardware      | BIOS "Bootable drive" error | Cold boot confusion          | Reset boot order in BIOS
Host OS       | Blinking cursor / Lockup    | ARC Exhaustion (60GB/64GB)   | Capped ARC to 16GB in zfs.conf
Network       | Tagged traffic dropped      | Missing bridge-vids on Trunk | Added bridge-vids 2-4094 to Kiwi interfaces
Guest OS      | No internet/DNS             | Gateway route dropped        | Added gateway to Tayberry interfaces file
App           | Stopped Indexing            | Host Crash                   | Resume via script with ionice -c 3