ZFS Commands

From Sea of Fate

Introduction

Many commands show and control how the ZFS pools run on the Proxmox hosts, so this page gives a brief guide to ZFS.

What is ZFS

To understand ZFS at a technical level, you must treat it as a combined file system and logical volume manager. This integration allows the file system to be aware of the underlying disk structure, which is the core of its data integrity features.

Physical Disks (vdev members)

The base layer consists of block devices (e.g., /dev/nvme0n1). ZFS uses Copy-on-Write (CoW) at this level. When data is modified, ZFS writes the new data to a new block and then updates the pointers, rather than overwriting the old data. Because the old blocks stay intact until the pointers are switched, a power loss mid-write leaves the previous consistent state on disk rather than a half-written one.

A VDEV (Virtual Device) is a group of physical disks arranged in a particular redundancy layout.

  • Mirror: Data is written to n disks. Read IOPS scale with the number of disks; write IOPS match a single disk.
  • RAID-Z1: Distributes data and one disk's worth of parity across n+1 disks, giving n disks of capacity. For example, three 16 TB drives as RAID-Z1 give (3-1) x 16 = 32 TB usable, and four 16 TB drives give (4-1) x 16 = 48 TB.
  • VDEV Failure: If a VDEV loses more disks than its redundancy can absorb (e.g., two disks fail in a RAID-Z1), the entire ZPool is lost because ZFS stripes data across all VDEVs in a pool (RAID 0 across VDEVs).
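  • Support VDEVs: This is where things like the L2ARC (a cache device) or a SLOG (a separate log device that holds the ZIL) live. They are "helper" pillars for the main structure but don't contribute to the storage capacity. Note that ZFS also has a vdev class literally called "special", used for metadata, which is a different thing.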
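
The RAID-Z capacity arithmetic above can be sketched as a small shell function. It is only a rough estimate: it assumes equal-sized disks and ignores ZFS metadata and padding overhead. The name raidz_usable_tb is made up for this page and is not a ZFS command.

```shell
#!/bin/sh
# Rough usable capacity of a single RAID-Z vdev, in the same unit as the disk size.
# Assumptions: all disks are the same size; ZFS overhead is ignored.
# raidz_usable_tb is a hypothetical helper, not part of ZFS.
raidz_usable_tb() {
    disks=$1     # total number of disks in the vdev
    size_tb=$2   # size of each disk in TB
    parity=$3    # 1 for RAID-Z1, 2 for RAID-Z2, 3 for RAID-Z3
    echo $(( (disks - parity) * size_tb ))
}

raidz_usable_tb 3 16 1   # three 16 TB disks as RAID-Z1: prints 32
raidz_usable_tb 4 16 1   # four 16 TB disks as RAID-Z1: prints 48
```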

ZPool (Storage Pool)

The ZPool is a logical aggregation of one or more VDEVs. All VDEVs in a pool share a single "Free Space" map. Adding a new VDEV to a pool increases the pool's capacity immediately, and new writes are spread across the extra VDEV, which raises performance (IOPS). Existing data is not rebalanced automatically.

  • You don't "size" a ZFS pool; it is simply the sum of its VDEVs.
  • If you add a second Mirror VDEV to an existing Pool, the Pool grows instantly.

How this applies to Pear's array of three 16 TB hard drives

  • Each of the drives will be shown as about 14.5 TiB, because ZFS reports sizes in binary units while a 16 TB drive holds 16 x 10^12 bytes
  • The usable capacity will be about 29 TiB and can be shown with the command
zfs list pearpool
  • The pool's raw capacity, parity included, will be shown as about 43.7 TiB with the command
zpool list pearpool
  • To show the individual components of the vdev, including the L2ARC cache, we use the verbose switch
zpool list -v pearpool
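
The gap between the 16 TB on the drive label and the roughly 14.5 that ZFS reports comes from decimal versus binary units. A minimal sketch of the conversion, where tb_to_tib is a helper name invented here:

```shell
#!/bin/sh
# Convert decimal terabytes (10^12 bytes, as printed on the drive label)
# to binary tebibytes (2^40 bytes, which the ZFS tools show as "T").
# tb_to_tib is a hypothetical helper name.
tb_to_tib() {
    awk -v tb="$1" 'BEGIN { printf "%.2f\n", tb * 1e12 / (2 ^ 40) }'
}

tb_to_tib 16   # one drive: prints 14.55
tb_to_tib 48   # raw, three drives: prints 43.66
tb_to_tib 32   # usable, two drives' worth after RAID-Z1 parity: prints 29.10
```

The real figures from zfs list will come out slightly lower again, since ZFS keeps some space back for metadata.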

Datasets and Zvols

In the ZFS world, everything sitting on top of a Pool is technically a "Dataset," but they diverge into two distinct types based on how they present storage to the operating system.

ZFS Datasets (The Filesystem Layer)

A "Dataset" (often specifically called a Filesystem Dataset) is a POSIX-compliant filesystem.

  • How they behave: They act like high-powered folders. When you create one (e.g., zfs create pearpool/archive), it is instantly mounted as a directory in Linux.
  • Property Inheritance: Datasets exist in a tree. If you set compression=lz4 on the parent pool, every dataset you create under it inherits that setting automatically.
  • No Fixed Size: Unlike a partition, a dataset has no "size." It simply takes what it needs from the pool's free space. You use Quotas to stop it from eating the whole pool.
  • Proxmox Use Case: Proxmox uses Datasets for LXC Containers. Because containers share the host's kernel, they can write directly to a ZFS filesystem. It also uses them for "Directory" storage (to hold ISOs or .qcow2 files).
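
The dataset behaviour described above (creation, quotas, inherited properties) might look like this on the command line. This is a sketch only: the run wrapper and the DRY_RUN variable are conventions invented for this page so that nothing touches the pool by default, and the 100G quota is an arbitrary example value.

```shell
#!/bin/sh
# Dry-run wrapper: prints each command instead of running it unless DRY_RUN=0.
# "run" and DRY_RUN are made up for this sketch; they are not ZFS features.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Create a dataset; it inherits properties such as compression from its parent.
run zfs create pearpool/archive
# Cap how much of the pool it may consume (100G is an arbitrary example).
run zfs set quota=100G pearpool/archive
# Show the effective compression setting; the SOURCE column says where it came from.
run zfs get compression pearpool/archive
```

Setting DRY_RUN=0 in the environment would execute the commands for real, so check the printed output first.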

ZVOLs (The Block Layer)

A ZVOL (ZFS Volume) is a Dataset that represents a Block Device.

  • How they behave: Instead of appearing as a folder, a ZVOL appears as a raw disk at a path such as /dev/zvol/pearpool/vm-101-disk-0.
  • Fixed Size: You must define a size when you create a ZVOL (e.g., zfs create -V 50G pearpool/win11-disk). The OS using it thinks it is a physical 50GB hard drive.
  • Abstraction: A ZVOL allows you to run a different filesystem (like NTFS for Windows or EXT4 for Linux VMs) on top of ZFS. ZFS manages the blocks, but the VM manages its own files.
  • Proxmox Use Case: This is the default for VMs (KVM). When you create a VM on ZFS storage, Proxmox creates a ZVOL. This allows the VM to treat the storage as a real SCSI/VirtIO hardware disk.


ZFS Health and Maintenance

ZFS is a "self-healing" system, but it requires the administrator to monitor its signals. Because ZFS manages its own redundancy, standard Linux disk tools (like df -h) often provide incomplete information.

Monitoring the "Pulse"

To see if your pool is healthy, you must check the Pool Status. This is your primary diagnostic tool.

Command:

zpool status pearpool

What to look for:

  • STATE: Should always be ONLINE. If it says DEGRADED, a disk has failed or is disconnected.
  • READ/WRITE/CKSUM Errors: These numbers should be 0.
  • CKSUM (Checksum) Errors: If this is non-zero, ZFS has detected "Silent Data Corruption" and, where redundancy was available, fixed it using the parity/mirror data. It is a warning that a cable or a disk is starting to fail.
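
For a quick check across every pool, the health state can also be read in script-friendly form. The sketch below assumes input in the tab-separated format produced by zpool list -H -o name,health; the function name unhealthy_pools is hypothetical.

```shell
#!/bin/sh
# Reads "name<TAB>health" lines on stdin (the format of
# `zpool list -H -o name,health`) and prints every pool that is not ONLINE.
# Exits non-zero if at least one such pool was found.
# unhealthy_pools is a hypothetical helper name.
unhealthy_pools() {
    awk -F '\t' '$2 != "ONLINE" { print $1 ": " $2; bad = 1 } END { exit bad }'
}

# Intended use on the host:
#   zpool list -H -o name,health | unhealthy_pools || echo "run zpool status for details"
```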

Active Self-Healing

A "Scrub" is a deep-scan of the entire pool. ZFS reads every single block and compares it against its mathematical checksum. If it finds a mismatch, it repairs the data automatically.

Command:

zpool scrub pearpool

When to do it:

  • Proxmox usually schedules this monthly, but you should run it manually if you experience a hard power cut or suspect a disk is acting up.
  • Performance: You can still use the server during a scrub, though disk latency may increase slightly.
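
Before starting a manual scrub it can be worth checking whether one is already running. A minimal sketch, assuming the scan line of zpool status contains the phrase "scrub in progress" as current OpenZFS releases print it; scrub_running is a name invented for this page.

```shell
#!/bin/sh
# Reads `zpool status` output on stdin and succeeds if a scrub is running.
# Assumption: the scan line contains the phrase "scrub in progress".
# scrub_running is a hypothetical helper name.
scrub_running() {
    grep -q 'scrub in progress'
}

# Intended use on the host: start a scrub only if none is running yet.
#   zpool status pearpool | scrub_running || zpool scrub pearpool
```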

The 80% Rule

ZFS uses a Copy-on-Write mechanism. To write new data, it needs to find empty, contiguous blocks of space.

The Threshold: Once a pool exceeds roughly 80% capacity, ZFS spends more time searching for "holes" to write data into, which can slow writes down noticeably.

The Danger Zone: At around 90% and beyond, the allocator has to fall back to a slower, more exhaustive search for free space (the exact threshold varies between ZFS versions). The resulting fragmentation can make a fast NVMe drive feel like an old mechanical 5400rpm disk.

Monitoring: Use zpool list on pearpool or kiwipool to watch the CAP (Capacity) column.

zpool list pearpool
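
The 80% and 90% marks can be turned into a simple check. The sketch below assumes input in the format of zpool list -H -o name,capacity (pool name, a tab, then a percentage such as 45%); capacity_check is a hypothetical helper, and the thresholds are the rules of thumb from this section rather than anything ZFS enforces.

```shell
#!/bin/sh
# Reads "name<TAB>capacity" lines on stdin, as produced by
# `zpool list -H -o name,capacity`, and labels each pool against
# the 80% and 90% rules of thumb.
# capacity_check is a hypothetical helper name.
capacity_check() {
    awk -F '\t' '{
        pct = $2; sub(/%/, "", pct); pct += 0
        if (pct >= 90)      print $1 ": " pct "% DANGER"
        else if (pct >= 80) print $1 ": " pct "% WARNING"
        else                print $1 ": " pct "% ok"
    }'
}

# Intended use on the host:
#   zpool list -H -o name,capacity | capacity_check
```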

ARC: Understanding the RAM usage

ZFS uses the Adaptive Replacement Cache (ARC). It will attempt to use as much free RAM as possible to cache your most frequently used data.

The Misconception: Proxmox may report 95% RAM usage. This is usually ZFS being efficient, not a "memory leak."

The Handshake: If a VM or LXC container needs that RAM, ZFS will release it on demand, although shrinking the ARC is not always instantaneous.

We can check the cache efficiency with the command

arcstat
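
Where arcstat is not installed, an overall hit ratio can be worked out from the raw counters. This is a sketch under the assumption that the counters are read from /proc/spl/kstat/zfs/arcstats, where each line has the form "name type value"; arc_hit_ratio is a name made up for this page.

```shell
#!/bin/sh
# Reads ARC kstat lines ("name type value") on stdin and prints the overall
# hit ratio as a percentage, or "n/a" if no hits or misses were recorded.
# arc_hit_ratio is a hypothetical helper name.
arc_hit_ratio() {
    awk '$1 == "hits"   { h = $3 }
         $1 == "misses" { m = $3 }
         END {
             if (h + m > 0) printf "%.1f\n", 100 * h / (h + m)
             else print "n/a"
         }'
}

# Intended use on the host:
#   arc_hit_ratio < /proc/spl/kstat/zfs/arcstats
```

The counters are cumulative since boot, so this gives a long-run average rather than the per-interval view arcstat provides.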

If the ARC is consuming more RAM than we would like, even though it is designed to give memory back when needed, we can place a "hard cap" on it. On a Proxmox host like Pear, this is done by modifying the ZFS module parameters. We don't "remove" the ARC (ZFS needs a cache to function), but we can strictly limit its growth.

To change the kernel module parameter in real time, we echo the limit in bytes directly to it, e.g. 8 x 1024^3 = 8589934592 for 8 GiB. NB: the value doesn't technically need to be a multiple of 1024, but because the OS manages RAM in pages it makes sense to stick to whole gigabytes (1024 x 1024 x 1024 x the number of gigabytes).

echo 8589934592 > /sys/module/zfs/parameters/zfs_arc_max

To see what the current value is (the one the kernel is actually using right now), we can cat that same location:

cat /sys/module/zfs/parameters/zfs_arc_max
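
To avoid typing the byte count by hand, the multiplication can be left to the shell. A minimal sketch; gib_to_bytes is a helper name invented here.

```shell
#!/bin/sh
# Convert a whole number of GiB to bytes (1 GiB = 1024 x 1024 x 1024 bytes).
# gib_to_bytes is a hypothetical helper name.
gib_to_bytes() {
    echo $(( $1 * 1024 * 1024 * 1024 ))
}

gib_to_bytes 8   # prints 8589934592

# Intended use on the host, as root:
#   gib_to_bytes 8 > /sys/module/zfs/parameters/zfs_arc_max
```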

The "Ephemeral" Nature

Because zfs_arc_max is set through the /sys/ directory (a virtual filesystem that lives in RAM), this change will be lost when Pear reboots. To make it permanent, we have to move the setting from the virtual /sys/ folder to a real configuration file on the disk:

  • Create the file:
nano /etc/modprobe.d/zfs.conf
  • Write the option:
options zfs zfs_arc_max=8589934592
  • Rebuild the initramfs so the option is picked up at the next boot:
update-initramfs -u

This ensures ZFS loads with this limit the moment the server turns on.
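
If the limit is changed more than once, editing the file by hand can leave two conflicting lines behind. The sketch below replaces any earlier value instead of appending. It assumes each option sits on its own line in the file; set_arc_max is a hypothetical helper, and the real path should only be passed when running as root on the host.

```shell
#!/bin/sh
# Writes (or replaces) the zfs_arc_max option in a modprobe config file.
# Assumption: zfs_arc_max is on a line of its own, so dropping that line
# does not remove any other option.
# set_arc_max is a hypothetical helper name.
set_arc_max() {
    conf=$1    # e.g. /etc/modprobe.d/zfs.conf
    bytes=$2   # limit in bytes, e.g. 8589934592
    touch "$conf"
    # Drop any earlier zfs_arc_max line so the file never holds two values.
    grep -v 'zfs_arc_max=' "$conf" > "$conf.tmp" || true
    echo "options zfs zfs_arc_max=$bytes" >> "$conf.tmp"
    mv "$conf.tmp" "$conf"
}

# Intended use on the host, as root, followed by the initramfs rebuild:
#   set_arc_max /etc/modprobe.d/zfs.conf 8589934592
#   update-initramfs -u
```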

Managing Snapshots (The "Time Machine")

Snapshots allow you to freeze a Dataset or ZVOL in time. They are especially useful before making changes to a VM. Snapshots only consume space when the original data changes. If you delete 10GB of files but keep the snapshot, that 10GB is not freed until the snapshot is destroyed. Note that snapshots created in the CLI will not be shown on the dashboard, but snapshots made in the dashboard will be shown in both.

Create a snapshot for vm-127-disk-0 with the label ready_to_update:

zfs snapshot pearpool/PFP/pdata/VMHDS/vm-127-disk-0@ready_to_update

Delete:

zfs destroy pearpool/PFP/pdata/VMHDS/vm-127-disk-0@ready_to_update

List:

zfs list -t snapshot

This lists all the snapshots on the system, including pearpool/PFP/pdata/VMHDS/vm-127-disk-0@ready_to_update. To list only the snapshots on pearpool, we combine -t snapshot (show only snapshots) with -r pearpool (recurse through pearpool only):

zfs list -t snapshot -r pearpool

If we only want to see snapshots for a specific VM (e.g., VM 127) on pearpool, we can give the exact path:

zfs list -t snapshot -r pearpool/PFP/pdata/VMHDS/vm-127-disk-0
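
Because snapshots hold on to space as the live data changes, it can help to see which one is pinning the most. The sketch below assumes input from zfs list -Hp -t snapshot -o name,used, which prints the name, a tab, and the exact byte count; largest_snapshot is a name invented for this page. Note that a snapshot's "used" value counts only the space unique to that snapshot, so the values do not add up to the total held by all snapshots.

```shell
#!/bin/sh
# Reads "name<TAB>used_bytes" lines on stdin, as produced by
# `zfs list -Hp -t snapshot -o name,used -r <dataset>`, and prints the name
# of the snapshot with the largest "used" value (space unique to it).
# largest_snapshot is a hypothetical helper name.
largest_snapshot() {
    awk -F '\t' '$2 + 0 >= max { max = $2 + 0; name = $1 }
                 END { if (name != "") print name }'
}

# Intended use on the host:
#   zfs list -Hp -t snapshot -o name,used -r pearpool | largest_snapshot
```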