ZFS Commands

Introduction

Many commands show and control how the ZFS pools run on the Proxmox hosts, so this page is a brief guide to ZFS and the commands we use most.

What is ZFS

To understand ZFS at a technical level, you must treat it as a combined file system and logical volume manager. This integration allows the file system to be aware of the underlying disk structure, which is the core of its data integrity features.

Physical Disks (vdev members)

The base layer consists of block devices (e.g., /dev/nvme0n1). ZFS writes to them using Copy-on-Write (CoW): when data is modified, ZFS writes the new data to a new block and then updates the pointers, rather than overwriting the old data. If a write is interrupted (for example by a power loss), the previous consistent version is still in place.

A VDEV (Virtual Device) is a group of physical disks arranged in a particular redundancy layout.

  • Mirror: Data is written to n disks. Read IOPS scale with the number of disks; write IOPS match a single disk.
  • RAID-Z1: Distributes data and a single disk's worth of parity across n+1 disks, providing n disks of capacity. E.g., three 16TB HDs as Z1 give (3-1) x 16 = 32TB usable capacity, and four would give (4-1) x 16 = 48TB.
  • VDEV Failure: If a VDEV loses more disks than its redundancy can absorb (e.g., two disks fail in a RAID-Z1), the entire ZPool is lost because ZFS stripes data across all VDEVs in a pool (RAID 0 across VDEVs).
  • Auxiliary VDEVs: This is where a cache device (L2ARC) or a separate log device (SLOG, which holds the ZIL) lives. They are "helper" pillars for the main structure but don't contribute to the storage capacity.
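The RAID-Z capacity arithmetic above can be sketched as a small shell function. This is only an approximation: the function name raidz_usable is our own, and it ignores ZFS metadata, padding and reserved space, so real pools report slightly less.

```shell
#!/usr/bin/env bash
# Approximate usable capacity of one RAID-Z vdev:
#   (number_of_disks - parity_disks) * disk_size
# raidz_usable is a hypothetical helper, not a ZFS command.
raidz_usable() {
    local disks=$1 size_tb=$2 parity=${3:-1}   # parity defaults to 1 (RAID-Z1)
    echo $(( (disks - parity) * size_tb ))
}

raidz_usable 3 16      # three 16 TB disks as RAID-Z1
raidz_usable 4 16      # four 16 TB disks as RAID-Z1
raidz_usable 6 16 2    # six 16 TB disks as RAID-Z2
```

The third argument lets the same sum be tried for RAID-Z2 or RAID-Z3 layouts.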

ZPool (Storage Pool)

The ZPool is a logical aggregation of one or more VDEVs. All VDEVs in a pool share a single "Free Space" map. Adding a new VDEV to a pool increases the pool's capacity immediately, and new writes are spread across the extra VDEV, which raises performance (IOPS). Existing data is not rebalanced automatically.

  • You don't "size" a ZFS pool; it is simply the sum of its VDEVs.
  • If you add a second Mirror VDEV to an existing Pool, the Pool grows instantly.

How this applies to Pear's array of three 16 TB hard drives

  • Each of the drives will be shown as about 14.5TB, because ZFS reports sizes in binary units (16 TB is about 14.5 TiB)
  • The usable capacity will be about 29 TB and can be shown with the command:
zfs list pearpool
  • The pool will be shown with a raw capacity of 43.7 TB (parity included) with the command:
zpool list pearpool
  • To show the individual components of the vdev, including the L2ARC cache, we use the verbose switch:
zpool list -v pearpool
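The gap between the 16 TB on the drive label and the sizes ZFS reports comes from decimal versus binary units. As a minimal sketch (tb_to_tib is our own helper name, and awk is assumed to be available), the conversion is:

```shell
#!/usr/bin/env bash
# Convert decimal terabytes (10^12 bytes) to binary tebibytes (2^40 bytes),
# the unit ZFS tools use when they print "T".
# tb_to_tib is a hypothetical helper, not a ZFS command.
tb_to_tib() {
    awk -v tb="$1" 'BEGIN { printf "%.2f\n", tb * 1e12 / 2^40 }'
}

tb_to_tib 16    # one drive
tb_to_tib 32    # usable capacity of three drives in RAID-Z1
tb_to_tib 48    # raw capacity of three drives
```

The real figures from zfs list will be a little lower than the 32 TB conversion, since ZFS also reserves some space for itself.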

Datasets and Zvols

In the ZFS world, everything sitting on top of a Pool is technically a "Dataset," but they diverge into two distinct types based on how they present storage to the operating system.

ZFS Datasets (The Filesystem Layer)

A "Dataset" (often specifically called a Filesystem Dataset) is a POSIX-compliant filesystem.

  • How they behave: They act like high-powered folders. When you create one (e.g., zfs create pearpool/archive), it is instantly mounted as a directory in Linux.
  • Property Inheritance: Datasets exist in a tree. If you set compression=lz4 on the parent pool, every dataset you create under it inherits that setting automatically.
  • No Fixed Size: Unlike a partition, a dataset has no "size." It simply takes what it needs from the pool's free space. You use Quotas to stop it from eating the whole pool.
  • Proxmox Use Case: Proxmox uses Datasets for LXC Containers. Because containers share the host's kernel, they can write directly to a ZFS filesystem. It also uses them for "Directory" storage (to hold ISOs or .qcow2 files).

ZVOLs (The Block Layer)

A ZVOL (ZFS Volume) is a Dataset that represents a Block Device.

  • How they behave: Instead of appearing as a folder, a ZVOL appears as a block device, e.g. /dev/zvol/pearpool/vm-101-disk-0.
  • Fixed Size: You must define a size when you create a ZVOL (e.g., zfs create -V 50G pearpool/win11-disk). The OS using it thinks it is a physical 50GB hard drive.
  • Abstraction: A ZVOL allows you to run a different filesystem (like NTFS for Windows or EXT4 for Linux VMs) on top of ZFS. ZFS manages the blocks, but the VM manages its own files.
  • Proxmox Use Case: This is the default for VMs (KVM). When you create a VM on ZFS storage, Proxmox creates a ZVOL. This allows the VM to treat the storage as a real SCSI/VirtIO hardware disk.


ZFS Health and Maintenance

ZFS is a "self-healing" system, but it requires the administrator to monitor its signals. Because ZFS manages its own redundancy, standard Linux disk tools (like df -h) often provide incomplete information.

Monitoring the "Pulse"

To see if your pool is healthy, you must check the Pool Status. This is your primary diagnostic tool.

Command:

zpool status pearpool

What to look for:

  • STATE: Should always be ONLINE. If it says DEGRADED, a disk has failed or is disconnected.
  • READ/WRITE/CKSUM Errors: These numbers should all be 0.
  • CKSUM (Checksum) Errors: If this is non-zero, ZFS has detected "Silent Data Corruption" and, where redundancy was available, repaired it using the parity/mirror data. It is a warning that a cable or a disk is starting to fail.
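The checks above can be sketched as a filter that reads zpool status output and prints only the devices that need attention. The function name zpool_problems and the output format are our own. It assumes the usual NAME / STATE / READ / WRITE / CKSUM column layout.

```shell
#!/usr/bin/env bash
# Read `zpool status` output on stdin and print every pool, vdev or disk
# that is not ONLINE or has a non-zero READ, WRITE or CKSUM counter.
# zpool_problems is a hypothetical helper, not a ZFS command.
zpool_problems() {
    awk '
        # Device rows have at least 5 columns and a known state in column 2.
        NF >= 5 && $2 ~ /^(ONLINE|DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED)$/ {
            if ($2 != "ONLINE" || $3 != "0" || $4 != "0" || $5 != "0")
                print $1, $2, "read=" $3, "write=" $4, "cksum=" $5
        }'
}

# Usage on a host with ZFS installed:
#   zpool status pearpool | zpool_problems
```

No output means every listed device is ONLINE with zero errors. For a quick yes/no answer, zpool status -x only reports pools that have problems.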

Active Self-Healing

A "Scrub" is a deep-scan of the entire pool. ZFS reads every single block and compares it against its mathematical checksum. If it finds a mismatch, it repairs the data automatically.

Command:

zpool scrub pearpool

When to do it:

  • Proxmox usually schedules this monthly, but you should run it manually if you experience a hard power cut or suspect a disk is acting up.
  • Performance: You can still use the server during a scrub, though disk latency may increase slightly.

The 80% Rule

ZFS uses a Copy-on-Write mechanism. To write new data, it needs to find empty, contiguous blocks of space.

The Threshold: Once a pool exceeds about 80% capacity, ZFS spends more time searching for "holes" to write data into, which noticeably slows down performance.

The Danger Zone: As the pool fills beyond roughly 90%, ZFS switches to a slower, space-conserving allocation strategy and the free space becomes heavily fragmented. This "Fragmentation" can make a fast NVMe drive feel like an old mechanical 5400rpm disk.

Monitoring: Use zpool list on pearpool (or kiwipool) and watch the CAP (Capacity) column.

zpool list pearpool
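To watch the 80% rule without reading the table by eye, the CAP column can be filtered in a script. This is a sketch: pools_over is our own helper name, and it relies on zpool list -H -o name,capacity printing one "name, percentage" pair per line.

```shell
#!/usr/bin/env bash
# Read "name<TAB>capacity%" lines on stdin and print the pools at or above
# a threshold (default 80).
# pools_over is a hypothetical helper, not a ZFS command.
pools_over() {
    local limit=${1:-80}
    awk -v limit="$limit" '{
        pct = $2
        sub(/%$/, "", pct)              # strip the trailing percent sign
        if (pct + 0 >= limit + 0) print $1, $2
    }'
}

# Usage on a host with ZFS installed:
#   zpool list -H -o name,capacity | pools_over 80
```

An empty result means every pool is below the threshold.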

ARC: Understanding the RAM usage

ZFS uses the Adaptive Replacement Cache (ARC). It will attempt to use as much free RAM as possible to cache your most frequently used data.

The Misconception: We may see Proxmox reporting 95% RAM usage. This is usually ZFS being efficient, not a "memory leak."

The Handshake: If a VM or LXC container needs that RAM, ZFS will shrink the ARC to release it. We can check cache efficiency with the command:

arcstat

If the ARC is consuming more RAM than we would like, even though it is designed to give memory back when needed, we can place a "hard cap" on it. On a Proxmox host like Pear, this is done by modifying the ZFS module parameters. We don't "remove" the ARC (ZFS needs a cache to function), but we can strictly limit its growth.

To change the kernel module parameter in real time, we echo the limit in bytes directly to it, e.g. 8 x 1024^3 = 8589934592 for 8 GiB. NB: the value does not technically have to be a multiple of 1024, but because the OS manages RAM in pages it makes sense to stick to whole gigabytes (1024 x 1024 x 1024 bytes each).

echo 8589934592 > /sys/module/zfs/parameters/zfs_arc_max

To see what the current value is (the one the kernel is actually using right now), we can cat that same location:

cat /sys/module/zfs/parameters/zfs_arc_max
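The byte value for zfs_arc_max can be calculated rather than typed by hand. A minimal sketch follows, with gib_to_bytes as our own helper name:

```shell
#!/usr/bin/env bash
# Convert a whole number of GiB to bytes (1 GiB = 1024 * 1024 * 1024 bytes).
# gib_to_bytes is a hypothetical helper, not part of ZFS.
gib_to_bytes() {
    echo $(( $1 * 1024 * 1024 * 1024 ))
}

gib_to_bytes 8     # the 8 GiB cap used on this page
gib_to_bytes 16

# Usage as root on the host:
#   gib_to_bytes 8 > /sys/module/zfs/parameters/zfs_arc_max
```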


The "Ephemeral" Nature

Because zfs_arc_max lives under /sys/ (a virtual filesystem that the kernel builds in RAM), the above change will be lost when Pear reboots. To make it permanent, we have to put the setting in a real configuration file on disk:

  • Create the file:
nano /etc/modprobe.d/zfs.conf
  • Write the option:
options zfs zfs_arc_max=8589934592
  • Rebuild the initramfs so the option is applied at the next boot:
update-initramfs -u

This ensures ZFS loads with this limit the moment the server turns on.


The L2ARC Index Requirement

Adding an L2ARC provides a high-speed cache for an array of ZFS HDDs and may boost read performance, but it does not come for free: the cache has to be indexed in RAM so that the system can check the contents of the cache before it searches the HDDs. On a standard setup, every 8KB block on the SSD requires roughly 70 to 80 bytes of system RAM for the index.

We have a 1 TB SSD acting as an L2ARC. 1TB is roughly 1,000,000,000,000 bytes; filled with 8KB blocks, it would hold ~125 million entries. 125,000,000 times 80 bytes is approximately 10GB of RAM. In our system the ARC would therefore need more than 10GB of RAM just to hold the index; with any less, the index crowds out the data the ARC should be caching and the L2ARC becomes counterproductive.

Because the cache is this expensive, it is important to check how much use it gets in real life. The arcstat command can show the size of the L2ARC in use and its hits:

arcstat -f l2size,l2hits,l2read
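The index estimate above can be reproduced with shell arithmetic. This is a rough sketch: l2arc_index_bytes is our own helper name, and the 80-byte header and 8KB block size are the assumptions taken from the text, not values read from the system.

```shell
#!/usr/bin/env bash
# Estimate the RAM needed to index a full L2ARC device:
#   (device_bytes / block_bytes) * header_bytes
# l2arc_index_bytes is a hypothetical helper, not a ZFS command.
l2arc_index_bytes() {
    local l2_bytes=$1 block_bytes=${2:-8192} header_bytes=${3:-80}
    echo $(( l2_bytes / block_bytes * header_bytes ))
}

l2arc_index_bytes 1000000000000 8000 80   # the rounded sum from the text
l2arc_index_bytes 1000000000000           # same device with exact 8192-byte blocks
l2arc_index_bytes 1000000000000 131072    # same device filled with 128 KiB blocks
```

The last line shows that the index shrinks in proportion as the cached blocks get larger, so the 10GB figure is a worst case for small blocks.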

Update: After reviewing the usage of the cache, we will now remove it. It is not doing any good and it uses some of the RAM, so in this case it is counterproductive.


Removing the L2ARC

Having decided to remove the L2ARC we need to identify the SSD using the command:

zpool status pearpool

The output shows the members of pearpool, including the Lexar SSD. Now we can run the following command to remove the SSD from pearpool:

zpool remove pearpool ata-Lexar_SSD_NQ100_960GB_QCT180R0006120S334

If we now run another status check we should see that pearpool consists of the three Ironwolf Pro HDDs and no cache. We could now shut down the Pear host and remove the SSD, or turn it into a usable intermediate-speed storage drive. We will add it as another storage pool with the command:

zpool create fastpool ata-Lexar_SSD_NQ100_960GB_QCT180R0006120S334

We can see the new pool fastpool along with the other two pools (rpool & pearpool) using the command:

zpool list

Now, to add it to the storage available to Pear, select Datacenter Seaoffate then Storage. From the "Add" dropdown select ZFS so that a dialogue appears, and fill in the options:

  • ID is fastpool (the name PVE will show for this storage)
  • ZFS Pool is fastpool (the zpool that backs it)
  • Content is Disk image, Container (so it can be used for both VMs and LXCs)
  • Enabled is set
  • Thin provision is set

Since this is an SSD, we should ensure deleted blocks are "Trimmed" so it stays fast. Since we are on a recent Proxmox version, we can enable the autotrim property on the pool:

zpool set autotrim=on fastpool

Managing Snapshots (The "Time Machine")

Snapshots allow you to freeze a Dataset or ZVOL in time. They are essential before making changes to a VM. Snapshots only consume space when the original data changes: if you delete 10GB of files but keep the snapshot, that 10GB is not freed until the snapshot is destroyed. Note that snapshots created in the CLI will not be shown on the dashboard, but snapshots made in the dashboard will be shown in both.

Create a snapshot for vm-127-disk-0 with the label ready_to_update:

zfs snapshot pearpool/PFP/pdata/VMHDS/vm-127-disk-0@ready_to_update

Delete:

zfs destroy pearpool/PFP/pdata/VMHDS/vm-127-disk-0@ready_to_update

List:

zfs list -t snapshot

This lists all the snapshots on the system, including pearpool/PFP/pdata/VMHDS/vm-127-disk-0@ready_to_update. To list only the snapshots on pearpool, combine -t snapshot (show only snapshots) with -r pearpool (recurse through pearpool only):

zfs list -t snapshot -r pearpool

If we only want to see snapshots for a specific VM (e.g., VM 127) on pearpool, we can give the exact path:

zfs list -t snapshot -r pearpool/PFP/pdata/VMHDS/vm-127-disk-0
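When a VM has several disks, or we do not remember the full dataset path, the snapshot list can be filtered by VM ID instead. This is a sketch under assumptions: snapshots_for_vm is our own helper name, and it relies on the Proxmox naming pattern vm-<id>-disk-<n> for VM disks (subvol-<id>-disk-<n> for containers).

```shell
#!/usr/bin/env bash
# Read snapshot names on stdin (one per line) and keep those belonging to
# the disks of one Proxmox guest, matched by its numeric ID.
# snapshots_for_vm is a hypothetical helper, not a ZFS command.
snapshots_for_vm() {
    local vmid=$1
    # The slash and the "-disk-" suffix stop ID 127 from matching 1270 or 12.
    grep -E "/(vm|subvol)-${vmid}-disk-[0-9]+@"
}

# Usage on a host with ZFS installed:
#   zfs list -H -o name -t snapshot -r pearpool | snapshots_for_vm 127
```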