Unified Monitoring Stack

From Sea of Fate

Revision as of 23:48, 22 February 2026

📖Introduction

Mango, located at 192.168.110.133 on the Infra network, is the unified successor to the Prometheus & Grafana and Victoria triad. It serves as the central hub for the Home Lab's observability. Mango natively scrapes metrics from all virtual machines, the Proxmox host (Pear), and their services, stores them in a high-performance VictoriaMetrics time-series database, and provides a Grafana interface for visualization.

By consolidating these services, we reduce network overhead and simplify the management of our monitoring infrastructure while maintaining 12-month data retention on a dedicated 500GB storage pool.

🚦Security & Network Architecture

Mango sits within the Infra network. Because it aggregates data from every host in the lab, it is a high-value target.

  • Web Interfaces: Grafana (Port 3000) and VictoriaMetrics VMUI (Port 8428) are restricted via pfSense to be accessible only from the MGT network (Cinnamon/Lemon).
  • Scraping Flow: Mango acts as the source for all scrape requests. pfSense rules must allow Mango to reach out to Production, VPN, and Terminal networks on specific exporter ports (9100, 9113, 9117, etc.).
  • Storage Pool: Data is stored on a dedicated 500GB virtual disk (PearPool), mounted at /mnt/metrics_data to ensure that metric growth never impacts the OS root partition.

🏛️Environment & Storage Setup

The VM was created using the Debian Gold Master template.

  • Hostname: Mango
  • IP/Gateway: 192.168.110.133 / 192.168.110.1
  • Disk 1 (OS): 32GB
  • Disk 2 (Data): 500GB (Added via Proxmox)

Storage Initialization

To handle the long-term metrics, the 500GB disk was initialized and mounted:

# Identify the data disk (sdb here), format it, and mount it
lsblk
sudo mkfs.ext4 /dev/sdb
sudo mkdir -p /mnt/metrics_data
sudo mount /dev/sdb /mnt/metrics_data
# Ensure persistence by appending the entry to /etc/fstab
echo '/dev/sdb  /mnt/metrics_data  ext4  defaults  0  2' | sudo tee -a /etc/fstab
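Device names such as /dev/sdb can change if disks are added or reordered in Proxmox, so a UUID-based fstab entry is often the safer choice. The following is a minimal sketch, assuming the same disk and mount point as above; the helper name fstab_entry is hypothetical and not part of the original build.

```shell
# Hypothetical helper: print an fstab line that mounts an ext4 filesystem by UUID.
# $1 = filesystem UUID, $2 = mount point
fstab_entry() {
    printf 'UUID=%s  %s  ext4  defaults  0  2\n' "$1" "$2"
}

# Example use on Mango (not run here; needs root and the real disk):
#   fstab_entry "$(sudo blkid -s UUID -o value /dev/sdb)" /mnt/metrics_data | sudo tee -a /etc/fstab
#   sudo mount -a && findmnt /mnt/metrics_data
```

Running mount -a followed by findmnt confirms that the entry parses and that the mount point is active before the next reboot.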

🔧Installation

⚡VictoriaMetrics Installation

VictoriaMetrics was installed as a native binary (not Docker) to replace both the Prometheus scraper and the Victoria storage VM.

  • User & Directory Setup
sudo useradd --no-create-home --shell /bin/false victoriametrics
sudo mkdir /etc/victoriametrics
sudo chown -R victoriametrics:victoriametrics /etc/victoriametrics /mnt/metrics_data
  • Binary Installation

Binaries were retrieved from the VictoriaMetrics GitHub.

wget https://github.com/VictoriaMetrics/VictoriaMetrics/releases/download/v1.xx.x/victoria-metrics-linux-amd64-v1.xx.x.tar.gz
tar -xvf victoria-metrics-linux-amd64-v1.xx.x.tar.gz
sudo mv victoria-metrics-prod /usr/local/bin/victoriametrics
sudo chown victoriametrics:victoriametrics /usr/local/bin/victoriametrics
  • Service Configuration
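sudo nano /etc/systemd/system/victoriametrics.service
[Unit]
Description=VictoriaMetrics
After=network.target
[Service]
User=victoriametrics
ExecStart=/usr/local/bin/victoriametrics \
  --storageDataPath=/mnt/metrics_data \
  --retentionPeriod=12 \
  --promscrape.config=/etc/victoriametrics/prometheus.yml \
  --httpListenAddr=0.0.0.0:8428
Restart=always
[Install]
WantedBy=multi-user.target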

Note: --retentionPeriod is measured in months when no unit suffix is given, so --retentionPeriod=12 keeps one year of history.

🔍Scraping Configuration (prometheus.yml)

VictoriaMetrics uses the standard Prometheus YAML format for its scraper. The file was copied from the older Prometheus host Pineapple to /etc/victoriametrics/prometheus.yml and then edited:

sudo nano /etc/victoriametrics/prometheus.yml

Key Change: The evaluation_interval directive was removed because the VictoriaMetrics single-binary scraper does not evaluate alerting or recording rules; that job belongs to vmalert.
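When migrating a config from an existing Prometheus host, the directive can be stripped with a one-line filter rather than by hand. This is only a sketch; the helper name and the input file name are hypothetical.

```shell
# Hypothetical helper: drop any evaluation_interval line from a
# Prometheus-style config read on stdin, leaving everything else untouched.
strip_evaluation_interval() {
    sed '/^[[:space:]]*evaluation_interval:/d'
}

# Example use (prometheus.yml.orig is a placeholder for the file copied from Pineapple):
#   strip_evaluation_interval < prometheus.yml.orig | sudo tee /etc/victoriametrics/prometheus.yml
```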

🧪Target Jobs

The configuration includes the legacy fleet plus the new 2026 additions:

  • Infrastructure: Mango (Self), CTNS1.
  • Production:
    • Reverse proxy (Nginx) Raisin
    • Webservers (Apache) Plum, Satsuma, Fig
    • Database server (MySQL) Mandarin
  • New 2026 Hosts: Blackcurrant (Data & Archive), Quince (AI/Media), Tayberry (OpenAlex).
  • Gaming: Apple & Cherry (Minecraft Servers).
  • Terminals:
    • (NoMachine) Kiwiberry
    • (XRDP) Kapok
    • (Windows, RDP) Wahoo/Walnut

Scrape Interval: Set to 120s to balance data resolution with disk I/O and longevity.
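To put the interval in perspective, the number of samples stored per series can be worked out directly. The sketch below is plain arithmetic; the 15s figure is an arbitrary shorter interval used for comparison and is not a value from this setup.

```shell
# Samples stored per series over a period, for a given scrape interval.
# $1 = scrape interval in seconds, $2 = number of days
samples_per_series() {
    echo $(( 86400 / $1 * $2 ))
}

samples_per_series 120 365   # one year at the 120s interval: prints 262800
samples_per_series 15 365    # one year at a 15s interval:    prints 2102400
```

At 120s each series produces 720 samples per day, one eighth of what a 15s interval would write over the same 12-month retention window.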

Adding the Dockge Targets

We had to update /etc/victoriametrics/prometheus.yml to include the Docker container targets:

  - job_name: 'docker_containers'
    static_configs:
      - targets:
          - 'quince.seaoffate.net:8080'     # cAdvisor for AI Stack
          - 'blackberry.seaoffate.net:8080' # cAdvisor for Data Archive
          - 'tayberry.seaoffate.net:8080'   # cAdvisor for OpenAlex

  - job_name: 'gpu_metrics'
    static_configs:
      - targets: ['quince.seaoffate.net:9835']
  - job_name: 'jellyfin'
    static_configs:
      - targets: ['quince.seaoffate.net:8082']

Target Agent Installation (Scrapers)

For Mango to collect data, each target VM must run a specific exporter. Most Linux hosts use the node_exporter for OS metrics, while application-specific exporters are used for Nginx, Apache, and MySQL.

Linux Node Exporter (Standard for all VMs)

Installed on all Linux hosts (Raisin, Plum, Satsuma, Apple, Cherry, etc.) to monitor CPU, RAM, and Disk. Any hosts that don't show on the targets webpage need to have the agent installed.

  • Install via APT
sudo apt update && sudo apt install -y prometheus-node-exporter
  • Enable and Start
sudo systemctl enable --now prometheus-node-exporter
  • Verification (Run on target VM)
curl http://localhost:9100/metrics
  • Firewall Requirement: Target VM must allow Inbound TCP 9100 from Mango (192.168.110.133).
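The raw /metrics output is long, so during verification it can help to pull out a single value. A minimal sketch follows; the helper name metric_value is hypothetical, and it only handles metrics without labels, such as node_exporter's node_load1.

```shell
# Hypothetical helper: print the value of an unlabelled metric from
# Prometheus exposition text read on stdin. Comment lines start with "#"
# in the first field, so they never match the metric name.
metric_value() {
    awk -v name="$1" '$1 == name { print $2 }'
}

# Example use on a target VM:
#   curl -s http://localhost:9100/metrics | metric_value node_load1
```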

Nginx (Raisin)

Used to monitor active connections and request rates. The Nginx exporter is a standalone binary that talks to Nginx's stub_status module.

  • Enable Nginx Status: On Raisin, edit the Nginx config (e.g., /etc/nginx/sites-available/default) and add this block:
server {
    listen 127.0.0.1:8080;
    location /metrics {
        stub_status on;
        access_log off;
        allow 127.0.0.1;
        deny all;
    }
}
sudo nginx -s reload
  • Install & Run Exporter
wget https://github.com/nginx/nginx-prometheus-exporter/releases/download/v0.11.0/nginx-prometheus-exporter_0.11.0_linux_amd64.tar.gz
tar -xvf nginx-prometheus-exporter_*.tar.gz
sudo mv nginx-prometheus-exporter /usr/local/bin/
  • Create a Systemd Service
sudo nano /etc/systemd/system/nginx-exporter.service

Paste this into the service file:

[Unit]
Description=Nginx Prometheus Exporter
After=network.target
[Service]
Type=simple
User=prometheus
ExecStart=/usr/local/bin/nginx-prometheus-exporter \
    -nginx.scrape-uri=http://127.0.0.1:8080/metrics
Restart=always
[Install]
WantedBy=multi-user.target

Enable the service with:

sudo systemctl enable --now nginx-exporter
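Before relying on the exporter, it is worth confirming that the stub_status endpoint configured above answers locally on Raisin. The sketch below extracts the active connection count from the stub_status text; the helper name is hypothetical.

```shell
# Hypothetical helper: print the active connection count from
# Nginx stub_status output read on stdin.
active_connections() {
    awk '/^Active connections:/ { print $3 }'
}

# Example use on Raisin (the URI matches the server block above):
#   curl -s http://127.0.0.1:8080/metrics | active_connections
```

The exporter itself listens on port 9113 by default, which is the port Mango scrapes.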

MySQL (Mandarin)

Used to monitor query throughput and database health.

  • Database User: Create a mysqld_exporter user in MySQL with PROCESS, REPLICATION CLIENT, SELECT privileges.
  • Configuration: Store credentials in /etc/.mysqld_exporter.cnf.
  • Service: Install prometheus-mysqld-exporter via APT.
  • Port: TCP 9104
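The database user step can be sketched as follows. This assumes the exporter runs on Mandarin itself and connects as 'mysqld_exporter'@'localhost'; the helper name and the password are placeholders, not values from the original build.

```shell
# Hypothetical helper: print the SQL that creates the exporter account
# with the privileges listed above.
# $1 = password for the account (placeholder; choose your own)
exporter_grants_sql() {
    cat <<SQL
CREATE USER 'mysqld_exporter'@'localhost' IDENTIFIED BY '$1';
GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'mysqld_exporter'@'localhost';
SQL
}

# Example use on Mandarin:
#   exporter_grants_sql 'change-me' | sudo mysql
```

The same user name and password then go into the [client] section of the credentials file.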

Apache (Plum, Fig, Satsuma)

  • Enable Mod Status:
sudo a2enmod status
  • Install Exporter:
sudo apt install prometheus-apache-exporter
  • Port: TCP 9117

Windows Exporter (Wahoo & Walnut)

For the Windows 11 desktops, we use the windows_exporter.

  • Download: Latest .msi from the Prometheus Community GitHub.
  • Install: Run the installer; it defaults to port 9182.
  • Firewall: The installer typically adds a "Windows Firewall" exception automatically.

Docker & Container App Exporters

Since we are using Dockge to manage our containers on hosts like Quince (AI), Blackcurrant (Archive), and Tayberry (OpenAlex), we should standardize how metrics are pulled from these environments. The most efficient way to do this is to add cAdvisor to each of our Dockge stacks. This allows Mango to "see" inside the Docker engine of that specific VM and report on the health of every individual container (Ollama, Jellyfin, etc.).

Docker Container Monitoring (The Dockge Layer)

For every VM running Dockge, you need to add a Monitoring Stack or add these services to your existing stacks. cAdvisor is the primary agent here; it scrapes resource usage from the Docker socket.

  • Create a "Monitoring" Stack in Dockge: In the Dockge UI, create a new stack and use the following:
    • docker-compose.yml
version: "3.8"
services:
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.47.0 # Use a version compatible with your kernel
    container_name: cadvisor
    restart: unless-stopped
    privileged: true
    ports:
      - "8080:8080"
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:rw
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
  # OPTIONAL: Only for Quince (AI Host) to monitor NVIDIA GPUs
  nv-exporter:
    image: utkuozdemir/nvidia_gpu_exporter
    container_name: nvidia_exporter
    restart: unless-stopped
    ports:
      - "9835:9835"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

Port Summary for Dockge Hosts (these ports must be opened in pfSense so that Mango can scrape the exporters):

  • 8080: cAdvisor (Container CPU/RAM/Network)
  • 9835: NVIDIA Exporter (GPU VRAM/Temp - Quince only)

Proxmox Host (Pear)

To monitor the physical hardware and ZFS pools:

  • Node Exporter: Installed directly on the Proxmox Debian host.
  • SMART Metrics: Use the smartctl_exporter_script.sh (as detailed in legacy notes) to pipe drive health into the node_exporter's textfile collector.

Post-Installation Validation on Mango

After installing an agent on a target, confirm Mango sees it:

  • Open VMUI: http://mango:8428/targets.
  • Search for the hostname.
  • Status must be "UP". If "Connection Refused," check the service on the target; if "Timeout," check the pfSense rules.
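The same distinction can be reproduced from Mango's shell with curl, whose exit code 7 means the connection failed (typically refused) and 28 means the operation timed out. A minimal sketch, with a hypothetical helper name and a placeholder target host:

```shell
# Hypothetical helper: map a curl exit code to the likely cause described above.
# $1 = curl exit code
diagnose_curl_exit() {
    case "$1" in
        0)  echo "reachable" ;;
        7)  echo "connection failed: check the exporter service on the target" ;;
        28) echo "timeout: check pfSense rules" ;;
        *)  echo "other curl error ($1)" ;;
    esac
}

# Example use from Mango (host name and port are placeholders):
#   curl -s -o /dev/null --max-time 5 http://raisin.seaoffate.net:9100/metrics; diagnose_curl_exit $?
```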

📈 Grafana Installation

Grafana was installed on the same Mango host to provide the local visualization layer.

  • Repository & App Setup
sudo apt install -y apt-transport-https software-properties-common wget
wget -q -O - https://apt.grafana.com/gpg.key | gpg --dearmor | sudo tee /usr/share/keyrings/grafana.gpg > /dev/null
echo "deb [signed-by=/usr/share/keyrings/grafana.gpg] https://apt.grafana.com stable main" | sudo tee /etc/apt/sources.list.d/grafana.list
sudo apt update && sudo apt install grafana -y
sudo systemctl enable --now grafana-server

🧩Network & Firewall Rules (pfSense)

To allow the new ports to function, pfSense was updated with a new Alias for Monitoring Ports:

  • 3000: Grafana UI (remains the same from the previous Grafana installation)
  • 8428: VictoriaMetrics UI/API (new port, used to view scrape status as the Prometheus web GUI did before)
  • 9090: Removed (the old Prometheus web GUI port)

Critical Rules

  • MGT -> Mango: Allow ports 3000 & 8428 (Access from Cinnamon or other management console).
  • Mango -> All Networks:
    • Allow port 9100 (Node)
    • Allow port 9113 (Nginx)
    • Allow port 9117 (Apache)
    • Allow port 9104 (MySQL)
    • Allow port 9182 (Windows)

🔦Verification Steps

  • Check Dockge: Ensure the cadvisor container shows as "Green/Running" in the Dockge UI.
  • Service Status: (Confirmed Active/Running).
sudo systemctl status victoriametrics

Targets Check: Verify that all hosts are Green/UP. The new entries for ports 8080, 9835, etc. should appear for the Docker container targets.

http://mango:8428/targets

Data Source: In Grafana, a Prometheus data source was added pointing to:

http://localhost:8428

Disk Write Check: Confirm that samples are being written to the PearPool disk; the reported size should grow over time.

du -sh /mnt/metrics_data


Summary of Legacy Retirement

With Mango fully operational:

  • Pineapple (.130) services stopped.
  • Granadilla (.131) services stopped.
  • Victoria (.132) services stopped.
  • Lychee identified as legacy and marked for rebuild via new Gold Master Template.

Build Complete: February 22, 2026