The Web Archive (ArchiveBox)

From Sea of Fate

📖 Introduction

ArchiveBox is a self-hosted web archiving solution. Unlike a simple bookmark, it takes a "snapshot" of a page in multiple formats so that if the original site goes down, you still have the full content.

  • The Outputs: For each URL you save, it can create a PDF, a screenshot (PNG), a single-file HTML copy, and a Wget clone, depending on which extractors are enabled.
  • The Goal: To build a searchable, permanent record of the specific web resources you use for research, separate from the broad scale of OpenAlex.
  • Synergy: Use OpenAlex to find a paper, use Kiwix for general encyclopedia background, and use ArchiveBox to save the specific blog posts or project wikis that support your work.

🎯 The Best Format

Markdown and clean HTML are the best formats to store. For a researcher, Markdown is the more useful of the two: it is readable by humans in any text editor (on a phone or PC), and it is plain text that LLMs can ingest directly.

Format | Phone/PC Viewing | LLM Readability | Notes
Markdown (.md) | ✅ Excellent (any text editor) | ⭐ Best | Smallest file size, zero "noise" (no ads/scripts), perfect for AI context.
Single-File HTML | ✅ Excellent (any browser) | ✅ High | Preserves images and layout for you, while the LLM can still parse the text.
PDF | ✅ Good | ❌ Poor | Easy for you to read, but hard for AI to parse (tables and columns break).

💾 The Infrastructure

ArchiveBox is heavy on disk I/O and storage, which is why it gets the dedicated 5TB drive.

  • Host: Blackberry (Proxmox VM)
  • Compute: Uses the same 4 Cores / 6GB RAM as the rest of the stack.
  • Storage: 5TB XFS disk mounted at /mnt/archive_data/archivebox, plus an additional 4TB XFS disk mounted at /mnt/docker_data for use with The Kiwix Archive.

Note: ArchiveBox can grow very fast (approx. 1GB per 1000 articles).
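To make the growth note concrete, here is a minimal sizing sketch. It assumes the rough figure above (about 1GB per 1000 articles) holds for your mix of pages; the function names estimate_gb and articles_for_gb are illustrative only, and the 5TB disk is taken as 5000GB.

```shell
#!/bin/sh
# Assumption taken from the note above: roughly 1 GB per 1000 archived articles.
GB_PER_1000=1

# estimate_gb N: print the approximate GB needed for N articles (rounded up).
estimate_gb() {
    echo $(( ($1 * GB_PER_1000 + 999) / 1000 ))
}

# articles_for_gb G: print roughly how many articles fit in G gigabytes.
articles_for_gb() {
    echo $(( $1 * 1000 / GB_PER_1000 ))
}

estimate_gb 25000       # 25,000 articles -> prints 25
articles_for_gb 5000    # a 5 TB disk     -> prints 5000000
```

Real usage depends heavily on which extractors are enabled, so treat the result as an order-of-magnitude guide rather than a quota.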

πŸ‹ The Software Stack (Docker)

Installing Docker & Compose

Before installing Dockge, install Docker Engine and the Compose plugin from Docker's official Debian repository.

# Update and install dependencies
sudo apt update && sudo apt install -y ca-certificates curl gnupg
# Add Docker's official GPG key
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL https://download.docker.com/linux/debian/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc
# Add the repository to Apt sources
echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/debian \
  $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
  sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
# Install Docker Engine and Compose Plugin
sudo apt update
sudo apt install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
# Optional: allow your user to run docker without sudo (log out and back in for the group change to take effect)
sudo usermod -aG docker "$USER"
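After the install, a quick sanity check can confirm that the tools are on the PATH before moving on to Dockge. The sketch below is one way to do that; the helper names have and report are made up for this example, and the version commands only run when docker is actually present.

```shell
#!/bin/sh
# have CMD: succeed if CMD is found on the PATH.
have() {
    command -v "$1" >/dev/null 2>&1
}

# report CMD...: print one "ok:" or "missing:" line per command.
report() {
    for tool in "$@"; do
        if have "$tool"; then
            echo "ok: $tool"
        else
            echo "missing: $tool"
        fi
    done
}

report docker curl

# Only query versions when docker is installed.
if have docker; then
    docker --version
    docker compose version
fi
```

If docker is reported as missing right after the install, re-check the apt steps above before continuing.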

πŸ› οΈ Installing Dockge

Dockge allows us to manage our "Stacks" (Docker Compose files) through a clean web interface.

# Preparation: Create directories
mkdir -p /opt/stacks /opt/dockge
cd /opt/dockge
# Download and Start Dockge
curl https://raw.githubusercontent.com/louislam/dockge/master/compose.yaml --output compose.yaml
docker compose up -d

πŸ› οΈ Preparation: Storage Folders

Organize The ZIM files on the 5TB disk so the container can find them easily.

# Create the directory on the 5TB Archive disk
sudo mkdir -p /mnt/archive_data/archivebox/data
# Set permissions so Docker can initialize the DB
sudo chown -R 911:911 /mnt/archive_data/archivebox/data
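Before initializing the archive, it can help to confirm that the ownership change took effect. The following is a small sketch of such a check; it relies on GNU stat (as shipped on Debian), and the helper names owner_of and check_owner are hypothetical.

```shell
#!/bin/sh
# owner_of PATH: print the numeric owner as "uid:gid" (GNU stat syntax).
owner_of() {
    stat -c '%u:%g' "$1"
}

# check_owner PATH EXPECTED: compare the owner with EXPECTED (e.g. 911:911).
check_owner() {
    actual=$(owner_of "$1") || return 2
    if [ "$actual" = "$2" ]; then
        echo "ok: $1 is owned by $actual"
    else
        echo "mismatch: $1 is owned by $actual, expected $2"
        return 1
    fi
}

# Path and uid:gid are the ones used in the commands above.
DATA_DIR=/mnt/archive_data/archivebox/data
if [ -d "$DATA_DIR" ]; then
    check_owner "$DATA_DIR" 911:911
fi
```

A mismatch usually means the chown was run against a different path or the disk was not mounted at the time.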

Initializing the Archive

Before starting the Web UI, let ArchiveBox build its internal database. Run this one-time command:

docker run -v /mnt/archive_data/archivebox/data:/data -it archivebox/archivebox init --setup

(Follow the prompts to create your admin username and password.)

📄 ArchiveBox YAML (The Stack)

Deploy this in Dockge on Blackberry and name it archivebox:

version: "3.9"
services:
  archivebox:
    image: archivebox/archivebox:latest
    container_name: archivebox
    ports:
      - 8082:8000
    volumes:
      - /mnt/archive_data/archivebox/data:/data
    environment:
      - ALLOW_EXTERNAL_CONFIG=True
      - ADMIN_USERNAME=nigel
      - ADMIN_PASSWORD=blackberry_archive_2026
      - PUBLIC_ADD_VIEW=True
      - ALLOWED_HOSTS=*
      - REST_API_ENABLED=True
      - REST_API_USER=nigel
      - REST_API_PASS=netgear1
      - CSRF_TRUSTED_ORIGINS=http://192.168.100.85:5678,http://192.168.100.85:8082
      - CSRF_COOKIE_SECURE=False
      - CSRF_IGNORE_PORT=True
      # --- THE LEAN LIST (Enabled) ---
      - SAVE_MARKDOWN=True # 🏆 THE BEST for AnythingLLM
      - SAVE_MERCURY=True # 📖 Clean article text
      - SAVE_SINGLEFILE=True # 📱 Best for viewing on phone
      - SAVE_DOM=True # 🏗️ Good fallback for LLM context
      # --- THE HEAVY LIST (Disabled) ---
      - SAVE_SCREENSHOT=False # 🖼️ High disk usage, zero LLM value
      - SAVE_PDF=False # 📄 Hard for LLMs to parse, high disk usage
      - SAVE_WARC=False # 📦 Massive container files for librarians
      - SAVE_MEDIA=False # 🎬 No videos/audio (saves TBs of space)
      - SAVE_GIT=False # 💻 No full code repo clones
      - SAVE_ARCHIVE_DOT_ORG=False # 🌐 No external submission
      - SAVE_HEADERS=False # 📡 No technical log files
    restart: unless-stopped
networks: {}

🌐 Accessing and Using the Archive

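Tool | URL | Purpose
Web UI | http://blackberry:8082 | Browse, search, and add new URLs to your archive.
Admin Panel | http://blackberry:8082/admin | Manage users and deep configuration.

The stack maps host port 8082 to the container's port 8000, so 8082 is the port to use from other machines.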

🚀 How to Add Content

  • Web UI: Click the "Add +" button at the top right and paste your URLs.
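  • CLI (Terminal): If you have a text file of URLs (one per line), you can feed it to the running container. Run the command as the archivebox user, because docker exec otherwise runs as root inside the container:
docker exec -i --user=archivebox archivebox archivebox add < my_links.txt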
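Link files collected over time tend to pick up blank lines, notes, and duplicates. The sketch below shows one possible way to tidy such a file before handing it to ArchiveBox; clean_urls is a hypothetical helper, not part of ArchiveBox, and the final docker command (left commented out) assumes the container is named archivebox and has an archivebox user.

```shell
#!/bin/sh
# clean_urls: read lines on stdin, trim surrounding whitespace, keep only
# http(s) URLs, and drop duplicates while preserving first-seen order.
clean_urls() {
    sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' |
        grep -E '^https?://' |
        awk '!seen[$0]++'
}

# Demo on inline data: prints the two unique URLs, one per line.
printf '%s\n' \
    'https://example.com/a' \
    '# a note to self' \
    '' \
    'https://example.com/a' \
    'http://example.org/b' | clean_urls

# Possible use with a real file (assumption: container and user names as above):
# clean_urls < my_links.txt | docker exec -i --user=archivebox archivebox archivebox add
```

Filtering first keeps comment lines and stray text out of the import, so fewer failed snapshots end up in the index.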

📋 Complete Blackberry Port Map

Port | Name | Purpose
5001 | Dockge | Docker Management
8082 | ArchiveBox | Personal Web Archive (host port; the container listens on 8000)
8081 | Kiwix | Offline Wikipedia/Encyclopedias
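As more stacks are added to Blackberry, a port map like this can be checked for collisions before deploying. The sketch below is a hypothetical helper (duplicate_ports) that reads "PORT NAME" lines; the ArchiveBox entry uses the host side of the stack's 8082:8000 mapping.

```shell
#!/bin/sh
# duplicate_ports: read "PORT NAME" lines on stdin and print every port
# that appears more than once, in numeric order. Blank lines are ignored.
duplicate_ports() {
    awk 'NF { count[$1]++ } END { for (p in count) if (count[p] > 1) print p }' |
        sort -n
}

# Demo: the current map has no collisions, so this prints nothing.
printf '%s\n' \
    '5001 Dockge' \
    '8082 ArchiveBox' \
    '8081 Kiwix' | duplicate_ports
```

Any port printed by the helper is claimed by two services and needs a new host-side mapping in one of the stacks.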