The Web Archive (ArchiveBox)

From Sea of Fate

Revision as of 08:51, 9 February 2026

📖 Introduction

ArchiveBox is a self-hosted web archiving solution. Unlike a simple bookmark, it takes a "snapshot" of a page in multiple formats so that if the original site goes down, you still have the full content.

  • The Outputs: By default, every URL you save produces a PDF, a screenshot (PNG), a single-file HTML page, and a Wget clone. The stack on this page disables the heavier formats.
  • The Goal: To build a searchable, permanent record of the specific web resources you use for research, separate from the broad scale of OpenAlex.
  • Synergy: Use OpenAlex to find a paper, use Kiwix for general encyclopedia background, and use ArchiveBox to save the specific blog posts or project wikis that support your work.

🎯 The Best Format

Markdown & Clean HTML are the best formats to store. For a researcher, Markdown is the king of formats because it is readable by humans (on a phone or PC) and is the native "language" of LLMs.

Format | Phone/PC Viewing | LLM Readability | Why it's a winner
Markdown (.md) | ✅ Excellent (any text editor) | ⭐ Best | Smallest file size, zero "noise" (no ads/scripts), perfect for AI context.
Single-File HTML | ✅ Excellent (any browser) | ✅ High | Preserves images and layout for you, while the LLM can still parse the text.
PDF | ✅ Good | ❌ Poor | Great for you to look at, but a nightmare for AI to read (tables and columns break).

Our Strategy:

  • ArchiveBox saves each page in multiple formats simultaneously. When you point AnythingLLM to your archive, tell it to look specifically at the article.md or singlefile.html versions to save on "token" costs and improve accuracy.
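
As a rough sketch of that strategy, the function below lists only the LLM-friendly files under an archive folder so they can be handed to AnythingLLM. The function name list_llm_files is hypothetical, the file names come from the bullet above, and the example path assumes the data folder used later on this page.

```shell
# List the LLM-friendly outputs under an ArchiveBox data directory.
# File names follow the strategy above; the directory layout is an assumption.
list_llm_files() {
  find "$1" -type f \( -name 'article.md' -o -name 'singlefile.html' \) | sort
}

# Example (path assumed from the storage section of this page):
# list_llm_files /mnt/archive_data/archivebox/data/archive
```

Filtering by file name keeps PDFs, screenshots, and other heavy outputs out of the LLM's context even if they are enabled later.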

💾 The Infrastructure

ArchiveBox is heavy on disk I/O and storage, which is why it gets the dedicated 5TB drive.

  • Host: Blackberry (Proxmox VM)
  • Compute: Uses the same 4 Cores / 6GB RAM as the rest of the stack.
  • Storage: 5TB XFS disk mounted at /mnt/archive_data/archivebox, plus an additional 4TB XFS disk mounted at /mnt/docker_data for use with The Kiwix Archive.

Note: ArchiveBox can grow very fast (approx. 1GB per 1000 articles).
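
To put that rule of thumb in perspective, here is a back-of-the-envelope estimate of how many articles fit on a disk. It is only a sketch: the function name estimate_articles is hypothetical, it assumes decimal units (1 TB = 1000 GB), and real usage depends on which formats are enabled.

```shell
# Rough capacity estimate from the rule of thumb above (1 GB per 1000 articles).
# Assumes decimal units: 1 TB = 1000 GB, so 1 TB holds about 1,000,000 articles.
estimate_articles() {
  echo $(( $1 * 1000 * 1000 ))
}

estimate_articles 5   # 5 TB disk, prints 5000000
```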

πŸ‹ The Software Stack (Docker)

Installing Docker & Compose

Before installing Dockge, install the Docker engine and the Compose plugin from Docker's official repository for Debian.

# Update and install dependencies
sudo apt update && sudo apt install -y ca-certificates curl gnupg
# Add Docker's official GPG key
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL https://download.docker.com/linux/debian/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc
# Add the repository to Apt sources
echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/debian \
  $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
  sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
# Install Docker Engine and Compose Plugin
sudo apt update
sudo apt install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
# Optional: Allow your user to run docker without sudo (log out and back in for it to take effect)
sudo usermod -aG docker "$USER"
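
Before moving on, it can help to confirm that the tools are actually on the PATH. The helper below is a minimal sketch; check_tools is a hypothetical name, not part of Docker.

```shell
# Report whether each named tool is on PATH; return non-zero if any is missing.
check_tools() {
  missing=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "ok: $tool"
    else
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}

# Example: confirm the prerequisites, then ask the Compose plugin for its version.
# check_tools docker curl && docker compose version
```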

πŸ› οΈ Installing Dockge

Dockge allows us to manage our "Stacks" (Docker Compose files) through a clean web interface.

# Preparation: Create directories
mkdir -p /opt/stacks /opt/dockge
cd /opt/dockge
# Download and Start Dockge
curl https://raw.githubusercontent.com/louislam/dockge/master/compose.yaml --output compose.yaml
docker compose up -d

πŸ› οΈ Preparation: Storage Folders

Create the ArchiveBox data folder on the 5TB disk so the container can find it easily.

# Create the directory on the 5TB Archive disk
mkdir -p /mnt/archive_data/archivebox/data
# Set permissions so Docker can initialize the DB
sudo chown -R 911:911 /mnt/archive_data/archivebox/data
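
A quick way to confirm the ownership change took effect is to read back the numeric owner of the folder. This is a sketch only; owner_of and check_owner are hypothetical helper names, and the expected 911:911 comes from the chown command above.

```shell
# Print "uid:gid" of a path using numeric IDs (parses ls -n output).
owner_of() {
  ls -ldn "$1" | awk '{print $3 ":" $4}'
}

# Succeed only if the path is owned by the expected uid:gid.
check_owner() {
  [ "$(owner_of "$1")" = "$2" ]
}

# Example:
# check_owner /mnt/archive_data/archivebox/data 911:911 && echo "ownership ok"
```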

Initializing the Archive

Before running the Web UI, ArchiveBox needs to build its internal database. Run this one-time command:

docker run -v /mnt/archive_data/archivebox/data:/data -it archivebox/archivebox init --setup

(Follow the prompts to create your admin username and password.)

📄 ArchiveBox YAML (The Stack)

Deploy this in Dockge on Blackberry and name it archivebox:

version: "3.9"
services:
  archivebox:
    image: archivebox/archivebox:latest
    container_name: archivebox
    ports:
      - 8082:8000
    volumes:
      - /mnt/archive_data/archivebox/data:/data
    environment:
      - ALLOW_EXTERNAL_CONFIG=True
      - ADMIN_USERNAME=nigel
      - ADMIN_PASSWORD=blackberry_archive_2026
      - PUBLIC_ADD_VIEW=True
      - ALLOWED_HOSTS=*
      - REST_API_ENABLED=True
      - REST_API_USER=nigel
      - REST_API_PASS=netgear1
      - CSRF_TRUSTED_ORIGINS=http://192.168.100.85:5678,http://192.168.100.85:8082
      - CSRF_COOKIE_SECURE=False
      - CSRF_IGNORE_PORT=True
      # --- THE LEAN LIST (Enabled) ---
      - SAVE_MARKDOWN=True # 🏆 THE BEST for AnythingLLM
      - SAVE_MERCURY=True # 📖 Clean article text
      - SAVE_SINGLEFILE=True # 📱 Best for viewing on phone
      - SAVE_DOM=True # 🏗️ Good fallback for LLM context
      # --- THE HEAVY LIST (Disabled) ---
      - SAVE_SCREENSHOT=False # 🖼️ High disk usage, zero LLM value
      - SAVE_PDF=False # 📄 Hard for LLMs to parse, high disk usage
      - SAVE_WARC=False # 📦 Massive container files for librarians
      - SAVE_MEDIA=False # 🎬 No videos/audio (saves TBs of space)
      - SAVE_GIT=False # 💻 No full code repo clones
      - SAVE_ARCHIVE_DOT_ORG=False # 🌐 No external submission
      - SAVE_HEADERS=False # 📡 No technical log files
    restart: unless-stopped
networks: {}

🌐 Accessing and Using the Archive

Tool | URL | Purpose
Web UI | http://blackberry:8082 | Browse, search, and add new URLs to your archive.
Admin Panel | http://blackberry:8082/admin | Manage users and deep configuration.

🚀 How to Add Content

  • Web UI: Click the "Add +" button at the top right and paste your URLs.
  • CLI (Terminal): If you have a text file of URLs, you can pipe them in:
cat my_links.txt | docker exec -i archivebox archivebox add
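
Link lists collected by hand often contain blank lines, notes, and duplicates. The sketch below filters those out before piping; clean_urls is a hypothetical helper, and treating lines that start with "#" as comments is an assumption about how the list is kept.

```shell
# Print only usable URL lines: drop blanks and "#" comments, then de-duplicate.
clean_urls() {
  grep -Ev '^[[:space:]]*(#|$)' "$1" | sort -u
}

# Example, feeding the cleaned list to the same command as above:
# clean_urls my_links.txt | docker exec -i archivebox archivebox add
```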

📋 Complete Blackberry Port Map

Port | Name | Purpose
5001 | Dockge | Docker Management
8082 | ArchiveBox | Personal Web Archive
8081 | Kiwix | Offline Wikipedia/Encyclopedias