The Web Archive (ArchiveBox)
Wikisailor (talk | contribs)
Revision as of 08:29, 9 February 2026
📖 Introduction
ArchiveBox is a self-hosted web archiving solution. Unlike a simple bookmark, it takes a "snapshot" of a page in multiple formats so that if the original site goes down, you still have the full content.
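- The Outputs: For every URL you save, ArchiveBox can create several formats, including a PDF, a Screenshot (PNG), a Single-File HTML, and a Wget clone. The stack on this page enables only the lightweight, text-oriented outputs.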
- The Goal: To build a searchable, permanent record of the specific web resources you use for research, separate from the broad scale of OpenAlex.
- Synergy: Use OpenAlex to find a paper, use Kiwix for general encyclopedia background, and use ArchiveBox to save the specific blog posts or project wikis that support your work.
💾 The Infrastructure
ArchiveBox is heavy on disk I/O and storage, which is why it gets the dedicated 5TB drive.
- Host: Blackberry (Proxmox VM)
- Compute: Uses the same 4 Cores / 6GB RAM as the rest of the stack.
- Storage: 5TB XFS disk mounted at /mnt/archive_data/archivebox, plus an additional 4TB XFS disk mounted at /mnt/docker_data for use with The Kiwix Archive.
Note: ArchiveBox can grow very fast (approx. 1GB per 1000 articles).
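As a rough sanity check on that growth figure, the sketch below turns the 1GB per 1000 articles rule of thumb into a capacity estimate. The function name estimate_articles is made up for illustration, and 5TB is treated as 5000GB; real usage depends on which outputs are enabled.

```shell
# Rough capacity estimate from the ~1 GB per 1000 articles rule of thumb.
# Usage: estimate_articles <free_space_in_GB>
estimate_articles() {
  gb="$1"
  per_gb=1000   # articles per GB, taken from the note above (an approximation)
  echo $((gb * per_gb))
}

# 5 TB treated as 5000 GB (decimal units)
estimate_articles 5000   # prints 5000000
```

By this estimate the 5TB disk has room for roughly five million articles.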
🐋 The Software Stack (Docker)
Installing Docker & Compose
Before installing Dockge, we must install the Docker Engine and the Compose plugin from Docker's official Debian repository.
# Update and install dependencies
sudo apt update && sudo apt install -y ca-certificates curl gnupg
# Add Docker’s official GPG key
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL https://download.docker.com/linux/debian/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc
# Add the repository to Apt sources
echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/debian \
  $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
  sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
# Install Docker Engine and Compose Plugin
sudo apt update
sudo apt install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
# Optional: Allow your user to run docker without sudo
sudo usermod -aG docker $USER
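Once the packages are installed, it helps to confirm that both the engine and the Compose plugin respond before moving on. This is a minimal sketch; require_cmd is a helper name invented here, not part of Docker.

```shell
# Print "ok: <cmd>" if the command is on PATH, otherwise "missing: <cmd>".
require_cmd() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "ok: $1"
  else
    echo "missing: $1"
    return 1
  fi
}

# Only query versions when the docker binary is actually present.
if require_cmd docker; then
  docker --version                                               # Docker Engine client version
  docker compose version || echo "compose plugin not found"     # Compose v2 plugin check
fi
```

If you added your user to the docker group, log out and back in first so the new group membership takes effect.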
🛠️ Installing Dockge
Dockge allows us to manage our "Stacks" (Docker Compose files) through a clean web interface.
# Preparation: Create directories
mkdir -p /opt/stacks /opt/dockge
cd /opt/dockge
# Download and Start Dockge
curl https://raw.githubusercontent.com/louislam/dockge/master/compose.yaml --output compose.yaml
docker compose up -d
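After starting Dockge, the container can take a few seconds to come up. The sketch below polls a URL until it answers. wait_for_http is a hypothetical helper, and port 5001 in the usage line is an assumption based on Dockge's default compose file, so check the downloaded compose.yaml if it differs.

```shell
# Poll a URL once per second until it responds or the attempts run out.
# Usage: wait_for_http <url> [max_tries]
wait_for_http() {
  url="$1"
  tries="${2:-30}"
  i=0
  while [ "$i" -lt "$tries" ]; do
    # -f makes curl fail on HTTP errors; output is discarded, only the exit code matters
    if curl -fsS -o /dev/null "$url"; then
      echo "up: $url"
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  echo "timeout: $url" >&2
  return 1
}

# Usage (port 5001 is an assumption, see above):
#   wait_for_http http://localhost:5001 30
```

The same helper can be pointed at the ArchiveBox UI on port 8082 once that stack is deployed.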
🛠️ Preparation: Storage Folders
Create the ArchiveBox data directory on the 5TB disk so the container can find it easily.
# Create the directory on the 5TB Archive disk
mkdir -p /mnt/archive_data/archivebox/data
# Set permissions so Docker can initialize the DB
sudo chown -R 911:911 /mnt/archive_data/archivebox/data
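Before initializing, it is worth confirming that the ownership change actually landed on the data directory. The following is a small sketch using GNU stat (as shipped with Debian); owner_of and check_owner are illustrative names, not ArchiveBox tooling.

```shell
# Print "uid:gid" of a path (GNU stat syntax).
owner_of() {
  stat -c '%u:%g' "$1"
}

# Compare the actual owner of a path against an expected "uid:gid".
check_owner() {
  path="$1"
  want="$2"
  got="$(owner_of "$path")" || return 1
  if [ "$got" = "$want" ]; then
    echo "ok: $path is owned by $got"
  else
    echo "mismatch: $path is owned by $got, expected $want"
    return 1
  fi
}

# Usage: check_owner /mnt/archive_data/archivebox/data 911:911
```

Running df -h /mnt/archive_data/archivebox alongside this also confirms the directory sits on the 5TB disk rather than the root filesystem.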
Initializing the Archive
Before running the Web UI, ArchiveBox needs to build its internal database. Run this one-time command:
docker run -v /mnt/archive_data/archivebox/data:/data -it archivebox/archivebox init --setup
(Follow the prompts to create your admin username and password.)
📄 ArchiveBox YAML (The Stack)
Deploy this in Dockge on Blackberry and name it archivebox:
version: "3.9"
services:
  archivebox:
    image: archivebox/archivebox:latest
    container_name: archivebox
    ports:
      - "8082:8000"
    volumes:
      # Must match the host path used for the "init --setup" step
      - /mnt/archive_data/archivebox/data:/data
    environment:
      - ALLOW_EXTERNAL_CONFIG=True
      - ADMIN_USERNAME=nigel
      - ADMIN_PASSWORD=blackberry_archive_2026
      - PUBLIC_ADD_VIEW=True
      - ALLOWED_HOSTS=*
      - REST_API_ENABLED=True
      - REST_API_USER=nigel
      - REST_API_PASS=netgear1
      - CSRF_TRUSTED_ORIGINS=http://192.168.100.85:5678,http://192.168.100.85:8082
      - CSRF_COOKIE_SECURE=False
      - CSRF_IGNORE_PORT=True
      # --- THE LEAN LIST (Enabled) ---
      - SAVE_MARKDOWN=True # 🏆 THE BEST for AnythingLLM
      - SAVE_MERCURY=True # 📖 Clean article text
      - SAVE_SINGLEFILE=True # 📱 Best for viewing on phone
      - SAVE_DOM=True # 🏗️ Good fallback for LLM context
      # --- THE HEAVY LIST (Disabled) ---
      - SAVE_SCREENSHOT=False # 🖼️ High disk usage, zero LLM value
      - SAVE_PDF=False # 📄 Hard for LLMs to parse, high disk usage
      - SAVE_WARC=False # 📦 Massive container files for librarians
      - SAVE_MEDIA=False # 🎬 No videos/audio (saves TBs of space)
      - SAVE_GIT=False # 💻 No full code repo clones
      - SAVE_ARCHIVE_DOT_ORG=False # 🌐 No external submission
      - SAVE_HEADERS=False # 📡 No technical log files
    restart: unless-stopped
networks: {}
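With the stack running, URLs can also be queued from the host shell instead of the Web UI. The sketch below wraps the archivebox add command inside the container named archivebox. archive_url and the DRY_RUN switch are hypothetical conveniences added here, and the --user=archivebox flag assumes the image's default unprivileged user.

```shell
# Queue one URL in the running "archivebox" container.
# Set DRY_RUN=1 to print the command instead of executing it.
archive_url() {
  url="$1"
  # Accept only http(s) URLs so typos do not reach the container
  case "$url" in
    http://*|https://*) ;;
    *) echo "refusing non-http(s) URL: $url" >&2; return 2 ;;
  esac
  # Build the command as positional parameters so the URL stays one argument
  set -- docker exec --user=archivebox archivebox archivebox add "$url"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Usage:
#   DRY_RUN=1 archive_url https://example.com   # show the command only
#   archive_url https://example.com             # actually archive the page
```

The Web UI remains available at port 8082 on the host, as mapped in the stack.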