The Web Archive (ArchiveBox)
Latest revision as of 05:42, 10 February 2026
Introduction
ArchiveBox is a self-hosted web archiving solution. Unlike a simple bookmark, it takes a "snapshot" of a page in multiple formats so that if the original site goes down, you still have the full content.
- The Outputs: For every URL you save, it creates a PDF, a Screenshot (PNG), a Single-File HTML, and a Wget clone.
- The Goal: To build a searchable, permanent record of the specific web resources you use for research, separate from the broad scale of OpenAlex.
- Synergy: Use OpenAlex to find a paper, use Kiwix for general encyclopedia background, and use ArchiveBox to save the specific blog posts or project wikis that support your work.
Downloading files
By default, ArchiveBox saves each page in as many formats as possible. That is great when petabytes of storage are available, but here there is only 5TB ("only" is a bit ironic, as that was massive just a few years ago), so it is better to limit downloads to the few formats that suit the most likely uses.
The Best Format
Markdown & Clean HTML are the best formats to store. For a researcher, Markdown is the king of formats because it is readable by humans (on a phone or PC) and is the native "language" of LLMs.
| Format | Phone/PC Viewing | LLM Readability | Why it's a winner |
|---|---|---|---|
| Markdown (.md) | Excellent (Any text editor) | Best | Smallest file size, zero "noise" (no ads/scripts), perfect for AI context. |
| Single-File HTML | Excellent (Any browser) | High | Preserves images and layout for you, while the LLM can still parse the text. |
| PDF | Good | Poor | Great for you to look at, but a nightmare for AI to read (tables and columns break). |
| Readability (content.html) | Good | Good | Cleaned text version of the site (like "Reader Mode"). |
| Wget/Media (warc/, screenshot.png) | | | The Heavy Stuff. Huge folders of raw site assets and |
Our Strategy
- ArchiveBox saves in multiple formats simultaneously. When we point AnythingLLM to your archive, we should tell it to look specifically at the article.md or singlefile.html versions to save on "token" costs and improve accuracy. To save space on our 5TB disk, we can tell ArchiveBox to ignore the heavy "Wget" and "Screenshot" methods.
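As a minimal sketch of that selection step, the function below lists only the article.md and singlefile.html files under an ArchiveBox data folder, which could then be handed to AnythingLLM. It assumes snapshots are stored under archive/<timestamp>/ inside the data folder. list_llm_files is a hypothetical helper name for this example, not part of ArchiveBox or AnythingLLM.

```shell
#!/usr/bin/env bash
# List only the lean, LLM-friendly outputs inside an ArchiveBox data folder.
# Assumption: snapshots live under <data_dir>/archive/<timestamp>/ and use
# the file names mentioned above (article.md, singlefile.html).
list_llm_files() {
    local data_dir="$1"
    find "$data_dir/archive" -type f \
        \( -name 'article.md' -o -name 'singlefile.html' \) 2>/dev/null | sort
}

# Example (path taken from the storage setup on this page):
# list_llm_files /mnt/archive_data/archivebox/data
```

Screenshots, WARC files, and other heavy outputs are simply never matched, so they add nothing to the token count even if they exist on disk.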
For a single Add command
Use the EXTRACTORS environment variable to run only the clean-text and single-file tools:
docker exec -it --user=archivebox -e EXTRACTORS='title,singlefile,readability,mercury,markdown' archivebox archivebox add 'https://go.dev/doc/'
Global Change (Recommended)
Update the config so every future archive is small:
docker exec -it --user=archivebox archivebox archivebox config --set FETCH_WGET=False
docker exec -it --user=archivebox archivebox archivebox config --set FETCH_SCREENSHOT=False
docker exec -it --user=archivebox archivebox archivebox config --set FETCH_PDF=False
Alternatively, the same settings can be placed in the environment section of the stack's YAML file in Dockge.
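The three config calls can also be wrapped in a small loop, which is easier to extend if more extractors need switching off later. This is only a sketch: HEAVY_KEYS, DRY_RUN, and disable_heavy_extractors are names invented for this example. It drops the -it flags because a script normally has no terminal attached.

```shell
#!/usr/bin/env bash
# Print (or run) one "config --set" call per heavy extractor.
# HEAVY_KEYS mirrors the three options named above. DRY_RUN is a
# hypothetical switch for this sketch: 1 (default) only prints the commands.
HEAVY_KEYS="FETCH_WGET FETCH_SCREENSHOT FETCH_PDF"

disable_heavy_extractors() {
    local key
    for key in $HEAVY_KEYS; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "docker exec --user=archivebox archivebox archivebox config --set ${key}=False"
        else
            docker exec --user=archivebox archivebox archivebox config --set "${key}=False"
        fi
    done
}

disable_heavy_extractors   # dry run by default: only prints the commands
```

Once the printed commands look right, export DRY_RUN=0 and call the function again to apply them.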
The Infrastructure
ArchiveBox is heavy on disk I/O and storage, which is why it gets the dedicated 5TB drive.
- Host: Blackberry (Proxmox VM)
- Compute: Uses the same 4 Cores / 6GB RAM as the rest of the stack.
- Storage: 5TB XFS disk mounted at /mnt/archive_data/archivebox, plus an additional 4TB XFS disk mounted at /mnt/docker_data for use with The Kiwix Archive
Note: ArchiveBox can grow very fast (approx. 1GB per 1000 articles).
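To put that rule of thumb in perspective, here is a rough back-of-the-envelope calculation. It assumes decimal units (1 TB = 1000 GB) and that the 1GB per 1000 articles figure holds on average. estimate_articles is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Rough capacity estimate from the rule of thumb above (1 GB per 1000 articles).
# Assumes decimal units (1 TB = 1000 GB); real usage depends on which
# extractors are enabled and how media-heavy the saved pages are.
estimate_articles() {
    local disk_gb="$1"
    echo $(( disk_gb * 1000 ))
}

estimate_articles 5000   # 5 TB disk -> prints 5000000
```

At that rate the 5TB disk holds roughly five million articles, so the disk only becomes a constraint when the heavy extractors are left on.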
The Software Stack (Docker)
Installing Docker & Compose
Before installing Dockge, install Docker Engine and the Compose plugin from Docker's official Debian repository.
# Update and install dependencies
sudo apt update && sudo apt install -y ca-certificates curl gnupg
# Add Docker's official GPG key
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL https://download.docker.com/linux/debian/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc
# Add the repository to Apt sources
echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/debian \
  $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
  sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
# Install Docker Engine and Compose Plugin
sudo apt update
sudo apt install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
# Optional: Allow your user to run docker without sudo
sudo usermod -aG docker $USER
Installing Dockge
Dockge allows us to manage our "Stacks" (Docker Compose files) through a clean web interface.
# Preparation: Create directories
mkdir -p /opt/stacks /opt/dockge
cd /opt/dockge
# Download and Start Dockge
curl https://raw.githubusercontent.com/louislam/dockge/master/compose.yaml --output compose.yaml
docker compose up -d
Preparation: Storage Folders
Create the ArchiveBox data folder on the 5TB disk so the container can find it easily.
# Create the directory on the 5TB Archive disk
mkdir -p /mnt/archive_data/archivebox/data
# Set permissions so Docker can initialize the DB
sudo chown -R 911:911 /mnt/archive_data/archivebox/data
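Before initializing the archive, it can help to confirm that the ownership change took effect. The sketch below compares a folder's owner against an expected uid:gid pair. check_owner is a hypothetical helper name, and stat -c is the GNU coreutils form found on Debian.

```shell
#!/usr/bin/env bash
# Report whether a directory is owned by the expected uid:gid.
# 911:911 is the owner used in the chown step above.
# Assumption: GNU stat (as shipped with Debian) is available.
check_owner() {
    local dir="$1" expected="$2" actual
    actual="$(stat -c '%u:%g' "$dir")" || return 2
    if [ "$actual" = "$expected" ]; then
        echo "ok: $dir is owned by $actual"
    else
        echo "mismatch: $dir is owned by $actual, expected $expected"
        return 1
    fi
}

# Example: check_owner /mnt/archive_data/archivebox/data 911:911
```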
Initializing the Archive
Before running the Web UI, ArchiveBox needs to build its internal database. Run this one-time command:
docker run -v /mnt/archive_data/archivebox/data:/data -it archivebox/archivebox init --setup
(Follow the prompts to create your admin username and password.)
ArchiveBox YAML (The Stack)
Deploy this in Dockge on Blackberry and name it archivebox:
version: "3.9"
services:
  archivebox:
    image: archivebox/archivebox:latest
    container_name: archivebox
    ports:
      - 8082:8000
    volumes:
      # Same host folder that was prepared and initialized above
      - /mnt/archive_data/archivebox/data:/data
    environment:
      - ALLOW_EXTERNAL_CONFIG=True
      - ADMIN_USERNAME=nigel
      - ADMIN_PASSWORD=blackberry_archive_2026
      - PUBLIC_ADD_VIEW=True
      - ALLOWED_HOSTS=*
      - REST_API_ENABLED=True
      - REST_API_USER=nigel
      - REST_API_PASS=netgear1
      - CSRF_TRUSTED_ORIGINS=http://192.168.100.85:5678,http://192.168.100.85:8082
      - CSRF_COOKIE_SECURE=False
      - CSRF_IGNORE_PORT=True
      # --- THE LEAN LIST (Enabled) ---
      - SAVE_MARKDOWN=True            # THE BEST for AnythingLLM
      - SAVE_MERCURY=True             # Clean article text
      - SAVE_SINGLEFILE=True          # Best for viewing on phone
      - SAVE_DOM=True                 # Good fallback for LLM context
      # --- THE HEAVY LIST (Disabled) ---
      - SAVE_SCREENSHOT=False         # High disk usage, zero LLM value
      - SAVE_PDF=False                # Hard for LLMs to parse, high disk usage
      - SAVE_WARC=False               # Massive container files for librarians
      - SAVE_MEDIA=False              # No videos/audio (saves TBs of space)
      - SAVE_GIT=False                # No full code repo clones
      - SAVE_ARCHIVE_DOT_ORG=False    # No external submission
      - SAVE_HEADERS=False            # No technical log files
    restart: unless-stopped
networks: {}
Accessing and Using the Archive
| Tool | URL | Purpose |
|---|---|---|
| Web UI | http://blackberry:8082 | Browse, search, and add new URLs to your archive. |
| Admin Panel | http://blackberry:8082/admin | Manage users and deep configuration. |
How to Add Content
- Web UI: Click the "Add +" button at the top right and paste your URLs.
- CLI (Terminal): If you have a text file of URLs, you can pipe them in:
cat my_links.txt | docker exec -i --user=archivebox archivebox archivebox add
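Link files collected by hand often contain blank lines, notes, and repeats. As a small sketch, the filter below keeps only lines that start with http:// or https:// and drops duplicates while preserving order, so the result can be piped into the add command. clean_urls is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Keep only lines that look like http(s) URLs and drop duplicates,
# preserving the order in which URLs first appear.
clean_urls() {
    grep -E '^https?://' "$1" | awk '!seen[$0]++'
}

# Example:
# clean_urls my_links.txt | docker exec -i --user=archivebox archivebox archivebox add
```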
Complete Blackberry Port Map
| Port | Name | Purpose |
|---|---|---|
| 5001 | Dockge | Docker Management |
| 8082 | ArchiveBox | Personal Web Archive |
| 8081 | Kiwix | Offline Wikipedia/Encyclopedias |