The Kiwix Archive
Latest revision as of 20:29, 20 April 2026
Introduction
Kiwix is an offline content reader that allows you to browse massive websites, such as Wikipedia, StackExchange, or Project Gutenberg, without an internet connection.
- The Format: It uses highly compressed .ZIM files. A single file can contain the entirety of Wikipedia (with images) or a complete medical encyclopedia.
- The Goal: To provide a permanent, offline knowledge base that remains accessible even if the internet is down, serving everyone on your local network.
- Synergy: Works alongside OpenAlex (scholarly search) and ArchiveBox (personal web snapshots) to create a three-tier local research library.
Once this collection of ZIM files is downloaded, we can also extract the text from them to create a collection of Markdown (.md) files. These can serve as a Retrieval-Augmented Generation (RAG) database for offline LLMs, or the same data could be used as fine-tuning data for one or more LLMs. Some of the .md files will be general knowledge and only really useful for the RAG database, but others will be detailed question-and-answer tables of condensed information; the Stack Overflow archive, for example, will yield high-density data sets.
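The extraction step could look something like the sketch below. It is not the method used on Blackberry, only one plausible approach: it assumes zimdump (from the zim-tools package), pandoc, and file are installed, and the function names and example paths are hypothetical. Check `zimdump --help` on your version before relying on the `dump --dir` form.

```shell
#!/usr/bin/env bash
# Sketch only: assumes zimdump (zim-tools), pandoc and file are installed.
# Function names and all paths are hypothetical illustrations.

# Map a dumped page to its Markdown path, e.g. DUMP/A/Foo.html -> OUT/A/Foo.md
md_path_for() {
  dump_root=$1; out_root=$2; page=$3
  rel=${page#"$dump_root"/}   # path relative to the dump directory
  rel=${rel%.html}            # drop a trailing .html or .htm if present
  rel=${rel%.htm}
  printf '%s/%s.md\n' "$out_root" "$rel"
}

# Dump one ZIM to a directory, then convert every HTML page to Markdown.
zim_to_markdown() {
  zim=$1; dump_dir=$2; out_dir=$3
  zimdump dump --dir="$dump_dir" "$zim" || return 1
  find "$dump_dir" -type f -print0 |
    while IFS= read -r -d '' page; do
      # Many ZIMs store articles without a file extension, so test the content type.
      [ "$(file -b --mime-type "$page")" = "text/html" ] || continue
      md=$(md_path_for "$dump_dir" "$out_dir" "$page")
      mkdir -p "$(dirname "$md")"
      pandoc -f html -t gfm -o "$md" "$page"
    done
}

# Example (hypothetical file name and output directory):
# zim_to_markdown example.zim /tmp/zim_dump /mnt/docker_data/markdown
```

Keeping the directory layout of the dump in the Markdown output makes it easier to trace a RAG hit back to the original article. A full Wikipedia dump produces millions of files, so it is worth trying this on a small ZIM first.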
The Infrastructure
Blackberry has been slimmed down to be more efficient now that indexing is handled elsewhere.
- Host: Blackberry
- VM Config: Debian | 4 Cores | 6GB RAM.
- Storage: A 4TB XFS disk mounted at /mnt/docker_data, plus an additional 5TB XFS disk for ArchiveBox mounted at /mnt/archive_data.
Linux Commands
A set of Linux commands for monitoring the progress of indexing and file copying.
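The linked page is not reproduced here, but as a rough illustration of the kind of command it refers to, the sketch below reports how far a partially copied file has progressed. The function name zim_progress is hypothetical, and the expected size has to be supplied by hand (for example from the mirror's directory listing).

```shell
# Sketch only: zim_progress is a hypothetical helper, not part of Kiwix.
# Usage: zim_progress FILE EXPECTED_BYTES
zim_progress() {
  file=$1; expected=$2
  current=$(wc -c < "$file")                  # bytes written so far
  echo "$(( current * 100 / expected ))%"     # integer percentage
}

# Other commands that are handy while a copy or index run is in flight:
#   watch -n 10 'ls -lh /mnt/docker_data/stacks/kiwix-archive/zim/'
#   df -h /mnt/docker_data
```

The result is rounded down, so a file only reports 100% once every byte has arrived.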
The Software Stack (Docker)
Installing Docker & Compose
Before installing Dockge, we must install the Docker engine and the Compose plugin from Docker's official Debian repository.
# Update and install dependencies
sudo apt update && sudo apt install -y ca-certificates curl gnupg

# Add Docker's official GPG key
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL https://download.docker.com/linux/debian/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc

# Add the repository to Apt sources
echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/debian \
  $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
  sudo tee /etc/apt/sources.list.d/docker.list > /dev/null

# Install Docker Engine and Compose Plugin
sudo apt update
sudo apt install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin

# Optional: Allow your user to run docker without sudo
sudo usermod -aG docker "$USER"
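After the install it is worth confirming that both the engine and the Compose plugin respond. The following is a minimal sketch of such a check; require_cmd and verify_docker are hypothetical helper names, and hello-world is Docker's standard test image.

```shell
# Sketch only: helper names are hypothetical.
# Succeeds if the named command is on PATH, otherwise prints a note and fails.
require_cmd() {
  command -v "$1" >/dev/null 2>&1 || { echo "missing: $1" >&2; return 1; }
}

# Check the engine, the Compose plugin, and the ability to run a container.
verify_docker() {
  require_cmd docker || return 1
  docker --version &&
    docker compose version &&
    docker run --rm hello-world
}

# Example: verify_docker && echo "Docker is ready"
```

Note that the group change made by usermod only takes effect in a new login session, so either log out and back in or run the check with sudo the first time.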
Installing Dockge
Dockge allows us to manage our "Stacks" (Docker Compose files) through a clean web interface.
# Preparation: Create directories
mkdir -p /opt/stacks /opt/dockge
cd /opt/dockge

# Download and Start Dockge
curl https://raw.githubusercontent.com/louislam/dockge/master/compose.yaml --output compose.yaml
docker compose up -d
Preparation: Storage Folders
Organize the ZIM files on the 4TB disk (mounted at /mnt/docker_data) so the container can find them easily.
mkdir -p /mnt/docker_data/stacks/kiwix-archive/zim/
Kiwix YAML (The Stack)
Deploy this in your Dockge instance on Blackberry (port 5001) and name the stack kiwix.
services:
  kiwix:
    image: ghcr.io/kiwix/kiwix-serve:latest
    container_name: kiwix_wikipedia
    volumes:
      - /mnt/docker_data/stacks/kiwix-archive/zim:/data
    ports:
      - 8081:8080
    command:
      - --library
      - library.xml
    restart: unless-stopped
networks: {}
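In the stack above, the `8081:8080` entry publishes the container's port 8080 on host port 8081. A quick way to confirm the service is answering after deployment might look like the sketch below; host_port and kiwix_up are hypothetical helper names, and the check assumes curl is available on the machine running it.

```shell
# Sketch only: helper names are hypothetical.
# Extract the host side of a Compose "HOST:CONTAINER" port mapping.
host_port() {
  echo "${1%%:*}"
}

# Succeed if kiwix-serve answers on the published host port.
kiwix_up() {
  host=$1; mapping=$2
  curl -fsS -o /dev/null "http://${host}:$(host_port "$mapping")/"
}

# Example: kiwix_up blackberry 8081:8080 && echo "Kiwix is serving"
```

If the check fails right after the first deploy, a likely cause is that library.xml does not exist yet in the mounted directory, since the container is started with `--library library.xml`.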
Accessing and Using the Library
| Tool | URL | Purpose |
|---|---|---|
| Kiwix Web UI | http://blackberry:8081 | Browsing your downloaded offline libraries (host port 8081 maps to the container's port 8080). |
| ZIM Library | https://kiwix.org/en/download | Where to download new content (Wikipedia, StackOverflow, etc.). |
| German ZIM mirror (FAU) | https://ftp.fau.de/kiwix/zim/ | Directory listings of ZIM files. |
| Kiwix.org downloads | https://download.kiwix.org/zim/ | Further directory listings of ZIM files. |
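ZIM files are large, so a resumable download is worth having. The sketch below is one way to do it with curl (installed earlier as a Docker dependency): it writes to a `.part` file and renames it only once the transfer succeeds. The function names, the `.part` convention, and the example file name are assumptions for illustration, not something the mirrors require.

```shell
# Sketch only: helper names and the .part naming convention are assumptions.
# Derive the local file name from a download URL, ignoring any query string.
zim_name() {
  url=$1
  basename "${url%%\?*}"
}

# Download URL into DIR, resuming a previous partial transfer if one exists.
fetch_zim() {
  url=$1; dir=$2
  name=$(zim_name "$url")
  curl -fL -C - -o "$dir/$name.part" "$url" &&
    mv "$dir/$name.part" "$dir/$name"
}

# Example (hypothetical file name):
# fetch_zim https://download.kiwix.org/zim/example.zim /mnt/docker_data/stacks/kiwix-archive/zim
```

Because the final name only appears after a successful transfer, anything still ending in `.part` can safely be treated as incomplete.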
Indexing Files after download
There are three helper scripts in the /mnt/docker_data/stacks/kiwix-archive/zim/ directory that help add new ZIM files to the XML index (library.xml).
| Filename | Purpose |
|---|---|
| ./audit_incomplete.sh | Scans the ZIM directory for incomplete files and moves them to the incomplete directory. |
| ./sweep_parts.sh | Checks the incomplete directory for files that have a corresponding fully downloaded ZIM. |
| ./index_vault.sh | Adds new ZIM files to the XML index file. |
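The scripts themselves are not listed on this page, so the following is only a guess at what they might do, written as three functions. It assumes incomplete downloads carry a `.part` suffix, that the incomplete directory sits inside the ZIM directory, and that kiwix-manage (from kiwix-tools) is available for indexing; none of that is confirmed by the originals.

```shell
# Hypothetical reconstruction: the real scripts may work differently.
ZIM_DIR=${ZIM_DIR:-/mnt/docker_data/stacks/kiwix-archive/zim}
INCOMPLETE_DIR=${INCOMPLETE_DIR:-$ZIM_DIR/incomplete}
KIWIX_MANAGE=${KIWIX_MANAGE:-kiwix-manage}   # assumed indexing tool

# Move partial downloads (assumed to end in .part) out of the ZIM directory.
audit_incomplete() {
  mkdir -p "$INCOMPLETE_DIR"
  for f in "$ZIM_DIR"/*.part; do
    [ -e "$f" ] || continue          # glob matched nothing
    mv "$f" "$INCOMPLETE_DIR"/
  done
}

# Report partial files whose completed ZIM now exists in the ZIM directory.
sweep_parts() {
  for f in "$INCOMPLETE_DIR"/*.part; do
    [ -e "$f" ] || continue
    if [ -f "$ZIM_DIR/$(basename "$f" .part)" ]; then
      echo "superseded: $f"
    fi
  done
}

# Register every ZIM in the directory with library.xml.
index_vault() {
  for zim in "$ZIM_DIR"/*.zim; do
    [ -e "$zim" ] || continue
    "$KIWIX_MANAGE" "$ZIM_DIR/library.xml" add "$zim"
  done
}
```

This version of sweep_parts only reports superseded files rather than deleting them, which leaves the decision to the operator. Whatever tool writes library.xml, the paths it records need to resolve inside the container, where this directory is mounted at /data.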