OpenAlex: Difference between revisions
Wikisailor (talk | contribs) |
Revision as of 06:08, 9 February 2026
🏗️ The OpenAlex Scholar Engine (Tayberry)
📖 Introduction
OpenAlex is a massive, open-source index of the world's scholarly research. It contains over 250 million "Works" (papers, books, etc.), along with millions of authors and institutions.
- The Data Structure: More than just a list, OpenAlex is a heterogeneous directed graph. This means it maps complex relationships between different types of entities—linking authors to their papers, papers to their citations, and institutions to their researchers. This structure makes it an incredibly powerful tool for tracking the influence and evolution of scientific thought.
- The Goal: To have a local, lightning-fast search engine on Tayberry that allows you to query this entire dataset without relying on public API limits.
- The Engine: We use OpenSearch, a high-performance "big data" engine designed for complex filtering and full-text search.
- Synergy: OpenAlex serves as a deep data archive that works alongside Kiwix (offline Wikipedia/StackOverflow) and ArchiveBox to create a complete local research library.
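To make the graph idea concrete, here is a minimal sketch of how one Work record could be unpacked into typed edges. It assumes the `authorships`, `institutions` and `referenced_works` fields of the OpenAlex Work schema; the `work_to_edges` helper, the relation names and the sample record are invented for illustration and are not part of the Tayberry setup.

```python
def short_id(url):
    """Turn 'https://openalex.org/W123' into 'W123'."""
    return url.rsplit("/", 1)[-1]


def work_to_edges(work):
    """Return (source, relation, target) triples for one Work record."""
    edges = []
    work_id = short_id(work["id"])
    for authorship in work.get("authorships", []):
        author_id = short_id(authorship["author"]["id"])
        edges.append((author_id, "wrote", work_id))
        for inst in authorship.get("institutions", []):
            edges.append((author_id, "affiliated_with", short_id(inst["id"])))
    for cited in work.get("referenced_works", []):
        edges.append((work_id, "cites", short_id(cited)))
    return edges


# Hypothetical sample record (IDs invented for the example)
sample = {
    "id": "https://openalex.org/W1",
    "authorships": [
        {
            "author": {"id": "https://openalex.org/A1"},
            "institutions": [{"id": "https://openalex.org/I1"}],
        }
    ],
    "referenced_works": ["https://openalex.org/W2"],
}

for edge in work_to_edges(sample):
    print(edge)
```

Each triple is one directed edge, and the node types differ (author, work, institution), which is what "heterogeneous" means here.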
The Infrastructure (Tayberry VM)
We built a dedicated Virtual Machine named Tayberry to act as the "Search Host."
- OS: Ubuntu/Debian.
- Storage: A dedicated 5TB XFS-formatted disk (/mnt/openalex). This is crucial because OpenAlex data is over 300GB compressed, but expands significantly once indexed.
- Compute: 8 Cores and 24GB RAM to handle the heavy math of indexing.
The Software Stack (🐋 Docker & Python)
Instead of installing complex software directly on the OS, we use Docker to keep it clean.
- Docker Container: Runs OpenSearch 3.4.0. It is configured with a 12GB Heap (RAM) to manage the data flow.
- Source Data: Located at /mnt/openalex/v2026/source_data/works. These are thousands of .gz files containing JSON data.
- The Indexer: A custom Python script (index_openalex.py) that acts as a "bridge." It unzips the files, reads the JSON, and pushes them into the OpenSearch engine in batches.
🛠️ Installing Dockge on Tayberry
Preparation
First, we ensured Docker and the necessary directory structure were ready. We created a dedicated folder to store Dockge's own data and the "stacks" (your container configurations).
# Create directories for Dockge
mkdir -p /opt/stacks /opt/dockge
cd /opt/dockge
Downloading the Compose File
We used curl to pull the official installation file directly from the Dockge maintainers.
curl https://raw.githubusercontent.com/louislam/dockge/master/compose.yaml --output compose.yaml
Starting Dockge
We launched the container using Docker Compose. This set up the web interface on port 5001.
docker compose up -d
🌐 Accessing and Using the Tools
- The Management UI (Dockge)
  - URL: http://tayberry:5001
  - Use: This is where you go to edit your OpenSearch configuration, restart the database, or add new tools (like a search UI later).
- The Database API (OpenSearch)
  - URL: http://tayberry:9200
  - Use: This is the "brain." It doesn't have a pretty webpage of its own; it responds to the Python script and curl commands.
- The OpenSearch Web GUI (OpenSearch Dashboards)
  - URL: http://tayberry:5601
- The Indexer (Python)
  - Location: Terminal via SSH.
  - Use: This is the "shoveler." It’s currently in the background moving data from your 5TB disk into the database.
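If you want a quick way to see which of these tools is answering, the following sketch tries a plain TCP connection to each port. The ports come from the list above; the `SERVICES` table and both helper functions are illustrative names, and the check only proves that something is listening, not that the service is healthy.

```python
import socket

# Ports taken from the list above; the names are just labels for the printout
SERVICES = {"Dockge": 5001, "OpenSearch API": 9200, "OpenSearch Dashboards": 5601}


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolvable host names
        return False


def check_services(host="tayberry"):
    """Map each service name to True/False depending on reachability."""
    return {name: port_is_open(host, port) for name, port in SERVICES.items()}
```

Calling `check_services()` from any machine that can resolve `tayberry` returns a small dictionary you can print or log; pass an IP address as `host` if the name does not resolve.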
OpenSearch installation
With Dockge installed, deploying OpenSearch is simply a case of clicking Compose, pasting in the YAML file below, setting the stack name to opensearch, and clicking Deploy. The images are downloaded if they are not already present, and then the containers start.
OpenSearch YAML
version: "3.8"
services:
  opensearch-node:
    # Use 'latest' to ensure we get the newest Lucene codecs
    image: opensearchproject/opensearch:latest
    container_name: tayberry-search
    environment:
      - cluster.name=openalex-cluster
      - discovery.type=single-node
      - bootstrap.memory_lock=true
      - OPENSEARCH_JAVA_OPTS=-Xms12g -Xmx12g
      - DISABLE_INSTALL_DEMO_CONFIG=true
      - DISABLE_SECURITY_PLUGIN=true
    ulimits:
      memlock:
        soft: -1
        hard: -1
      nofile:
        soft: 65536
        hard: 65536
    volumes:
      - /mnt/openalex/opensearch_data:/usr/share/opensearch/data
    ports:
      - 9200:9200
    restart: unless-stopped
  dashboards:
    # Dashboards should also track latest for compatibility
    image: opensearchproject/opensearch-dashboards:latest
    container_name: tayberry-dashboards
    ports:
      - 5601:5601
    environment:
      - OPENSEARCH_HOSTS=["http://opensearch-node:9200"]
      - DISABLE_SECURITY_DASHBOARDS_PLUGIN=true
    depends_on:
      - opensearch-node
    restart: unless-stopped
networks:
  default:
    name: tayberry-net
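The heap line is the setting most worth double-checking before deploying. As a rough sanity check, the sketch below parses an `OPENSEARCH_JAVA_OPTS` string and compares it with the VM's RAM, using the common guidance that the heap should be fixed (`-Xms` equal to `-Xmx`) and no more than about half of system memory. The function names are made up for this example, and only values written with a `g` suffix are understood.

```python
import re


def parse_heap_gb(java_opts):
    """Extract (-Xms, -Xmx) in whole gigabytes from a JVM options string."""
    sizes = {}
    for flag, value in re.findall(r"-(Xms|Xmx)(\d+)g\b", java_opts):
        sizes[flag] = int(value)
    return sizes.get("Xms"), sizes.get("Xmx")


def heap_warnings(java_opts, vm_ram_gb):
    """Return human-readable warnings; an empty list means nothing looked off."""
    xms, xmx = parse_heap_gb(java_opts)
    warnings = []
    if xms is None or xmx is None:
        warnings.append("could not find both -Xms and -Xmx (only 'g' units are parsed)")
        return warnings
    if xms != xmx:
        warnings.append("-Xms and -Xmx differ; equal values avoid heap resizing")
    if xmx > vm_ram_gb / 2:
        warnings.append("heap is more than half of RAM; little is left for the file cache")
    return warnings


# Values from this page: 12GB heap on a 24GB VM
print(heap_warnings("-Xms12g -Xmx12g", vm_ram_gb=24))
```

With the values used on Tayberry the list comes back empty, since 12GB is exactly half of the VM's 24GB.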
🛠️ Python Environment & Watcher Setup
Installing Python & Virtual Environment
Since Tayberry is running a modern Linux (Ubuntu/Debian), Python 3 was already present, but we needed to install the pip package manager and the venv module to keep our project isolated.
# Update the system and install Python tools
sudo apt update
sudo apt install python3-pip python3-venv -y
⛏️ Creating & Starting indexer_env
We created a virtual environment so that the OpenSearch libraries wouldn't interfere with the rest of the system.
- Create the environment:
python3 -m venv ~/indexer_env
- Start (Activate) the environment:
source ~/indexer_env/bin/activate
(We know it's working when we see (indexer_env) appear before the username in the terminal.)
- Install the "Worker" libraries:
pip install opensearch-py tqdm orjson
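Before starting a multi-day run, it can be worth confirming that the libraries really landed in the active environment. This is a small sketch using only the standard library; `REQUIRED` and `missing_modules` are names chosen for the example. Note that the `opensearch-py` package is imported as `opensearchpy`.

```python
import importlib.util

# Import names for the packages installed above
# (opensearch-py is imported as "opensearchpy")
REQUIRED = ["opensearchpy", "tqdm", "orjson"]


def missing_modules(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED)
print("All set" if not missing else f"Missing: {', '.join(missing)}")
```

Run it with the environment activated; if anything is listed as missing, the most likely cause is that pip installed into the system Python instead of indexer_env.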
Starting the Indexer (Background Mode)
Because 250 million records take days to process, we used screen to keep the script running even if you close your laptop or the SSH connection drops.
- Open a named screen session:
screen -S openalex_push
- Run the script inside the screen:
python3 /mnt/openalex/index_openalex.py
- Hide the screen (Detach): Press Ctrl + A, then D.
Starting the Watcher (The Live Dashboard)
The "Watcher" is not a separate piece of software, but a clever use of the Linux watch command. It repeats the OpenSearch "count" request every 60 seconds so you can see progress in real-time.
- To start the watcher:
watch -n 60 'curl -s "http://localhost:9200/openalex_works_2026/_count?pretty" | grep count'
- Why it looks "frozen": It clears the screen and only updates the text every minute.
- How to stop it: Press Ctrl + C to get your prompt back.
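The same polling idea can be written in Python if you also want a rough time-to-finish estimate. The sketch below is an alternative to the watch command, not what is running on Tayberry. It assumes the `_count` endpoint shown above and uses the 250 million figure from the introduction as an approximate target; `fetch_count`, `eta_hours` and `watch` are illustrative names.

```python
import json
import time
import urllib.request

COUNT_URL = "http://localhost:9200/openalex_works_2026/_count"
TOTAL_WORKS = 250_000_000  # rough figure from the introduction, not an exact target


def fetch_count(url=COUNT_URL):
    """Ask OpenSearch how many documents the index currently holds."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)["count"]


def eta_hours(previous, current, interval_s, total=TOTAL_WORKS):
    """Estimate hours remaining from two counts taken interval_s seconds apart."""
    rate = (current - previous) / interval_s  # documents per second
    if rate <= 0:
        return None  # no progress in this window, so no estimate
    return (total - current) / rate / 3600


def watch(interval_s=60):
    """Print the count and a rough ETA once per interval until Ctrl + C."""
    previous = fetch_count()
    while True:
        time.sleep(interval_s)
        current = fetch_count()
        eta = eta_hours(previous, current, interval_s)
        note = "stalled" if eta is None else f"about {eta:.1f} h left"
        print(f"{current:,} docs, {note}")
        previous = current
```

Call `watch()` from a Python prompt inside the virtual environment. The estimate is only as good as the last interval, so expect it to jump around as file sizes vary.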
📋 Summary of Commands
| Action | Command |
| Enter Env | source ~/indexer_env/bin/activate |
| Back to Indexer | screen -r openalex_push |
| Exit Indexer (Safely) | Ctrl + A, then D |
| Check Progress | curl -s "http://localhost:9200/openalex_works_2026/_count?pretty" |
| Force Update | curl -X POST "http://localhost:9200/openalex_works_2026/_refresh" |
The Data Ingestion Script
This Python script bridges the gap between the raw .gz source files and the database.
import gzip
import os
from concurrent.futures import ProcessPoolExecutor

import orjson
from opensearchpy import OpenSearch, helpers
from tqdm import tqdm

# Settings
SOURCE_DIR = "/mnt/openalex/v2026/source_data/works"
INDEX_NAME = "openalex_works_2026"
THREADS = 4  # Worker processes; optimized for 8-core Ryzen 5 allocation

client = OpenSearch(["http://localhost:9200"], timeout=300)


def get_actions(file_path):
    """Yield one bulk action per JSON line in a gzipped source file."""
    with gzip.open(file_path, 'rb') as f:
        for line in f:
            doc = orjson.loads(line)
            yield {
                "_index": INDEX_NAME,
                # Use the last part of the OpenAlex URL as the document ID
                "_id": doc.get("id").split("/")[-1],
                "_source": doc,
            }


def process_file(file_path):
    """Push one source file into OpenSearch in batches of 500 documents."""
    actions = get_actions(file_path)
    helpers.bulk(client, actions, chunk_size=500, request_timeout=300)


def main():
    # Collect every .gz file under SOURCE_DIR in a stable, sorted order
    all_files = sorted([os.path.join(r, f)
                        for r, d, fs in os.walk(SOURCE_DIR)
                        for f in fs if f.endswith(".gz")])

    # Resume logic: skip what we've already done (the first 180 files)
    files_to_index = all_files[180:]

    with ProcessPoolExecutor(max_workers=THREADS) as executor:
        list(tqdm(executor.map(process_file, files_to_index), total=len(files_to_index)))


if __name__ == "__main__":
    main()
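The resume logic in the script relies on a hand-edited slice, which has to be updated after every interruption. One possible alternative, sketched here as an assumption and not as part of the running script, is to record each finished file in a small checkpoint text file and filter against it on startup. The helper names and the checkpoint file are hypothetical.

```python
import os


def load_done(checkpoint_path):
    """Return the set of file paths already recorded as finished."""
    if not os.path.exists(checkpoint_path):
        return set()
    with open(checkpoint_path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}


def mark_done(checkpoint_path, file_path):
    """Append one finished file to the checkpoint."""
    with open(checkpoint_path, "a", encoding="utf-8") as f:
        f.write(file_path + "\n")


def pending_files(all_files, checkpoint_path):
    """Keep only the files that are not in the checkpoint, preserving order."""
    done = load_done(checkpoint_path)
    return [path for path in all_files if path not in done]
```

In `main()` this would replace the slice with something like `pending_files(all_files, checkpoint_path)`, calling `mark_done` in the parent process as each result comes back. Because every document ID is derived from the OpenAlex ID, a file that was only partly indexed before a crash can simply be processed again: the repeated documents overwrite themselves instead of creating duplicates.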
check_progress.sh
Add this tiny script to your indexer_env to log progress to a file. This way, you don't have to watch the screen; you can just check the log.
sudo nano check_progress.sh
and paste the following script
#!/bin/bash
# Append a timestamped document count to the log (URL quoted so the shell leaves "?" alone)
echo "$(date): $(curl -s "localhost:9200/openalex_works_2026/_count?pretty" | grep count)" >> /mnt/openalex/indexing_log.txt
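Once the log has a few entries, a short Python sketch can turn it into "documents added per entry". It assumes each line contains the `"count" : N` fragment that grep leaves behind, and it skips lines where curl returned nothing. The sample lines and function names are made up for the example.

```python
import re

COUNT_RE = re.compile(r'"count"\s*:\s*(\d+)')


def counts_from_log(lines):
    """Pull the document count out of each log line, skipping lines without one."""
    counts = []
    for line in lines:
        match = COUNT_RE.search(line)
        if match:
            counts.append(int(match.group(1)))
    return counts


def growth(counts):
    """Documents added between consecutive log entries."""
    return [later - earlier for earlier, later in zip(counts, counts[1:])]


# Made-up sample lines in the shape the script writes
sample = [
    'Mon Feb  9 06:00:00 UTC 2026:   "count" : 1000,',
    'Mon Feb  9 07:00:00 UTC 2026: ',  # curl failed, so there was nothing to grep
    'Mon Feb  9 08:00:00 UTC 2026:   "count" : 4500,',
]
print(growth(counts_from_log(sample)))
```

To run it against the real log, read the lines with `open("/mnt/openalex/indexing_log.txt")` and pass the file object to `counts_from_log`.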
How to Start & Monitor
- Start Engine: Use the Dockge UI or docker compose up -d.
- Start Indexer:
```bash
screen -S openalex_push
source ~/indexer_env/bin/activate
python3 /mnt/openalex/index_openalex.py
```
- The Webpage: OpenSearch doesn't have a "built-in" search bar, so we use the API or a dashboard:
  - Check Health: http://<TAYBERRY_IP>:9200
  - Check Count: http://<TAYBERRY_IP>:9200/openalex_works_2026/_count
How the tool will be used
Once the 250M papers are in, we will use this tool for:
- Academic Discovery: Finding every paper ever written on a specific niche topic.
- Trend Analysis: Seeing how research topics (like "AI" or "Graphene") have grown over decades.
- Local Knowledge Base: Connecting this data to your local AI (like AnythingLLM) so it can cite real papers when answering your questions.
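As a sketch of what the first two uses might look like in practice, the functions below build OpenSearch request bodies for a topic search and for a per-year trend count. They assume the `title` and `publication_year` fields from the OpenAlex Work schema and that dynamic mapping indexed the year as a number; the function names are illustrative, and real queries would likely need tuning.

```python
def topic_search(phrase, size=10):
    """Full-text search body: works whose title matches the phrase."""
    return {
        "size": size,
        "query": {"match": {"title": phrase}},
    }


def yearly_trend(phrase):
    """Aggregation body: number of matching works per publication year."""
    return {
        "size": 0,  # only the buckets are needed, not the hits
        "query": {"match": {"title": phrase}},
        "aggs": {
            "per_year": {
                "terms": {
                    "field": "publication_year",
                    "size": 200,  # enough buckets to cover two centuries
                    "order": {"_key": "asc"},
                }
            }
        },
    }
```

With the client from the ingestion script, a call such as `client.search(index="openalex_works_2026", body=yearly_trend("graphene"))` would return one bucket per year under `aggregations.per_year.buckets`.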