OpenAlex: Difference between revisions

Revision as of 06:31, 9 February 2026

🏗️ The OpenAlex Scholar Engine (Tayberry)

📖 Introduction

OpenAlex is a massive, open-source index of the world's scholarly research. It contains over 250 million "Works" (papers, books, etc.), along with millions of authors and institutions.

The Data Structure: More than just a list, OpenAlex is a heterogeneous directed graph. This means it maps complex relationships between different types of entities—linking authors to their papers, papers to their citations, and institutions to their researchers. This structure makes it an incredibly powerful tool for tracking the influence and evolution of scientific thought.
The Goal: To have a local, lightning-fast search engine on Tayberry that allows you to query this entire dataset without relying on public API limits.
The Engine: We use OpenSearch, a high-performance "big data" engine designed for complex filtering and full-text search.
Synergy: OpenAlex serves as a deep data archive that works alongside Kiwix (offline Wikipedia/StackOverflow) and ArchiveBox to create a complete local research library.
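
To make the "heterogeneous directed graph" idea from The Data Structure concrete, here is a minimal sketch in plain Python. The entity IDs (W1, A1, I1) and the edge labels are made up for illustration; they are not real OpenAlex records or field names.

```python
from collections import defaultdict

# Each node has a type; each edge is directed and labelled.
# All IDs and labels below are hypothetical examples.
nodes = {
    "W1": "work",
    "W2": "work",
    "A1": "author",
    "I1": "institution",
}

edges = [
    ("A1", "wrote", "W1"),
    ("A1", "wrote", "W2"),
    ("W2", "cites", "W1"),
    ("A1", "affiliated_with", "I1"),
]

# Index outgoing edges by (source, label) for quick traversal.
outgoing = defaultdict(list)
for source, label, target in edges:
    outgoing[(source, label)].append(target)


def works_by(author_id):
    """Return the works an author wrote."""
    return outgoing[(author_id, "wrote")]


def cited_by(work_id):
    """Return the works that cite the given work (reverse traversal)."""
    return [s for s, label, t in edges if label == "cites" and t == work_id]
```

"Heterogeneous" refers to the mixed node types, and "directed" to the fact that "W2 cites W1" does not imply the reverse.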

💾 The Infrastructure (Tayberry VM)

Tayberry is a dedicated Virtual Machine on our Proxmox host (Ryzen 5).

OS: Debian.
Compute: 8 Cores (Host has 12) and 24GB RAM.
Storage: A dedicated 5TB XFS-formatted disk (/mnt/openalex). OpenAlex is ~300GB compressed but expands into multiple terabytes once indexed and searchable.
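
Because the index grows to several terabytes, it can help to check free space on the data disk before starting a long run. This is a small sketch using only the standard library; the 3 TB threshold in the usage note is a placeholder, not a measured requirement.

```python
import shutil

TB = 1000 ** 4  # bytes in a (decimal) terabyte


def free_terabytes(path):
    """Return free space on the filesystem containing `path`, in TB."""
    return shutil.disk_usage(path).free / TB


def has_room(path, needed_tb):
    """True if at least `needed_tb` terabytes are free at `path`."""
    return free_terabytes(path) >= needed_tb
```

On Tayberry this could be called as has_room("/mnt/openalex", 3) before launching the indexer.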

🐋 The Software Stack (Docker & Python)

🛠️ Installing Dockge

Dockge allows us to manage our "Stacks" (Docker Compose files) through a clean web interface.

# Preparation: Create directories
mkdir -p /opt/stacks /opt/dockge
cd /opt/dockge
# Download and Start Dockge
curl https://raw.githubusercontent.com/louislam/dockge/master/compose.yaml --output compose.yaml
docker compose up -d

🌐 Accessing and Using the Tools

Tool	URL	Purpose
Dockge UI	http://tayberry:5001	Managing the Docker containers/stacks.
Database API	http://tayberry:9200	The backend "brain" where the Python script sends data.
Web GUI	http://tayberry:5601	OpenSearch Dashboards (Visual search and data exploration).

OpenSearch YAML (The Stack)

In Dockge, create a new stack named opensearch, paste in the YAML below, and deploy it. This starts both the engine and the Web GUI.

version: "3.8"
services:
  opensearch-node:
    # Use 'latest' to ensure we get the newest Lucene codecs
    image: opensearchproject/opensearch:latest 
    container_name: tayberry-search
    environment:
      - cluster.name=openalex-cluster
      - discovery.type=single-node
      - bootstrap.memory_lock=true
      - OPENSEARCH_JAVA_OPTS=-Xms12g -Xmx12g
      - DISABLE_INSTALL_DEMO_CONFIG=true
      - DISABLE_SECURITY_PLUGIN=true
    ulimits:
      memlock:
        soft: -1
        hard: -1
      nofile:
        soft: 65536
        hard: 65536
    volumes:
      - /mnt/openalex/opensearch_data:/usr/share/opensearch/data
    ports:
      - 9200:9200
    restart: unless-stopped
  dashboards:
    # Dashboards should also track latest for compatibility
    image: opensearchproject/opensearch-dashboards:latest
    container_name: tayberry-dashboards
    ports:
      - 5601:5601
    environment:
      - OPENSEARCH_HOSTS=["http://opensearch-node:9200"]
      - DISABLE_SECURITY_DASHBOARDS_PLUGIN=true
    depends_on:
      - opensearch-node
    restart: unless-stopped
networks:
  default:
    name: tayberry-net
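
The stack above gives the JVM a 12 GB heap, which is half of the VM's 24 GB. The sketch below expresses that as a rule of thumb. The "half of RAM" rule and the 31 GB cap are common OpenSearch sizing guidance and are assumptions here; this page does not state them.

```python
def java_opts(ram_gb, cap_gb=31):
    """Build an OPENSEARCH_JAVA_OPTS value with min and max heap set equal.

    Heap is half of RAM, never above `cap_gb` and never below 1 GB.
    """
    heap_gb = max(1, min(ram_gb // 2, cap_gb))
    return f"-Xms{heap_gb}g -Xmx{heap_gb}g"
```

For the 24 GB Tayberry VM, java_opts(24) reproduces the value used in the YAML.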

🛠️ Python Environment & Watcher Setup

Creating the indexer_env

sudo apt update && sudo apt install python3-pip python3-venv -y
python3 -m venv ~/indexer_env
source ~/indexer_env/bin/activate
pip install opensearch-py tqdm orjson
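
A quick way to confirm the environment is ready is to ask Python whether the three packages can be found. This is a sketch, not part of the original setup. Note that the opensearch-py package is imported under the name opensearchpy.

```python
import importlib.util

# Import names of the packages installed above
# (opensearch-py is imported as "opensearchpy").
REQUIRED = ["opensearchpy", "tqdm", "orjson"]


def missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]
```

Running missing_modules(REQUIRED) inside the activated indexer_env should return an empty list if the install succeeded.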

Starting the Indexer (Background Mode)

Use screen to ensure the process continues even if you log out.

screen -S openalex_push

Inside the screen:

source ~/indexer_env/bin/activate
python3 /mnt/openalex/index_openalex.py

To Detach: Press Ctrl + A, then D

📊 Monitoring: The Watcher vs. The Web GUI

You have two ways to see if the engine is working:

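The Web GUI (Browser): OpenSearch Dashboards at http://tayberry:5601, for visual search and data exploration.
The Watcher (Terminal): A real-time command-line dashboard.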

watch -n 60 'curl -s "http://localhost:9200/openalex_works_2026/_count?pretty" | grep count'

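Note: If the count stays frozen, the engine is in "High Gear" (buffering): it is still accepting documents, but they only show up in the count after the next index refresh. Stop the watcher with Ctrl+C.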

Starting the Indexer (Background Mode)

Because 250 million records take days to process, we used screen to keep the script running even if you close your laptop or the SSH connection drops.

Open a named screen session:

screen -S openalex_push

Run the script inside the screen:

python3 /mnt/openalex/index_openalex.py

Hide the screen (Detach): Press Ctrl + A, then D.

Starting the Watcher (The Live Dashboard)

The "Watcher" is not a separate piece of software, but a clever use of the Linux watch command. It repeats the OpenSearch "count" request every 60 seconds so you can see progress in real-time.

To start the watcher:

watch -n 60 'curl -s "http://localhost:9200/openalex_works_2026/_count?pretty" | grep count'

Why it looks "frozen": It clears the screen and only updates the text every minute.
How to stop it: Press Ctrl + C to get your prompt back.
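
The Watcher shows only the raw count. Two samples taken a known number of seconds apart are enough to estimate the indexing rate and the time remaining. The sketch below assumes a total of 250 million Works, the approximate figure from the Introduction; the real total may differ.

```python
def docs_per_second(count_before, count_after, seconds):
    """Indexing rate between two _count samples taken `seconds` apart."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return (count_after - count_before) / seconds


def eta_hours(count_now, rate, total=250_000_000):
    """Hours remaining at the given rate; None if the count is not moving."""
    if rate <= 0:
        return None
    return max(total - count_now, 0) / rate / 3600
```

For example, if the count rises from 1,000,000 to 1,060,000 between two 60-second ticks, the rate is 1,000 documents per second.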

📋 Summary of Commands


Action	Command
Enter Env	source ~/indexer_env/bin/activate
Back to Indexer	screen -r openalex_push
Exit Indexer (Safely)	Ctrl + A, then D
Check Progress	curl -s "http://localhost:9200/openalex_works_2026/_count?pretty"
Force Update	curl -X POST "http://localhost:9200/openalex_works_2026/_refresh"
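
The "Force Update" and "Check Progress" commands can also be combined from Python. The sketch below uses only the standard library, so it runs outside indexer_env too. The helper names are illustrative; the URLs are the ones from the table above.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9200"
INDEX_NAME = "openalex_works_2026"


def refresh_request(base_url=BASE_URL, index=INDEX_NAME):
    """Build the POST request that makes buffered documents searchable."""
    return urllib.request.Request(f"{base_url}/{index}/_refresh", method="POST")


def count_request(base_url=BASE_URL, index=INDEX_NAME):
    """Build the GET request that returns the document count."""
    return urllib.request.Request(f"{base_url}/{index}/_count")


def parse_count(body):
    """Extract the integer count from a _count JSON response body."""
    return json.loads(body)["count"]


def refreshed_count():
    """Refresh the index, then return the current count (needs a running engine)."""
    with urllib.request.urlopen(refresh_request(), timeout=30):
        pass
    with urllib.request.urlopen(count_request(), timeout=30) as response:
        return parse_count(response.read())
```

Calling refreshed_count() on Tayberry returns the count right after a refresh. That helps tell a stalled indexer apart from one that is only buffering.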

The Data Ingestion Script

This Python script bridges the gap between the raw .gz source files and the database.

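import gzip
import os
from concurrent.futures import ProcessPoolExecutor

import orjson
from opensearchpy import OpenSearch, helpers
from tqdm import tqdm

# Settings
SOURCE_DIR = "/mnt/openalex/v2026/source_data/works"
INDEX_NAME = "openalex_works_2026"
WORKERS = 4  # Worker processes (not threads); tuned for the VM's 8-core allocation
FILES_ALREADY_INDEXED = 180  # Resume point: number of sorted files already loaded


def get_actions(file_path):
    """Yield one bulk-index action per JSON line in a gzipped source file."""
    with gzip.open(file_path, "rb") as f:
        for line in f:
            if not line.strip():
                continue  # Skip blank lines instead of failing to parse them
            doc = orjson.loads(line)
            work_id = doc.get("id")
            if not work_id:
                continue  # No id means no stable _id; skip rather than crash
            yield {
                "_index": INDEX_NAME,
                # Keep only the last URL segment, e.g. ".../W123" -> "W123"
                "_id": work_id.split("/")[-1],
                "_source": doc,
            }


def process_file(file_path):
    # Create the client inside the worker so each process owns its connections
    client = OpenSearch(["http://localhost:9200"], timeout=300)
    helpers.bulk(client, get_actions(file_path), chunk_size=500, request_timeout=300)


def main():
    all_files = sorted(
        os.path.join(root, name)
        for root, _dirs, names in os.walk(SOURCE_DIR)
        for name in names
        if name.endswith(".gz")
    )
    # Resume logic: skip what we've already done
    files_to_index = all_files[FILES_ALREADY_INDEXED:]

    with ProcessPoolExecutor(max_workers=WORKERS) as executor:
        list(tqdm(executor.map(process_file, files_to_index), total=len(files_to_index)))


if __name__ == "__main__":
    main()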
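
The script resumes by skipping a fixed number of files (180), which has to be edited by hand after every interruption. One alternative, sketched below, is to record each finished file in a checkpoint file and skip whatever is listed there. This is a swapped-in technique, not the method used on Tayberry, and the checkpoint path is up to you.

```python
import os


def load_done(checkpoint_path):
    """Return the set of file paths already recorded as indexed."""
    if not os.path.exists(checkpoint_path):
        return set()
    with open(checkpoint_path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}


def mark_done(checkpoint_path, file_path):
    """Append a finished file to the checkpoint."""
    with open(checkpoint_path, "a", encoding="utf-8") as f:
        f.write(file_path + "\n")


def pending(all_files, checkpoint_path):
    """Return the files still to index, preserving their order."""
    done = load_done(checkpoint_path)
    return [path for path in all_files if path not in done]
```

With several worker processes, it is simplest to call mark_done from the parent process as each result comes back, so that only one process ever writes to the checkpoint.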

check_progress.sh

Add this tiny script to your indexer_env to log progress to a file. This way, you don't have to watch the screen; you can just check the log.

sudo nano check_progress.sh

and paste the following script

#!/bin/bash
echo "$(date): $(curl -s "localhost:9200/openalex_works_2026/_count?pretty" | grep count)" >> /mnt/openalex/indexing_log.txt
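
Once the log has a few entries, a short Python helper can pull the counts back out and flag a stall. This sketch assumes each log line looks like the output of the script above: a date, a colon, then the "count" line from the pretty-printed response. The sample format is an assumption, not a captured log.

```python
import re

# Matches the pretty-printed field, e.g.  "count" : 123456,
COUNT_PATTERN = re.compile(r'"count"\s*:\s*(\d+)')


def counts_from_log(lines):
    """Return the document counts found in the log lines, in order."""
    counts = []
    for line in lines:
        match = COUNT_PATTERN.search(line)
        if match:
            counts.append(int(match.group(1)))
    return counts


def stalled(counts, samples=3):
    """True if the last `samples` logged counts are all identical."""
    recent = counts[-samples:]
    return len(recent) == samples and len(set(recent)) == 1
```

Lines where curl returned nothing (for example while the engine was restarting) carry no count and are skipped.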

How to Start & Monitor

Start Engine: Use the Dockge UI or docker compose up -d.
Start Indexer: Run these three commands in order:
screen -S openalex_push
source ~/indexer_env/bin/activate
python3 /mnt/openalex/index_openalex.py
The Webpage: OpenSearch doesn't have a "built-in" search bar, so we use the API or a dashboard:
- Check Health: http://<TAYBERRY_IP>:9200
- Check Count: http://<TAYBERRY_IP>:9200/openalex_works_2026/_count

How the tool will be used

Once the 250M papers are in, we will use this tool for:

Academic Discovery: Finding every paper ever written on a specific niche topic.
Trend Analysis: Seeing how research topics (like "AI" or "Graphene") have grown over decades.
Local Knowledge Base: Connecting this data to your local AI (like AnythingLLM) so it can cite real papers when answering your questions.
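
As an illustration of the Trend Analysis use case, the sketch below builds an OpenSearch query body that matches a topic and counts the hits per publication year. It assumes the indexed documents keep the OpenAlex field names title and publication_year, and that publication_year was mapped as a number. The function name is hypothetical.

```python
def trend_query(topic, field="title"):
    """Search body: match `topic` in `field`, bucket the hits by publication year."""
    return {
        "size": 0,  # return only the aggregation, not the matching documents
        "query": {"match": {field: topic}},
        "aggs": {
            "per_year": {
                "terms": {
                    "field": "publication_year",
                    "size": 200,  # enough buckets to cover many decades
                    "order": {"_key": "asc"},  # oldest year first
                }
            }
        },
    }
```

With opensearch-py, this body could be sent as client.search(index="openalex_works_2026", body=trend_query("graphene")). The per-year counts then appear under aggregations.per_year.buckets in the response.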

