OpenAlex: Difference between revisions

From Sea of Fate

Revision as of 05:24, 9 February 2026

📖Introduction

OpenAlex is a massive, open-source index of the world's scholarly research. It contains over 250 million "Works" (papers, books, etc.), along with millions of authors and institutions.

  • The Goal: To have a local, fast search engine on Tayberry that lets you query the entire 250M+ dataset without relying on the public OpenAlex API.
  • The Engine: We use OpenSearch, a "big data" search engine that excels at full-text search and complex filtering.
  • We can access the OpenSearch dashboard at http://tayberry:5601/

OpenAlex will be another Data Archive that can be used alongside Kiwix and the ArchiveBox viewer, which we have yet to set up.

The Infrastructure (Tayberry VM)

We built a dedicated Virtual Machine named Tayberry to act as the "Search Host."

  • OS: Ubuntu/Debian.
  • Storage: A dedicated 5TB XFS-formatted disk (/mnt/openalex). This is crucial because OpenAlex data is over 300GB compressed, but expands significantly once indexed.
  • Compute: 8 Cores and 24GB RAM to handle the heavy math of indexing.

The Software Stack (🐋Docker & Python)

Instead of installing complex software directly on the OS, we use Docker to keep it clean.

  • Docker Container: Runs OpenSearch 3.4.0. It is configured with a 12GB Heap (RAM) to manage the data flow.
  • Source Data: Located at /mnt/openalex/v2026/source_data/works. These are thousands of .gz files containing JSON data.
  • The Indexer: A custom Python script (index_openalex.py) that acts as a "bridge." It unzips the files, reads the JSON, and pushes them into the OpenSearch engine in batches.
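The "unzip, read, push in batches" flow of the indexer can be illustrated with a minimal sketch. The helper names `read_json_lines` and `batched` are hypothetical and use only the standard library; the actual script relies on the opensearch-py bulk helper, which does its own batching through its `chunk_size` argument.

```python
import gzip
import itertools
import json
from typing import Iterable, Iterator, List


def read_json_lines(path: str) -> Iterator[dict]:
    """Stream one JSON document per line from a gzip file without loading it all into memory."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():  # ignore blank lines
                yield json.loads(line)


def batched(docs: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Group a stream of documents into lists of at most `size` items."""
    iterator = iter(docs)
    while True:
        batch = list(itertools.islice(iterator, size))
        if not batch:
            return
        yield batch
```

Each yielded batch corresponds to one bulk request, so memory use stays bounded by the batch size rather than by the size of the source file.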

Docker Compose YAML

version: "3.8"
services:
  opensearch:
    image: opensearchproject/opensearch:3.4.0
    container_name: tayberry-search
    environment:
      - cluster.name=openalex-cluster
      - node.name=tayberry-node
      - discovery.type=single-node
      - bootstrap.memory_lock=true
      - "OPENSEARCH_JAVA_OPTS=-Xms12g -Xmx12g"
      - DISABLE_INSTALL_DEMO_CONFIG=true
      - DISABLE_SECURITY_PLUGIN=true
    ulimits:
      memlock:
        soft: -1
        hard: -1
      nofile:
        soft: 65536
        hard: 65536
    volumes:
      - /mnt/openalex/opensearch_data:/usr/share/opensearch/data
    ports:
      - 9200:9200
    restart: unless-stopped
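The 12GB heap in OPENSEARCH_JAVA_OPTS is half of the VM's 24GB RAM, which leaves the other half for the operating system's file cache. The arithmetic can be sketched as below. The function names are hypothetical, and the 32GB ceiling is a commonly cited rule of thumb that is assumed here, not something stated on this page.

```python
def recommended_heap_gb(total_ram_gb: float, ceiling_gb: float = 32.0) -> float:
    """Return half of system RAM, capped at `ceiling_gb` (assumed rule of thumb)."""
    return min(total_ram_gb / 2, ceiling_gb)


def java_opts(heap_gb: int) -> str:
    """Build the JVM flags with equal minimum (-Xms) and maximum (-Xmx) heap."""
    return f"-Xms{heap_gb}g -Xmx{heap_gb}g"


# Tayberry has 24GB RAM, so the sketch lands on the same 12GB used in the compose file.
heap = int(recommended_heap_gb(24))
opts = java_opts(heap)
```

Setting -Xms and -Xmx to the same value avoids heap resizing while the node is under indexing load.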

🛠️ Python Environment & Watcher Setup

Installing Python & Virtual Environment

Since Tayberry is running a modern Linux (Ubuntu/Debian), Python 3 was already present, but we needed to install the pip package manager and the venv module to keep our project isolated.

# Update the system and install Python tools
sudo apt update
sudo apt install python3-pip python3-venv -y

⛏️Creating & Starting indexer_env

We created a virtual environment so that the OpenSearch libraries wouldn't interfere with the rest of the system.

  • Create the environment:
python3 -m venv ~/indexer_env
  • Start (Activate) the environment:
source ~/indexer_env/bin/activate

(We know it's working when we see (indexer_env) appear before the username in the terminal.)

  • Install the "Worker" libraries:
pip install opensearch-py tqdm orjson

Starting the Indexer (Background Mode)

Because 250 million records take days to process, we used screen to keep the script running even if you close your laptop or the SSH connection drops.

  • Open a named screen session:
screen -S openalex_push
  • Run the script inside the screen:
python3 /mnt/openalex/index_openalex.py
  • Hide the screen (Detach): Press Ctrl + A, then D.

Starting the Watcher (The Live Dashboard)

The "Watcher" is not a separate piece of software, but a clever use of the Linux watch command. It repeats the OpenSearch "count" request every 60 seconds so you can follow progress as the index grows.

  • To start the watcher:
watch -n 60 'curl -s "http://localhost:9200/openalex_works_2026/_count?pretty" | grep count'
  • Why it looks "frozen": It clears the screen and only updates the text every minute.
  • How to stop it: Press Ctrl + C to get your prompt back.
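Two count readings taken a known interval apart are enough to estimate the indexing rate and the time remaining. The following is a minimal sketch using only the standard library; the function names are hypothetical, and the target total is whatever figure you choose to plug in (the "over 250 million" number is approximate).

```python
import json
import urllib.request

COUNT_URL = "http://localhost:9200/openalex_works_2026/_count"


def fetch_count(url: str = COUNT_URL) -> int:
    """Read the document count from the _count endpoint (requires the node to be running)."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)["count"]


def docs_per_second(earlier: int, later: int, elapsed_seconds: float) -> float:
    """Indexing rate between two count readings."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return (later - earlier) / elapsed_seconds


def eta_hours(current: int, target: int, rate: float) -> float:
    """Hours left to reach `target` at the given rate; infinite if nothing is moving."""
    if rate <= 0:
        return float("inf")
    return max(target - current, 0) / rate / 3600
```

For example, if the count rises from 1,000,000 to 1,060,000 over one 60-second interval, the rate is 1,000 documents per second. The count only includes documents that have been refreshed into the index, so short-term readings can lag behind what the indexer has actually sent.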

📋 Summary of Commands

Action                   Command
Enter Env                source ~/indexer_env/bin/activate
Back to Indexer          screen -r openalex_push
Exit Indexer (Safely)    Ctrl + A, then D
Check Progress           curl -s "http://localhost:9200/openalex_works_2026/_count?pretty"
Force Update             curl -X POST "http://localhost:9200/openalex_works_2026/_refresh"

The Data Ingestion Script

This Python script bridges the gap between the raw .gz source files and the database.

import gzip
import os

import orjson
from opensearchpy import OpenSearch, helpers
from concurrent.futures import ProcessPoolExecutor
from tqdm import tqdm

# Settings
SOURCE_DIR = "/mnt/openalex/v2026/source_data/works"
INDEX_NAME = "openalex_works_2026"
THREADS = 4  # Worker processes (ProcessPoolExecutor uses processes, not threads); tuned for the 8-core Ryzen 5 allocation

client = OpenSearch(["http://localhost:9200"], timeout=300)

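def get_actions(file_path):
    # Stream the gzip file line by line; each line is one JSON-encoded work.
    with gzip.open(file_path, 'rb') as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines, which orjson cannot parse
            doc = orjson.loads(line)
            yield {
                "_index": INDEX_NAME,
                # OpenAlex IDs are URLs (e.g. https://openalex.org/W2741809807);
                # the last path segment becomes the document ID.
                "_id": doc["id"].split("/")[-1],
                "_source": doc
            }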

def process_file(file_path):
    actions = get_actions(file_path)
    helpers.bulk(client, actions, chunk_size=500, request_timeout=300)

def main():
    all_files = sorted([os.path.join(r, f) for r, d, fs in os.walk(SOURCE_DIR) for f in fs if f.endswith(".gz")])
    # Resume logic: skip the first 180 files (in sorted order), which an earlier run already indexed
    files_to_index = all_files[180:]
     
    with ProcessPoolExecutor(max_workers=THREADS) as executor:
        list(tqdm(executor.map(process_file, files_to_index), total=len(files_to_index)))

if __name__ == "__main__":
    main()
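The hard-coded `all_files[180:]` slice has to be edited by hand after every interruption. One possible alternative, sketched here and not part of the script above, is a checkpoint file that lists every source file that finished. The path and the function names are hypothetical.

```python
import os
from typing import List, Set

CHECKPOINT_PATH = "/mnt/openalex/indexed_files.txt"  # hypothetical location


def load_done(checkpoint_path: str) -> Set[str]:
    """Return the set of file paths recorded as finished (empty if no checkpoint yet)."""
    if not os.path.exists(checkpoint_path):
        return set()
    with open(checkpoint_path, encoding="utf-8") as handle:
        return {line.strip() for line in handle if line.strip()}


def mark_done(checkpoint_path: str, file_path: str) -> None:
    """Append one finished file to the checkpoint."""
    with open(checkpoint_path, "a", encoding="utf-8") as handle:
        handle.write(file_path + "\n")


def pending_files(all_files: List[str], checkpoint_path: str) -> List[str]:
    """Keep the original order but drop files that are already recorded as finished."""
    done = load_done(checkpoint_path)
    return [path for path in all_files if path not in done]
```

In this sketch, main() would build its work list with pending_files(all_files, CHECKPOINT_PATH) and call mark_done in the parent process as each worker finishes a file. A file that was only partly indexed before a crash is simply processed again; because each document's _id is derived from its OpenAlex ID, re-indexing overwrites the existing documents instead of creating duplicates.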

How to Start & Monitor

  • Start Engine: Use the Dockge UI or docker compose up -d.
  • Start Indexer:
screen -S indexer
source ~/indexer_env/bin/activate
python3 /mnt/openalex/index_openalex.py
  • The Webpage: OpenSearch doesn't have a "built-in" search bar, so we use the API or a dashboard:
    • Check Health: http://<TAYBERRY_IP>:9200
    • Check Count: http://<TAYBERRY_IP>:9200/openalex_works_2026/_count

How the tool will be used

Once the 250M papers are in, we will use this tool for:

  • Academic Discovery: Finding every paper ever written on a specific niche topic.
  • Trend Analysis: Seeing how research topics (like "AI" or "Graphene") have grown over decades.
  • Local Knowledge Base: Connecting this data to your local AI (like AnythingLLM) so it can cite real papers when answering your questions.
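As a rough illustration of the trend-analysis use case, the sketch below builds an OpenSearch request that matches a topic in the title and counts hits per publication year. It assumes the indexed documents keep the OpenAlex field names `title` and `publication_year` and that `publication_year` was mapped as a number; the function names are hypothetical.

```python
from typing import Dict


def trend_query(topic: str) -> dict:
    """Request body: match `topic` in the title, bucket the hits by publication year."""
    return {
        "size": 0,  # only the aggregation is needed, not the matching documents
        "query": {"match": {"title": topic}},
        "aggs": {
            "per_year": {
                "terms": {
                    "field": "publication_year",
                    "size": 200,
                    "order": {"_key": "asc"},
                }
            }
        },
    }


def yearly_counts(response: dict) -> Dict[int, int]:
    """Turn the aggregation part of a search response into {year: number_of_works}."""
    buckets = response["aggregations"]["per_year"]["buckets"]
    return {bucket["key"]: bucket["doc_count"] for bucket in buckets}
```

With the opensearch-py client from the ingestion script, the body could be sent with client.search(index="openalex_works_2026", body=trend_query("graphene")) and the result passed to yearly_counts.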