OpenAlex

From Sea of Fate

📖 Introduction

OpenAlex represents a significant milestone in the democratization of scholarly data, functioning as a fully open and massive catalog of the world's academic research. Named after the ancient Library of Alexandria, OpenAlex aims to provide a comprehensive, linked index of scientific papers, authors, institutional affiliations, and funding sources without the restrictive paywalls associated with traditional proprietary bibliometric services. By capturing hundreds of millions of entities and the billions of connections between them, the platform allows researchers to map the global landscape of human knowledge. It serves as a vital resource for bibliometrics, the study of science itself, and the development of discovery tools that ensure academic information remains a public good rather than a siloed asset.

The technical journey of hosting and querying this vast dataset on local infrastructure has evolved significantly to meet the demands of such a high-volume repository. Initially, the implementation relied on OpenSearch to handle the heavy lifting of indexing and searching the data. OpenSearch is a community-driven, open-source search and analytics suite that famously began as an offshoot of Elasticsearch. While OpenSearch provided powerful full-text search capabilities and a familiar API for navigating the complex web of citations and metadata, it brought with it the substantial overhead characteristic of the Java Virtual Machine. The JVM requirements for a dataset of this scale are intensive, often demanding significant memory allocation and constant tuning to maintain performance. For a localized environment, these system requirements proved to be a heavy burden on hardware resources that could be utilized more efficiently elsewhere.

In response to these hardware demands, the workflow has transitioned toward the use of DuckDB. This move away from a persistent, JVM-heavy cluster to an in-process analytical database represents a shift toward speed and portability. DuckDB allows for high-performance SQL queries directly against compressed data formats like Parquet without the need for a standing background service. This transition significantly lowers the barrier for complex data analysis, enabling the processing of hundreds of millions of rows with a much smaller footprint. By leveraging the columnar storage of Parquet files, the system can now perform massive joins and aggregations across the entire OpenAlex snapshot with a level of agility that was previously difficult to achieve under the traditional search engine architecture.
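As a rough illustration of the in-process model described above, the sketch below counts works per publication year straight from a Parquet file. The file name `works.parquet` and the `publication_year` column are assumptions about how the local conversion laid out the data, so check the real schema first. The `duckdb` package is imported lazily so that the query builder can be inspected without it.

```python
def build_yearly_count_query(parquet_path: str, since_year: int) -> str:
    """Build a SQL query that counts rows per publication year.

    Assumes the Parquet file has a `publication_year` column (hypothetical;
    confirm with `DESCRIBE SELECT * FROM 'works.parquet'`).
    """
    # Double any single quotes so the path is safe inside a SQL string literal.
    safe_path = parquet_path.replace("'", "''")
    return (
        "SELECT publication_year, count(*) AS n_works "
        f"FROM read_parquet('{safe_path}') "
        f"WHERE publication_year >= {int(since_year)} "
        "GROUP BY publication_year "
        "ORDER BY publication_year"
    )


def run_yearly_counts(parquet_path: str, since_year: int = 2020):
    """Run the query in-process; no server or background service is involved."""
    import duckdb  # imported here so the builder works without the package

    con = duckdb.connect()  # in-memory connection, nothing is persisted
    try:
        query = build_yearly_count_query(parquet_path, since_year)
        return con.execute(query).fetchall()
    finally:
        con.close()
```

Calling `run_yearly_counts("works.parquet")` from the directory that holds the file should return a list of `(year, count)` tuples. Because DuckDB reads only the columns a query touches, an aggregation like this does not need to load the full file into memory.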

One of the most compelling applications for this local OpenAlex instance is its role in the verification and enhancement of other massive open-knowledge projects. Specifically, this database provides a robust foundation for verifying the accuracy of academic citations within Wikipedia ZIM files and similar offline knowledge archives. By cross-referencing the metadata stored in OpenAlex with the references found in these snapshots, it becomes possible to ensure that the scientific claims made in public encyclopedias are backed by verifiable, indexed research. This capability transforms OpenAlex from a simple list of papers into a critical tool for knowledge integrity, ensuring that as information is distributed globally through offline and open-source formats, it remains grounded in the verified record of global scholarship.
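The cross-referencing step could look roughly like the sketch below, which matches on DOIs. It assumes the reference strings have already been pulled out of the ZIM articles (that extraction is not shown) and that `known_dois` is a set of bare lowercase DOIs loaded from the local OpenAlex data. OpenAlex typically records DOIs as full `https://doi.org/...` URLs, so both sides are normalized to the bare form before comparison. All function names here are hypothetical.

```python
import re
from typing import Optional

# Matches the DOI core: "10.", a 4-9 digit registrant code, "/", then a suffix.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/\S+")


def normalize_doi(raw: str) -> Optional[str]:
    """Return a bare lowercase DOI such as '10.1000/xyz', or None if none is found."""
    match = DOI_PATTERN.search(raw.strip())
    if match is None:
        return None
    # Trailing punctuation often sticks to DOIs copied out of reference text.
    return match.group(0).rstrip(".,;)]}>\"'").lower()


def check_citations(cited: list, known_dois: set) -> dict:
    """Split cited reference strings into verified, unverified, and unparseable."""
    report = {"verified": [], "unverified": [], "unparseable": []}
    for raw in cited:
        doi = normalize_doi(raw)
        if doi is None:
            report["unparseable"].append(raw)
        elif doi in known_dois:
            report["verified"].append(doi)
        else:
            report["unverified"].append(doi)
    return report
```

An "unverified" result only means the DOI was not found in the local snapshot; it does not by itself show that the citation is wrong.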

OpenAlex Project: Data Recovery & Rebuild Strategy

Date: March 2026 | Server: Tayberry | Snapshot: Jan 2026

Data Usage & Dependency Map

Directory Path | What it Powers Right Now | Consequence of Deletion
/v2026/parquet_data/ | DuckDB Database. Powers current analytical scripts. | FUNCTIONAL LOSS: You can no longer use DuckDB. Re-extraction from source_data takes 4+ days.
/v2026/source_data/ | Original JSON. The "Gold Copy" insurance policy. | REBUILD IMPOSSIBLE: Cannot fix corrupted Parquet, move to new DBs, or revert to OpenSearch. AWS may rotate this snapshot out.
/opensearch_data/ | Live Search Engine. Powers Docker API and browse.html. | ENGINE FAILURE: OpenSearch stops. Re-indexing takes 10+ days. If HDD is full and you only use DuckDB, purge this first.
/legacy-data/ | History (2023-2025). Data as it appeared in the past. | LOSS OF HISTORY: Independent of 2026. Deleting this will NOT break the 2026 DB, but you lose "past vs. present" comparison ability.

"What If" Recovery & Transfer Scenarios

Your Goal | What You MUST Keep | The Process / Steps
I want to rebuild Parquet | /v2026/source_data/ | Run python3 convert_to_parquet.py.
I want to use a NEW DB | /v2026/source_data/ | Point the new DB importer (ClickHouse/Postgres) at the raw JSON in source_data.
I want to revert to OpenSearch | /v2026/source_data/ | Run python3 index_openalex.py. Note: Takes 10+ days.
I want to see 2024 data | /legacy-data/ | Point scripts at /legacy-data/. 2026 files do NOT contain 2024 snapshot records.
I want to use DuckDB on another host | /v2026/parquet_data/ | 1. scp /mnt/openalex/v2026/parquet_data/* user@host:~/data/ 2. pip install duckdb 3. Verify (run from ~/data on the new host) with python3 -c "import duckdb; print(duckdb.query(\"SELECT count(*) FROM 'authors.parquet'\").fetchone()[0])"
I want to move data SAFELY (rsync) | /v2026/parquet_data/ | rsync -avP /mnt/openalex/v2026/parquet_data/ user@remote:~/dest/ (-P allows resume if connection drops.)
I want to compress to a single file | /v2026/parquet_data/ | tar -cvzf openalex_parquet_backup.tar.gz /mnt/openalex/v2026/parquet_data/
I want to consolidate to a single .duckdb file | /v2026/parquet_data/ | duckdb openalex_full.duckdb "CREATE TABLE works AS SELECT * FROM 'works.parquet'; CREATE TABLE authors AS SELECT * FROM 'authors.parquet';"
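For the transfer scenarios above, it may be worth confirming that the files arrived intact before deleting anything on the source. The following is a minimal sketch, not part of the existing tooling: it builds a SHA-256 manifest of every `.parquet` file under a directory so that a manifest from the source can be compared with one built on the destination. The function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so multi-gigabyte Parquet files are never fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(directory: str) -> dict:
    """Map each .parquet file (path relative to `directory`) to its SHA-256 digest."""
    root = Path(directory)
    return {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*.parquet"))
    }


def diff_manifests(source: dict, dest: dict) -> list:
    """List files that are missing or differ on the destination."""
    return sorted(name for name, digest in source.items() if dest.get(name) != digest)
```

Run `build_manifest` on each host, copy one result across (for example as JSON), and treat an empty list from `diff_manifests` as the signal that the copy matches.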

Storage Optimization Guide

Purge Type | Directory | Effect
Safe Purge | /v2026/snapshot_raw/ | Frees ~50GB with zero impact on functionality.
Time Traveler Purge | /legacy-data/ | Saves ~400GB. 2026 remains functional; the "past" is gone.
DuckDB Committed Purge | /opensearch_data/ | Saves ~800GB. API/Web search dies; DuckDB is unaffected.
Parquet-Only Gamble | /v2026/source_data/ | Saves ~350GB. WARNING: High risk. No way to fix corrupted Parquet without a massive new AWS download.

Final Warning: Deleting source_data makes Parquet your ONLY copy of the Jan 2026 record.
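Before committing to any purge, the approximate savings listed above can be checked against what is actually on disk. The sketch below is one way to do that with the standard library only. It assumes the directories in the table live under `/mnt/openalex`, which the rsync and Docker volume paths suggest but which is not confirmed for `legacy-data`.

```python
import os


def directory_size_bytes(path: str) -> int:
    """Sum the sizes of all regular files under `path` (symlinks are skipped)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(path):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total


def format_gb(num_bytes: int) -> str:
    """Render a byte count in decimal gigabytes, like the ~GB figures in the guide."""
    return f"{num_bytes / 1e9:.1f} GB"


def purge_report(base: str, candidates: list) -> list:
    """Return (directory, size) pairs for each purge candidate that exists."""
    report = []
    for rel in candidates:
        full = os.path.join(base, rel)
        if os.path.isdir(full):
            report.append((rel, format_gb(directory_size_bytes(full))))
    return report
```

A call such as `purge_report("/mnt/openalex", ["v2026/snapshot_raw", "legacy-data", "opensearch_data", "v2026/source_data"])` only reads file sizes; it never deletes anything. Walking a multi-terabyte tree can take a while.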

🏗️ The OpenAlex Scholar Engine

OpenAlex is a massive, open-source index of the world's scholarly research. It contains over 250 million "Works" (papers, books, etc.), along with millions of authors and institutions.

  • The Data Structure: More than just a list, OpenAlex is a heterogeneous directed graph. This means it maps complex relationships between different types of entities—linking authors to their papers, papers to their citations, and institutions to their researchers. This structure makes it an incredibly powerful tool for tracking the influence and evolution of scientific thought.
  • The Goal: To have a local, lightning-fast search engine on Tayberry that allows you to query this entire dataset without relying on public API limits.
  • The Engine: We use OpenSearch, a high-performance "big data" engine designed for complex filtering and full-text search.
  • Synergy: OpenAlex serves as a deep data archive that works alongside The Kiwix Archive (offline Wikipedia/StackOverflow) and Archive Box to create a complete local research library.
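To make the "heterogeneous directed graph" idea concrete, here is a toy in-memory model. It is only an illustration of the structure, not how OpenAlex or the local engine stores data. The relation names (`authored`, `cites`, `affiliated_with`) are invented for the example; the `W`, `A`, and `I` prefixes mirror the way OpenAlex IDs distinguish works, authors, and institutions.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class ScholarGraph:
    """Tiny model of a typed, directed graph of scholarly entities."""

    node_types: dict = field(default_factory=dict)  # node id -> entity type
    # (source id, relation) -> set of destination ids
    edges: dict = field(default_factory=lambda: defaultdict(set))

    def add_node(self, node_id: str, node_type: str) -> None:
        self.node_types[node_id] = node_type

    def add_edge(self, src: str, relation: str, dst: str) -> None:
        # Edges are directed: "W2 cites W1" does not imply "W1 cites W2".
        self.edges[(src, relation)].add(dst)

    def neighbors(self, src: str, relation: str) -> set:
        """Follow one relation outward from `src`."""
        return set(self.edges.get((src, relation), set()))

    def cited_by(self, work_id: str) -> set:
        """Reverse lookup: which works cite `work_id`?"""
        return {
            src
            for (src, rel), dsts in self.edges.items()
            if rel == "cites" and work_id in dsts
        }


graph = ScholarGraph()
graph.add_node("A1", "author")
graph.add_node("I1", "institution")
graph.add_node("W1", "work")
graph.add_node("W2", "work")
graph.add_edge("A1", "authored", "W1")
graph.add_edge("A1", "affiliated_with", "I1")
graph.add_edge("W2", "cites", "W1")
```

"Heterogeneous" refers to the mix of node types and relation types in one graph; "directed" is why `cited_by` needs a reverse scan rather than a plain neighbor lookup.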

💾 The Infrastructure (Tayberry VM)

Tayberry is a dedicated Virtual Machine on our Proxmox host (Ryzen 5).

  • OS: Debian.
  • Compute: 8 Cores (Host has 12) and 24GB RAM.
  • Storage: A dedicated 5TB XFS-formatted disk (/mnt/openalex). OpenAlex is ~300GB compressed but expands into multiple terabytes once indexed and searchable.

🐋 The Software Stack (Docker & Python)

🛠️ Installing Dockge

Dockge allows us to manage our "Stacks" (Docker Compose files) through a clean web interface.

# Preparation: Create directories
mkdir -p /opt/stacks /opt/dockge
cd /opt/dockge
# Download and Start Dockge
curl https://raw.githubusercontent.com/louislam/dockge/master/compose.yaml --output compose.yaml
docker compose up -d

🌐 Accessing and Using the Tools

Tool URL Purpose
Dockge UI http://tayberry:5001 Managing the Docker containers/stacks.
Database API http://tayberry:9200 The backend "brain" where the Python script sends data.
Web GUI http://tayberry:5601 OpenSearch Dashboards (Visual search and data exploration).

OpenSearch YAML (The Stack)

Create a new stack in Dockge, name it opensearch, and paste in the YAML below to deploy the engine and the Web GUI.

version: "3.8"
services:
  opensearch-node:
    # Use 'latest' to ensure we get the newest Lucene codecs
    image: opensearchproject/opensearch:latest 
    container_name: tayberry-search
    environment:
      - cluster.name=openalex-cluster
      - discovery.type=single-node
      - bootstrap.memory_lock=true
      - OPENSEARCH_JAVA_OPTS=-Xms12g -Xmx12g
      - DISABLE_INSTALL_DEMO_CONFIG=true
      - DISABLE_SECURITY_PLUGIN=true
    ulimits:
      memlock:
        soft: -1
        hard: -1
      nofile:
        soft: 65536
        hard: 65536
    volumes:
      - /mnt/openalex/opensearch_data:/usr/share/opensearch/data
    ports:
      - 9200:9200
    restart: unless-stopped
  dashboards:
    # Dashboards should also track latest for compatibility
    image: opensearchproject/opensearch-dashboards:latest
    container_name: tayberry-dashboards
    ports:
      - 5601:5601
    environment:
      - OPENSEARCH_HOSTS=["http://opensearch-node:9200"]
      - DISABLE_SECURITY_DASHBOARDS_PLUGIN=true
    depends_on:
      - opensearch-node
    restart: unless-stopped
networks:
  default:
    name: tayberry-net

🛠️ Python Environment & Watcher Setup

Creating the indexer_env

sudo apt update && sudo apt install python3-pip python3-venv screen -y
python3 -m venv ~/indexer_env
source ~/indexer_env/bin/activate
pip install opensearch-py tqdm orjson

Starting the Indexer (Background Mode)

Use screen to ensure the process continues even if you log out:

screen -S openalex_push

  1. Inside the screen:

source ~/indexer_env/bin/activate
python3 /mnt/openalex/index_openalex.py

  2. To Detach: Press Ctrl + A, then D

📊 Monitoring: The Watcher vs. The Web GUI

You have two ways to see if the engine is working:

  • The Watcher (Terminal): A real-time command-line dashboard.
watch -n 60 'curl -s "http://localhost:9200/openalex_works_2026/_count?pretty" | grep count'

Note: If the count stays frozen, the engine is in "High Gear" (buffering). Press Ctrl+C to exit the watcher; this only stops the watch command, not the indexer running in the screen session.

  • OpenSearch Dashboards (Web GUI): Navigate to http://tayberry:5601. You can use the Dev Tools tab to run queries or the Discover tab to see visual histograms of the papers as they arrive.

📜 check_progress.sh (The History Log)

Create this to track progress over several days:

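#!/bin/bash
# Save as ~/indexer_env/check_progress.sh (no sudo needed), then: chmod +x ~/indexer_env/check_progress.sh
# Each run appends one timestamped document count to the log.
echo "$(date): $(curl -s 'localhost:9200/openalex_works_2026/_count?pretty' | grep count)" >> /mnt/openalex/indexing_log.txt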
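Two readings of the count taken some hours apart are enough for a rough estimate of how long the remaining indexing will take. The sketch below shows that arithmetic. It is an illustrative helper, not part of the existing scripts, and the 250,000,000 default target is simply the approximate works count quoted on this page.

```python
import json


def parse_count(response_body: str) -> int:
    """Extract the document count from a `_count` API response body."""
    return int(json.loads(response_body)["count"])


def docs_per_hour(first: tuple, last: tuple) -> float:
    """Average indexing rate between two (unix_seconds, count) samples."""
    (t0, c0), (t1, c1) = first, last
    if t1 <= t0:
        raise ValueError("samples must be in chronological order")
    return (c1 - c0) / ((t1 - t0) / 3600.0)


def hours_remaining(current: int, rate_per_hour: float, target: int = 250_000_000) -> float:
    """Estimate hours left to reach `target` documents at the observed rate."""
    if rate_per_hour <= 0:
        return float("inf")  # stalled or buffering: no meaningful estimate
    return max(target - current, 0) / rate_per_hour
```

For example, going from 0 to 2,000,000 documents in two hours gives a rate of 1,000,000 per hour, so an index sitting at 50,000,000 would need about 200 more hours. The response body can be fetched with `urllib.request.urlopen("http://localhost:9200/openalex_works_2026/_count").read()`.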

How to Start & Monitor

  • Start Engine: Use the Dockge UI or docker compose up -d.
  • Start Indexer: Run screen -S indexer, then inside the session run source ~/indexer_env/bin/activate followed by python3 /mnt/openalex/index_openalex.py
  • The Webpage: OpenSearch doesn't have a "built-in" search bar, so we use the API or a dashboard:
    • Check Health: http://<TAYBERRY_IP>:9200
    • Check Count: http://<TAYBERRY_IP>:9200/openalex_works_2026/_count
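The health check can also be scripted. The sketch below calls the standard `_cluster/health` endpoint and turns the reported status into a one-line summary. The base URL is an assumption (it matches the port mapping in the stack above), and the wording of the messages is illustrative only.

```python
import json
import urllib.request


def summarize_health(body: str) -> str:
    """Turn a `_cluster/health` JSON body into a short human-readable line."""
    health = json.loads(body)
    status = health.get("status", "unknown")
    unassigned = health.get("unassigned_shards", 0)
    if status == "green":
        return "OK: all shards are allocated"
    if status == "yellow":
        return f"WARN: primaries are up, {unassigned} replica shard(s) unassigned"
    if status == "red":
        return "FAIL: at least one primary shard is unavailable"
    return f"UNKNOWN status: {status}"


def fetch_health(base_url: str = "http://localhost:9200", timeout: float = 5.0) -> str:
    """Query the running node and summarize its health (requires the engine to be up)."""
    url = f"{base_url}/_cluster/health"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return summarize_health(resp.read().decode("utf-8"))
```

On a single-node setup like this one, a yellow status is common whenever an index is configured with replicas, since there is no second node to hold them; searches still work in that state.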

🚀 How the tool will be used

Once the 250M papers are ingested:

  • Trend Analysis: Tracking how specific technologies (like "Graphene") have evolved.
  • Local AI Integration: Connecting Tayberry to AnythingLLM on Blackberry. Your local AI will query Tayberry to cite real scientific papers in its answers, which reduces the risk of hallucinated references.
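A trend query of the kind described above might be built as in the following sketch. It only constructs the search body and parses the response, so it makes no claim about how the indexer mapped the data: the `title` and `publication_year` field names are assumptions borrowed from the OpenAlex work schema and must match whatever index_openalex.py actually created.

```python
def build_trend_query(term: str, field: str = "title",
                      year_field: str = "publication_year") -> dict:
    """Build a search body that counts matching works per year.

    `field` and `year_field` are assumed names; adjust them to the real mapping.
    """
    return {
        "size": 0,  # only the aggregation is needed, not the hits themselves
        "query": {"match": {field: term}},
        "aggs": {
            "works_per_year": {
                "terms": {"field": year_field, "size": 100, "order": {"_key": "asc"}}
            }
        },
    }


def extract_trend(response: dict) -> list:
    """Turn the aggregation buckets of a search response into (year, count) pairs."""
    buckets = response["aggregations"]["works_per_year"]["buckets"]
    return [(int(b["key"]), int(b["doc_count"])) for b in buckets]
```

With opensearch-py, the body could be sent as `client.search(index="openalex_works_2026", body=build_trend_query("graphene"))` and the result passed to `extract_trend`. The `size` of 100 caps the number of year buckets returned, so a wider historical range would need a larger value.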


OpenAlex: A fully-open index of scholarly works

This video explains the technical scope and data richness of OpenAlex, helping you understand the scale of the 250M records you are currently indexing.


DuckDB and Parquet