OpenAlex
Revision as of 04:37, 10 March 2026
Introduction
The OpenAlex Scholar Engine
OpenAlex is a massive, open-source index of the world's scholarly research. It contains over 250 million "Works" (papers, books, etc.), along with millions of authors and institutions.
- The Data Structure: More than just a list, OpenAlex is a heterogeneous directed graph. This means it maps complex relationships between different types of entities, linking authors to their papers, papers to their citations, and institutions to their researchers. This structure makes it a powerful tool for tracking the influence and evolution of scientific thought.
- The Goal: To have a local, lightning-fast search engine on Tayberry that allows you to query this entire dataset without relying on public API limits.
- The Engine: We use OpenSearch, a high-performance "big data" engine designed for complex filtering and full-text search.
- Synergy: OpenAlex serves as a deep data archive that works alongside The Kiwix Archive (offline Wikipedia/StackOverflow) and Archive Box to create a complete local research library.
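To make the "heterogeneous directed graph" idea concrete, here is a minimal sketch in plain Python. The node IDs, the edge labels (`authored`, `cites`, `affiliated_with`), and the `Graph` class are illustrative assumptions, not OpenAlex's actual schema.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Graph:
    """Tiny typed, directed graph: nodes have a kind, edges have a label."""
    kinds: dict = field(default_factory=dict)  # node id -> kind
    edges: dict = field(default_factory=lambda: defaultdict(list))  # (src, label) -> [dst]

    def add_node(self, node_id, kind):
        self.kinds[node_id] = kind

    def add_edge(self, src, label, dst):
        self.edges[(src, label)].append(dst)

    def neighbors(self, src, label):
        # .get() avoids creating empty entries for unknown keys
        return list(self.edges.get((src, label), []))


def build_example():
    """Build a toy graph with one author, one institution, and two works."""
    g = Graph()
    g.add_node("A1", "author")
    g.add_node("I1", "institution")
    g.add_node("W1", "work")
    g.add_node("W2", "work")
    g.add_edge("A1", "authored", "W1")
    g.add_edge("A1", "affiliated_with", "I1")
    g.add_edge("W1", "cites", "W2")
    return g


def cited_by_author(g, author_id):
    """Works cited by any work the author wrote (a two-hop traversal)."""
    cited = []
    for work in g.neighbors(author_id, "authored"):
        cited.extend(g.neighbors(work, "cites"))
    return cited
```

Different node kinds and labelled edges are what make the graph "heterogeneous"; the two-hop lookup is the sort of question a flat list of papers cannot answer directly.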
The Infrastructure (Tayberry VM)
Tayberry is a dedicated Virtual Machine on our Proxmox host (Ryzen 5).
- OS: Debian.
- Compute: 8 Cores (Host has 12) and 24GB RAM.
- Storage: A dedicated 5TB XFS-formatted disk (/mnt/openalex). OpenAlex is ~300GB compressed but expands into multiple terabytes once indexed and searchable.
The Software Stack (Docker & Python)
Installing Dockge
Dockge allows us to manage our "Stacks" (Docker Compose files) through a clean web interface.
# Preparation: Create directories
mkdir -p /opt/stacks /opt/dockge
cd /opt/dockge
# Download and Start Dockge
curl https://raw.githubusercontent.com/louislam/dockge/master/compose.yaml --output compose.yaml
docker compose up -d
Accessing and Using the Tools
| Tool | URL | Purpose |
|---|---|---|
| Dockge UI | http://tayberry:5001 | Managing the Docker containers/stacks. |
| Database API | http://tayberry:9200 | The backend "brain" where the Python script sends data. |
| Web GUI | http://tayberry:5601 | OpenSearch Dashboards (Visual search and data exploration). |
OpenSearch YAML (The Stack)
Paste this into Dockge to deploy the engine and the Web GUI, and name the stack opensearch.
version: "3.8"
services:
  opensearch-node:
    # Use 'latest' to ensure we get the newest Lucene codecs
    image: opensearchproject/opensearch:latest
    container_name: tayberry-search
    environment:
      - cluster.name=openalex-cluster
      - discovery.type=single-node
      - bootstrap.memory_lock=true
      - OPENSEARCH_JAVA_OPTS=-Xms12g -Xmx12g
      - DISABLE_INSTALL_DEMO_CONFIG=true
      - DISABLE_SECURITY_PLUGIN=true
    ulimits:
      memlock:
        soft: -1
        hard: -1
      nofile:
        soft: 65536
        hard: 65536
    volumes:
      - /mnt/openalex/opensearch_data:/usr/share/opensearch/data
    ports:
      - 9200:9200
    restart: unless-stopped
  dashboards:
    # Dashboards should also track latest for compatibility
    image: opensearchproject/opensearch-dashboards:latest
    container_name: tayberry-dashboards
    ports:
      - 5601:5601
    environment:
      - OPENSEARCH_HOSTS=["http://opensearch-node:9200"]
      - DISABLE_SECURITY_DASHBOARDS_PLUGIN=true
    depends_on:
      - opensearch-node
    restart: unless-stopped
networks:
  default:
    name: tayberry-net
Python Environment & Watcher Setup
Creating the indexer_env
sudo apt update && sudo apt install python3-pip python3-venv -y
python3 -m venv ~/indexer_env
source ~/indexer_env/bin/activate
pip install opensearch-py tqdm orjson
Starting the Indexer (Background Mode)
Use screen to ensure the process continues even if you log out:
screen -S openalex_push
- Inside the screen:
source ~/indexer_env/bin/activate
python3 /mnt/openalex/index_openalex.py
- To Detach: Press Ctrl + A, then D
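The indexer script itself is not shown on this page. As a rough sketch of the core step such a script needs, the function below turns OpenAlex-style JSON Lines into bulk actions. The `to_actions` name, the reliance on an `id` field, and the sample records are assumptions for illustration. The dict shape (`_index`, `_id`, `_source`) is the one opensearch-py's `helpers.bulk` accepts, and the index name is the one queried by the count checks on this page.

```python
import json

INDEX_NAME = "openalex_works_2026"  # index name used by the count checks on this page


def to_actions(lines, index=INDEX_NAME):
    """Yield one bulk action per non-empty JSON line.

    Assumes each record carries an "id" field (an assumption here) that can
    serve as the document ID, so re-running the indexer overwrites documents
    instead of duplicating them.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        yield {"_index": index, "_id": record["id"], "_source": record}


if __name__ == "__main__":
    # Hypothetical sample records, not real OpenAlex data.
    sample = [
        '{"id": "W1", "title": "Graphene basics", "publication_year": 2010}',
        "",
        '{"id": "W2", "title": "Graphene sensors", "publication_year": 2018}',
    ]
    for action in to_actions(sample):
        print(action["_id"], action["_source"]["title"])
```

With opensearch-py installed, a generator like this could be handed to `helpers.bulk(client, to_actions(f))`, and `orjson.loads` could stand in for `json.loads` if parsing speed matters. Because the function is a generator, it never holds the whole file in memory.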
Monitoring: The Watcher vs. The Web GUI
You have two ways to see if the engine is working:
- The Watcher (Terminal): A real-time command-line dashboard.
watch -n 60 'curl -s "http://localhost:9200/openalex_works_2026/_count?pretty" | grep count'
Note: If the count stays frozen, the engine is in "High Gear" (buffering). Press Ctrl+C to stop the watcher.
- OpenSearch Dashboards (Web GUI): Navigate to http://tayberry:5601. You can use the Dev Tools tab to run queries or the Discover tab to see visual histograms of the papers as they arrive.
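The same count check can be done from Python, which avoids the `grep`. This sketch uses only the standard library; `parse_count` and `fetch_count` are hypothetical helper names, and the URL assumes the API is reachable on localhost:9200 as in the watcher command.

```python
import json
from urllib.request import urlopen

COUNT_URL = "http://localhost:9200/openalex_works_2026/_count"


def parse_count(body):
    """Extract the document count from a _count response body (str or bytes)."""
    return int(json.loads(body)["count"])


def fetch_count(url=COUNT_URL, timeout=10):
    """Query the _count endpoint and return the number of indexed documents."""
    with urlopen(url, timeout=timeout) as resp:
        return parse_count(resp.read())
```

Calling `fetch_count()` on Tayberry should return the current total as an integer, which is easier to log or compare than the raw JSON.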
check_progress.sh (The History Log)
Create this to track progress over several days:
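#!/bin/bash
# sudo nano ~/indexer_env/check_progress.sh
# Append a timestamped document count to the history log.
# The URL is quoted so the shell does not treat "?" as a glob character.
echo "$(date): $(curl -s "localhost:9200/openalex_works_2026/_count?pretty" | grep count)" >> /mnt/openalex/indexing_log.txt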
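Once the log has a few entries, two samples are enough for a rough ETA. The sketch below is a simple linear extrapolation; the function name, the sample numbers, and the 250 million target (taken from the approximate dataset size quoted on this page) are assumptions, and real indexing speed will vary.

```python
from datetime import datetime, timedelta


def estimate_finish(t0, count0, t1, count1, target=250_000_000):
    """Linearly extrapolate when `target` documents will be indexed.

    Returns None if no progress was made between the two samples.
    """
    elapsed = (t1 - t0).total_seconds()
    gained = count1 - count0
    if elapsed <= 0 or gained <= 0:
        return None
    rate = gained / elapsed  # documents per second
    remaining = max(target - count1, 0)
    return t1 + timedelta(seconds=remaining / rate)


if __name__ == "__main__":
    start = datetime(2026, 3, 10, 4, 0)
    later = datetime(2026, 3, 10, 5, 0)
    # Hypothetical samples: 10M docs at 04:00, 15M docs at 05:00.
    print(estimate_finish(start, 10_000_000, later, 15_000_000))
```

A `None` result is a cue to check whether the indexer is still running rather than a sign that the estimate failed.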
How to Start & Monitor
- Start Engine: Use the Dockge UI or docker compose up -d.
- Start Indexer:
```bash
screen -S indexer
source ~/indexer_env/bin/activate
python3 /mnt/openalex/index_openalex.py
```
- The Webpage: OpenSearch doesn't have a "built-in" search bar, so we use the API or a dashboard:
  - Check Health: http://<TAYBERRY_IP>:9200
  - Check Count: http://<TAYBERRY_IP>:9200/openalex_works_2026/_count
How the tool will be used
Once the 250M papers are ingested:
- Trend Analysis: Tracking how specific technologies (like "Graphene") have evolved.
- Local AI Integration: Connecting Tayberry to AnythingLLM on Blackberry. Your local AI will query Tayberry to cite real scientific papers in its answers, which helps reduce hallucinations.
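For the trend-analysis use case, a query along these lines could count matching papers per year. This is a sketch: it assumes the indexed documents keep OpenAlex's `title` and `publication_year` fields (with the year mapped as a number), and `trend_query` and `buckets_to_series` are hypothetical helpers. The body uses the standard OpenSearch query DSL: a `match` query plus a `terms` aggregation.

```python
import json


def trend_query(term, max_years=100):
    """Build a search body counting works per year whose title matches `term`."""
    return {
        "size": 0,  # only the aggregation is needed, not the hits
        "query": {"match": {"title": term}},
        "aggs": {
            "per_year": {
                "terms": {
                    "field": "publication_year",
                    "size": max_years,
                    "order": {"_key": "asc"},  # oldest year first
                }
            }
        },
    }


def buckets_to_series(response):
    """Turn the aggregation part of a search response into (year, count) pairs."""
    buckets = response["aggregations"]["per_year"]["buckets"]
    return [(b["key"], b["doc_count"]) for b in buckets]


if __name__ == "__main__":
    print(json.dumps(trend_query("graphene"), indent=2))
```

The printed body can be pasted into the Dev Tools tab under `GET openalex_works_2026/_search`, or sent with opensearch-py's `client.search(index=..., body=...)`.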
OpenAlex: A fully-open index of scholarly works
This video explains the technical scope and data richness of OpenAlex, helping you understand the scale of the 250M records you are currently indexing.