OpenAlex
From Sea of Fate
Wikisailor (talk | contribs)
Revision as of 04:23, 9 February 2026
Introduction
OpenAlex is a massive, open-source index of the world's scholarly research. It contains over 250 million "Works" (papers, books, etc.), along with millions of authors and institutions.
- The Goal: To have a local, low-latency search engine on Tayberry that can query the entire 250M+ dataset without relying on the OpenAlex public API.
- The Engine: We use OpenSearch, a search engine built for large datasets that excels at full-text search and complex filtering.
OpenAlex will be another data archive that can be used alongside Kiwix and the ArchiveBox viewer, which we have yet to set up.
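As a rough illustration of the kind of local query this setup is meant to serve, the sketch below builds an OpenSearch query body for a title search restricted to a range of years. It assumes the index keeps the OpenAlex field names title and publication_year and relies on OpenSearch's default dynamic mapping; the helper name build_works_query is hypothetical and not part of the setup described here.

```python
import json


def build_works_query(text, year_from=None, year_to=None, size=10):
    """Build a search body: full-text match on title, optional year filter."""
    filters = []
    year_range = {}
    if year_from is not None:
        year_range["gte"] = year_from
    if year_to is not None:
        year_range["lte"] = year_to
    if year_range:
        # Filters do not affect scoring, they only narrow the result set.
        filters.append({"range": {"publication_year": year_range}})
    return {
        "size": size,
        "query": {
            "bool": {
                "must": [{"match": {"title": text}}],
                "filter": filters,
            }
        },
    }


# Example: works about graph neural networks published in 2020 or later.
body = build_works_query("graph neural networks", year_from=2020)
print(json.dumps(body, indent=2))
```

With opensearch-py, a body like this could be sent with client.search(index="openalex_works_2026", body=body) once the index has been populated.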
The Infrastructure (Tayberry VM)
We built a dedicated Virtual Machine named Tayberry to act as the "Search Host."
- OS: Ubuntu/Debian.
- Storage: A dedicated 5TB XFS-formatted disk (/mnt/openalex). This is crucial because the OpenAlex data is over 300GB compressed and expands significantly once indexed.
- Compute: 8 cores and 24GB RAM to handle the CPU- and memory-intensive indexing workload.
The Software Stack (Docker & Python)
Instead of installing complex software directly on the OS, we use Docker to keep it clean.
- Docker Container: Runs OpenSearch 3.4.0, configured with a 12GB JVM heap (half of the VM's 24GB RAM).
- Source Data: Located at /mnt/openalex/v2026/source_data/works. These are thousands of .gz files containing JSON data.
- The Indexer: A custom Python script (index_openalex.py) that acts as a "bridge." It decompresses the files, parses each JSON record, and pushes the records into OpenSearch in batches.
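The decompress, parse, and batch steps of the indexer can be tried in isolation, without a running OpenSearch instance. The sketch below is a minimal stand-in under a few assumptions: it uses the standard-library json module rather than orjson, the file name part_000.gz and the five records are made up, and the helper names read_json_lines and batched are hypothetical.

```python
import gzip
import json
import os
import tempfile
from itertools import islice


def read_json_lines(path):
    """Yield one parsed JSON document per non-empty line of a gzip file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


def batched(iterable, batch_size):
    """Group any iterable into lists of at most batch_size items."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


# Build a tiny stand-in for one source file (made-up records).
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "part_000.gz")
with gzip.open(sample_path, "wt", encoding="utf-8") as f:
    for i in range(5):
        record = {"id": f"https://openalex.org/W{i}", "title": f"Work {i}"}
        f.write(json.dumps(record) + "\n")

# Five records in batches of two give batches of 2, 2 and 1.
batch_sizes = [len(b) for b in batched(read_json_lines(sample_path), 2)]
print(batch_sizes)  # [2, 2, 1]
```

Because both helpers are generators, only one batch is held in memory at a time, which matters when a single source file contains a very large number of records.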
Docker Compose YAML
version: "3.8"
services:
  opensearch:
    image: opensearchproject/opensearch:3.4.0
    container_name: tayberry-search
    environment:
      - cluster.name=openalex-cluster
      - node.name=tayberry-node
      - discovery.type=single-node
      - bootstrap.memory_lock=true
      - "OPENSEARCH_JAVA_OPTS=-Xms12g -Xmx12g"
      - DISABLE_INSTALL_DEMO_CONFIG=true
      - DISABLE_SECURITY_PLUGIN=true
    ulimits:
      memlock:
        soft: -1
        hard: -1
      nofile:
        soft: 65536
        hard: 65536
    volumes:
      - /mnt/openalex/opensearch_data:/usr/share/opensearch/data
    ports:
      - 9200:9200
    restart: unless-stopped
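Before a large bulk load into a single-node container like the one above, it is common to create the index with replicas disabled and periodic refresh turned off, then restore the refresh interval once ingestion has finished. The sketch below only builds the request bodies and is an optional tuning idea, not part of the setup documented here. The shard count of 4 and the helper names are assumptions.

```python
def bulk_load_index_body(shards=4):
    """Index creation body tuned for the initial ingestion on one node."""
    return {
        "settings": {
            "index": {
                "number_of_shards": shards,  # assumed value, tune to the data size
                "number_of_replicas": 0,     # replicas would stay unassigned on a single node
                "refresh_interval": "-1",    # no periodic refresh while loading
            }
        }
    }


def post_load_settings():
    """Settings to apply once ingestion has finished (1s is the default interval)."""
    return {"index": {"refresh_interval": "1s"}}


create_body = bulk_load_index_body()
restore_body = post_load_settings()
print(create_body)
print(restore_body)
```

With opensearch-py, these bodies could be applied with client.indices.create(index="openalex_works_2026", body=create_body) before ingestion and client.indices.put_settings(index="openalex_works_2026", body=restore_body) afterwards.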
The Data Ingestion Script
This Python script bridges the gap between the raw .gz source files and the database.
import os, gzip, orjson
from opensearchpy import OpenSearch, helpers
from concurrent.futures import ProcessPoolExecutor
from tqdm import tqdm

# Settings
SOURCE_DIR = "/mnt/openalex/v2026/source_data/works"
INDEX_NAME = "openalex_works_2026"
THREADS = 4  # Number of worker processes; optimized for 8-core Ryzen 5 allocation

def get_actions(file_path):
    # Stream one bulk action per JSON line so a whole file never has to fit in memory
    with gzip.open(file_path, 'rb') as f:
        for line in f:
            doc = orjson.loads(line)
            yield {
                "_index": INDEX_NAME,
                # Use the last segment of the OpenAlex ID URL as the document ID
                "_id": doc.get("id").split("/")[-1],
                "_source": doc
            }

def process_file(file_path):
    # Each worker process opens its own connection instead of sharing one client across processes
    client = OpenSearch(["http://localhost:9200"], timeout=300)
    actions = get_actions(file_path)
    helpers.bulk(client, actions, chunk_size=500, request_timeout=300)

def main():
    all_files = sorted([os.path.join(r, f) for r, d, fs in os.walk(SOURCE_DIR) for f in fs if f.endswith(".gz")])
    # Resume logic: skip what we've already done (the first 180 files in sorted order)
    files_to_index = all_files[180:]
    with ProcessPoolExecutor(max_workers=THREADS) as executor:
        # Consume the iterator so tqdm advances as each file finishes
        list(tqdm(executor.map(process_file, files_to_index), total=len(files_to_index)))

if __name__ == "__main__":
    main()
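The resume logic in the script depends on a hard-coded offset, which has to be edited by hand after every interrupted run. One possible alternative, sketched here under assumptions, is a plain-text checkpoint that records each finished file. The checkpoint file name indexed_files.txt, the helper names, and the demo file names are all hypothetical.

```python
import os
import tempfile


def load_done(checkpoint_path):
    """Return the set of file paths already recorded as indexed."""
    if not os.path.exists(checkpoint_path):
        return set()
    with open(checkpoint_path, "r", encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}


def mark_done(checkpoint_path, file_path):
    """Append one finished file to the checkpoint."""
    with open(checkpoint_path, "a", encoding="utf-8") as f:
        f.write(file_path + "\n")


def pending_files(all_files, checkpoint_path):
    """Keep the original order, dropping files already in the checkpoint."""
    done = load_done(checkpoint_path)
    return [p for p in all_files if p not in done]


# Demo with made-up file names: one file finished, two still pending.
checkpoint = os.path.join(tempfile.mkdtemp(), "indexed_files.txt")
files = ["a.gz", "b.gz", "c.gz"]
mark_done(checkpoint, "a.gz")
remaining = pending_files(files, checkpoint)
print(remaining)  # ['b.gz', 'c.gz']
```

To fit this into the script, process_file could return its file_path and the parent process could call mark_done for each result yielded by executor.map, so that only one process ever writes to the checkpoint.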