# Architecture Deep Dive: The Scientific Compiler Concept
Most AI-for-science platforms use language models to generate scientific content. Hordago takes a different approach: we treat scientific workflows as compilation pipelines where the AI assists with code and configuration, but never touches the deterministic core.
## The Compiler Metaphor
A traditional compiler takes source code and produces machine code. The process is deterministic — the same input always produces the same output. You can inspect the intermediate representations, trace any output back to its source, and verify correctness at each stage.
Hordago applies this pattern to scientific workflows:
| Compiler Concept | Hordago Equivalent |
|---|---|
| Source code | Experimental data + analysis spec |
| Lexer / Parser | Data ingestion + validation |
| Intermediate representation | Typed scientific artifacts |
| Code generation | Report compilation |
| Debug symbols | Provenance manifests |
The key insight: the LLM is the programmer, not the compiler. AI helps you write the analysis spec and configure the pipeline. But once the spec is defined, execution is deterministic.
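To make that division concrete, here is a minimal sketch in stdlib Python. All names are hypothetical, not Hordago's actual API: the spec is plain data an AI assistant might help draft, while execution is a pure function of spec plus input, so identical runs yield identical artifacts.

```python
import hashlib
import json

def run_pipeline(spec: dict, data: list[dict]) -> dict:
    """Deterministically execute an analysis spec: same inputs, same artifact."""
    # Canonical serialization so the input hash is stable across runs.
    payload = json.dumps({"spec": spec, "data": data}, sort_keys=True,
                         separators=(",", ":"))
    input_hash = hashlib.sha256(payload.encode()).hexdigest()
    values = [row[spec["field"]] for row in data]
    result = sum(values) / len(values)  # the spec requests a mean
    return {"analysis": spec["analysis"], "result": result,
            "input_hash": input_hash}

spec = {"analysis": "mean", "field": "expression"}  # the AI-assisted part
data = [{"expression": 1.0}, {"expression": 3.0}]   # experimental data

first = run_pipeline(spec, data)
second = run_pipeline(spec, data)
assert first == second  # deterministic core: identical runs, identical artifacts
```

The AI's influence ends at `spec`; everything after that line is traceable computation.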
If you’re a bench scientist, think of it like a protocol that always runs the same way — your data in, validated figures out. The “compiler” is just the automated version of that reliability.
## Three-Layer Architecture
Hordago separates concerns into three layers:
### Layer 1: Domain Operating Systems (DomainOS)
Thin product shells tailored to specific scientific domains: CRISPRos for gene editing, GWASos for population genetics. Each DomainOS provides domain-specific UI, terminology, and workflows — but delegates computation to shared engines.
DomainOS products are intentionally thin. They define what a workflow does in domain terms. They don’t implement how the computation runs.
### Layer 2: Shared Scientific Engines
Reusable computation engines that power multiple DomainOS products:
- Figure Engine — Generates publication-quality figures from data artifacts. Same engine powers figures in CRISPRos and GWASos.
- Statistics Engine — Runs statistical tests and produces typed results. Supports multiple testing correction, effect size calculation, power analysis.
- Provenance Engine — Tracks every input, transformation, and output. Generates audit trails and provenance manifests.
- biocontext7 — Resolves bioinformatics tool references. Ensures AI assistants use real tools with correct APIs.
When we improve an engine, every DomainOS product benefits. When we add a new DomainOS, it ships with mature, tested engines from day one.
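A sketch of the layering (hypothetical names, not the real engine API): the DomainOS shell translates domain wording into a call on a shared engine, which returns a typed artifact. Here the Statistics Engine's effect-size role is stood in for by a stdlib-only Cohen's d calculation.

```python
import math
from dataclasses import dataclass
from statistics import mean, stdev

@dataclass(frozen=True)
class EffectSizeResult:
    """Typed scientific artifact returned by the shared engine."""
    metric: str
    value: float
    n_a: int
    n_b: int

def cohens_d(group_a: list[float], group_b: list[float]) -> EffectSizeResult:
    """Shared-engine computation: Cohen's d with a pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * stdev(group_a) ** 2 +
                  (n_b - 1) * stdev(group_b) ** 2) / (n_a + n_b - 2)
    d = (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)
    return EffectSizeResult("cohens_d", d, n_a, n_b)

def crispros_compare_edits(treated: list[float],
                           control: list[float]) -> EffectSizeResult:
    """Thin DomainOS shell: domain terms in, shared engine does the math."""
    return cohens_d(treated, control)

result = crispros_compare_edits([2.0, 4.0, 6.0], [1.0, 3.0, 5.0])
```

Because the shell only renames and delegates, swapping in a better engine implementation improves every DomainOS product at once.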
### Layer 3: Execution Infrastructure
The platform layer handles deployment, scaling, and orchestration. Scientific workflows run on reproducible infrastructure — containerized environments with pinned dependencies and deterministic builds.
## Why Not Just Use an LLM?
Language models are excellent at generating plausible text. That’s the problem. In science, plausible and correct are different things.
Consider a Manhattan plot for a genome-wide association study (GWAS). An LLM could describe what the plot should look like. But generating the actual plot requires:
- Reading the summary statistics file (specific format, specific columns)
- Computing -log10(p-values) for each variant
- Mapping variants to chromosomal positions
- Applying genome-wide significance thresholds
- Rendering the figure with correct axes and labels
Each step is a deterministic computation. If the LLM generates any of these numbers, you can’t trust the figure. If the compiler produces them from your data, you can trace every dot on the plot back to a row in your input file.
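The deterministic steps above (minus rendering) fit in a few lines of stdlib Python. The column names and file snippet here are hypothetical; real summary-statistics formats vary.

```python
import csv
import io
import math

GENOME_WIDE_SIGNIFICANCE = 5e-8  # conventional GWAS threshold

def manhattan_points(rows):
    """Turn parsed summary statistics into plot-ready coordinates."""
    points = []
    for row in rows:
        p = float(row["p"])
        points.append({
            "chrom": row["chrom"],
            "pos": int(row["pos"]),                 # chromosomal position
            "neg_log10_p": -math.log10(p),          # y-axis value
            "significant": p < GENOME_WIDE_SIGNIFICANCE,
        })
    return points

# Hypothetical tab-separated summary-statistics snippet.
tsv = "chrom\tpos\tp\n1\t752566\t4e-9\n2\t1234567\t0.01\n"
points = manhattan_points(csv.DictReader(io.StringIO(tsv), delimiter="\t"))
```

Every dictionary in `points` maps back to exactly one row of the input file, which is what makes each dot on the plot auditable.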
## Provenance as a First-Class Concept
Every artifact produced by Hordago includes a provenance manifest — a machine-readable record of:
- Inputs: What data was used, where it came from, when it was accessed
- Transformations: Which engine processed it, which version, which parameters
- Outputs: What was produced, checksums, timestamps
This isn’t logging. It’s the audit trail that makes reproducibility possible. When a reviewer asks “how did you compute this p-value?” the answer is in the manifest, not in a researcher’s memory.
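A manifest along these lines can be built from content hashes alone. The field names and version strings below are illustrative, not Hordago's actual schema.

```python
import hashlib
import json
from datetime import datetime, timezone

def sha256_hex(blob: bytes) -> str:
    return hashlib.sha256(blob).hexdigest()

def build_manifest(input_blob: bytes, engine: str, version: str,
                   params: dict, output_blob: bytes) -> dict:
    """Machine-readable provenance record for one transformation."""
    return {
        "inputs": [{"sha256": sha256_hex(input_blob)}],
        "transformation": {"engine": engine, "version": version,
                           "parameters": params},
        "outputs": [{"sha256": sha256_hex(output_blob)}],
        "generated_at": datetime.now(timezone.utc).isoformat(),
    }

manifest = build_manifest(
    input_blob=b"variant\tp\nrs123\t4e-9\n",
    engine="statistics-engine", version="1.4.2",
    params={"correction": "bonferroni"},
    output_blob=b'{"p_adjusted": 8e-9}',
)
print(json.dumps(manifest, indent=2))
```

Checksums tie each output to its exact inputs, so a reviewer can verify the chain without trusting anyone's memory.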
## Building in the Open
Hordago’s scientific engines are open source. The compilation pipeline, provenance system, and tool registry (biocontext7) are all available on GitHub.
We believe scientific infrastructure should be transparent and auditable. If you can’t inspect the pipeline that produced your results, you can’t trust the results.
- GitHub: Hordago-Labs
- biocontext7: biocontext7.com
- Perspectives: Standards for AI in science
Hordago Labs builds evidence-first AI tools for life sciences. Every claim traceable. Every figure reproducible.