# Architecture Deep Dive: The Scientific Compiler Concept
Most AI-for-science platforms use language models to generate scientific content. Hordago takes a different approach: we treat scientific workflows as compilation pipelines where the AI assists with code and configuration, but never touches the deterministic core.
## The Compiler Metaphor
A traditional compiler takes source code and produces machine code. The process is deterministic — the same input always produces the same output. You can inspect the intermediate representations, trace any output back to its source, and verify correctness at each stage.
Hordago applies this pattern to scientific workflows:
| Compiler Concept | Hordago Equivalent |
|---|---|
| Source code | Experimental data + analysis spec |
| Lexer / Parser | Data ingestion + validation |
| Intermediate representation | Typed scientific artifacts |
| Code generation | Report compilation |
| Debug symbols | Provenance manifests |
The key insight: the LLM is the programmer, not the compiler. AI helps you write the analysis spec and configure the pipeline. But once the spec is defined, execution is deterministic.
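To make that division concrete, here is a minimal sketch in stdlib Python. All names are hypothetical, not Hordago's actual API: the spec is plain data an AI assistant might help draft, while execution is a pure function of spec plus input, so identical runs yield identical artifacts.

```python
import hashlib
import json

def run_pipeline(spec: dict, data: list[dict]) -> dict:
    """Deterministically execute an analysis spec: same inputs, same artifact."""
    # Canonical serialization so the input hash is stable across runs.
    payload = json.dumps({"spec": spec, "data": data}, sort_keys=True,
                         separators=(",", ":"))
    input_hash = hashlib.sha256(payload.encode()).hexdigest()
    values = [row[spec["field"]] for row in data]
    result = sum(values) / len(values)  # the spec requests a mean
    return {"analysis": spec["analysis"], "result": result,
            "input_hash": input_hash}

spec = {"analysis": "mean", "field": "expression"}  # the AI-assisted part
data = [{"expression": 1.0}, {"expression": 3.0}]   # experimental data

first = run_pipeline(spec, data)
second = run_pipeline(spec, data)
assert first == second  # deterministic core: identical runs, identical artifacts
```

The AI's influence ends at `spec`; everything after that line is traceable computation.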
If you’re a bench scientist, think of it like a protocol that always runs the same way — your data in, validated figures out. The “compiler” is just the automated version of that reliability.
## Three-Layer Architecture
Hordago separates concerns into three layers:
### Layer 1: Domain Operating Systems (DomainOS)
Thin product shells tailored to specific scientific domains: CRISPRos for gene editing, GWASos for population genetics. Each DomainOS provides domain-specific UI, terminology, and workflows — but delegates computation to shared engines.
DomainOS products are intentionally thin. They define what a workflow does in domain terms. They don’t implement how the computation runs.
### Layer 2: Shared Scientific Engines
Reusable computation engines that power multiple DomainOS products:
- Figure Engine — Generates publication-quality figures from data artifacts. Same engine powers figures in CRISPRos and GWASos.
- Statistics Engine — Runs statistical tests and produces typed results. Supports multiple testing correction, effect size calculation, power analysis.
- Provenance Engine — Tracks every input, transformation, and output. Generates audit trails and provenance manifests.
- biocontext7 — Resolves bioinformatics tool references. Ensures AI assistants use real tools with correct APIs.
When we improve an engine, every DomainOS product benefits. When we add a new DomainOS, it ships with mature, tested engines from day one.
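A sketch of the layering (hypothetical names, not the real engine API): the DomainOS shell translates domain wording into a call on a shared engine, which returns a typed artifact. Here the Statistics Engine's effect-size role is stood in for by a stdlib-only Cohen's d calculation.

```python
import math
from dataclasses import dataclass
from statistics import mean, stdev

@dataclass(frozen=True)
class EffectSizeResult:
    """Typed scientific artifact returned by the shared engine."""
    metric: str
    value: float
    n_a: int
    n_b: int

def cohens_d(group_a: list[float], group_b: list[float]) -> EffectSizeResult:
    """Shared-engine computation: Cohen's d with a pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * stdev(group_a) ** 2 +
                  (n_b - 1) * stdev(group_b) ** 2) / (n_a + n_b - 2)
    d = (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)
    return EffectSizeResult("cohens_d", d, n_a, n_b)

def crispros_compare_edits(treated: list[float],
                           control: list[float]) -> EffectSizeResult:
    """Thin DomainOS shell: domain terms in, shared engine does the math."""
    return cohens_d(treated, control)

result = crispros_compare_edits([2.0, 4.0, 6.0], [1.0, 3.0, 5.0])
```

Because the shell only renames and delegates, swapping in a better engine implementation improves every DomainOS product at once.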
### Layer 3: Execution Infrastructure
The platform layer handles deployment, scaling, and orchestration. Scientific workflows run on reproducible infrastructure — containerized environments with pinned dependencies and deterministic builds.
## Why Not Just Use an LLM?
Language models are excellent at generating plausible text. That’s the problem. In science, plausible and correct are different things.
Consider a Manhattan plot for a genome-wide association study (GWAS). An LLM could describe what the plot should look like. But generating the actual plot requires:
- Reading the summary statistics file (specific format, specific columns)
- Computing -log10(p-values) for each variant
- Mapping variants to chromosomal positions
- Applying genome-wide significance thresholds
- Rendering the figure with correct axes and labels
Each step is a deterministic computation. If the LLM generates any of these numbers, you can’t trust the figure. If the compiler produces them from your data, you can trace every dot on the plot back to a row in your input file.
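The deterministic steps above (minus rendering) fit in a few lines of stdlib Python. The column names and file snippet here are hypothetical; real summary-statistics formats vary.

```python
import csv
import io
import math

GENOME_WIDE_SIGNIFICANCE = 5e-8  # conventional GWAS threshold

def manhattan_points(rows):
    """Turn parsed summary statistics into plot-ready coordinates."""
    points = []
    for row in rows:
        p = float(row["p"])
        points.append({
            "chrom": row["chrom"],
            "pos": int(row["pos"]),                 # chromosomal position
            "neg_log10_p": -math.log10(p),          # y-axis value
            "significant": p < GENOME_WIDE_SIGNIFICANCE,
        })
    return points

# Hypothetical tab-separated summary-statistics snippet.
tsv = "chrom\tpos\tp\n1\t752566\t4e-9\n2\t1234567\t0.01\n"
points = manhattan_points(csv.DictReader(io.StringIO(tsv), delimiter="\t"))
```

Every dictionary in `points` maps back to exactly one row of the input file, which is what makes each dot on the plot auditable.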
## Provenance as a First-Class Concept
Every artifact produced by Hordago includes a provenance manifest — a machine-readable record of:
- Inputs: What data was used, where it came from, when it was accessed
- Transformations: Which engine processed it, which version, which parameters
- Outputs: What was produced, checksums, timestamps
This isn’t logging. It’s the audit trail that makes reproducibility possible. When a reviewer asks “how did you compute this p-value?” the answer is in the manifest, not in a researcher’s memory.
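A manifest along these lines can be built from content hashes alone. The field names and version strings below are illustrative, not Hordago's actual schema.

```python
import hashlib
import json
from datetime import datetime, timezone

def sha256_hex(blob: bytes) -> str:
    return hashlib.sha256(blob).hexdigest()

def build_manifest(input_blob: bytes, engine: str, version: str,
                   params: dict, output_blob: bytes) -> dict:
    """Machine-readable provenance record for one transformation."""
    return {
        "inputs": [{"sha256": sha256_hex(input_blob)}],
        "transformation": {"engine": engine, "version": version,
                           "parameters": params},
        "outputs": [{"sha256": sha256_hex(output_blob)}],
        "generated_at": datetime.now(timezone.utc).isoformat(),
    }

manifest = build_manifest(
    input_blob=b"variant\tp\nrs123\t4e-9\n",
    engine="statistics-engine", version="1.4.2",
    params={"correction": "bonferroni"},
    output_blob=b'{"p_adjusted": 8e-9}',
)
print(json.dumps(manifest, indent=2))
```

Checksums tie each output to its exact inputs, so a reviewer can verify the chain without trusting anyone's memory.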
## Building in the Open
Hordago’s scientific engines are open source. The compilation pipeline, provenance system, and tool registry (biocontext7) are all available on GitHub.
We believe scientific infrastructure should be transparent and auditable. If you can’t inspect the pipeline that produced your results, you can’t trust the results.
- GitHub: Hordago-Labs
- biocontext7: biocontext7.com
- Perspectives: Standards for AI in science
Hordago Labs builds evidence-first AI tools for life sciences. Every claim traceable. Every figure reproducible.