Reproducible vs Black Box
What reproducibility actually requires in AI-assisted science: provenance manifests, deterministic scripts, cryptographic hashing, and grounded text validation.
Reproducibility is an engineering problem, not a policy problem
Every scientific journal, funding agency, and regulatory body now emphasizes reproducibility. Policies have been written. Guidelines have been published. And yet, the reproducibility crisis persists — not because researchers lack good intentions, but because most computational tools make reproducibility possible in theory and impractical in reality.
The gap between a reproducible analysis and a black-box analysis is not about whether you could reproduce the work. It is about whether your tools make reproduction the default — or an afterthought requiring heroic effort.
A reproducible workflow is one where the effort to reproduce is less than the effort to doubt.
What reproducibility actually requires
Reproducibility in computational research is not a single property. It is a stack of requirements, each building on the last. Remove any layer and the whole structure becomes unreliable.
The reproducibility stack
| Layer | Requirement | What it means |
|---|---|---|
| Data | Immutable inputs with integrity verification | Raw data files with SHA-256 checksums; any modification creates a new version |
| Environment | Exact computational environment specification | OS, language version, library versions, system dependencies — all pinned |
| Code | Version-controlled transformation scripts | Git-tracked code with meaningful commit history; no “magic” notebooks |
| Pipeline | Deterministic execution order | A build system or workflow manager that runs steps in the correct order |
| Provenance | Complete lineage tracking | A manifest recording every input, transformation, output, timestamp, and hash |
| Validation | Automated output verification | Tests that confirm outputs match expected properties (dimensions, ranges, types) |
| Text | Grounded claims | Every sentence in the output references a specific artifact in the manifest |
Most tools address one or two layers. A version-controlled notebook handles code but not environment. A container handles environment but not provenance. A writing assistant handles text but not grounding.
Reproducibility requires all seven layers working together.
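The data layer, for example, can be enforced with a few lines of standard-library code. A sketch (file paths are whatever your pipeline uses; the file is streamed through SHA-256 so large datasets never load fully into memory):

```python
import hashlib
from pathlib import Path


def sha256_file(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_inputs(checksums):
    """Compare recorded checksums against files on disk.

    `checksums` maps path -> expected hex digest; returns a dict of
    mismatches as {path: (expected, actual)} — empty means verified.
    """
    return {
        path: (expected, actual)
        for path, expected in checksums.items()
        if (actual := sha256_file(path)) != expected
    }
```

Any modification to a raw input changes its digest, so "immutable inputs with integrity verification" reduces to checking that this dict comes back empty before each run.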
The black-box spectrum
“Black box” is not binary. Tools fall on a spectrum from fully transparent to completely opaque, and many tools that appear transparent have opaque components buried in their pipelines.
Recognizing black-box behavior
A tool exhibits black-box behavior when any of the following are true:
- Non-deterministic outputs: Running the same analysis twice with the same inputs produces different results (common with AI-generated text, stochastic algorithms without fixed seeds, or API calls to services that update their models)
- Missing provenance: You cannot determine what version of the code, what version of the model, or what configuration produced a specific output
- Opaque transformations: A step in the pipeline transforms data in a way you cannot inspect or verify (common with hosted APIs that accept data and return results)
- Implicit dependencies: The analysis depends on external state — a database that has been updated, a model that has been retrained, a service that has changed its behavior
If you cannot answer “what exactly produced this output?” with a specific commit hash, input hash, and environment specification, you are operating a black box.
The black-box audit
For each step in your computational pipeline, answer these questions:
- Is the output deterministic? Given identical inputs and environment, will this step always produce byte-identical output?
- Is the transformation inspectable? Can you read the source code that performs this step?
- Is the environment reproducible? Can you recreate the exact computational environment where this step ran?
- Is the provenance recorded? Is there a machine-readable record of inputs, outputs, versions, and timestamps?
- Is the output verifiable? Can you confirm that the output has expected properties without re-running the analysis?
Any “no” answer identifies a reproducibility gap.
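The first audit question lends itself to a mechanical check: run the step twice from a clean start and compare output hashes. A sketch, under the assumption that the step is a command-line program that writes a single output file (both arguments are caller-supplied placeholders, not a fixed interface):

```python
import hashlib
import subprocess
from pathlib import Path


def is_deterministic(command, output_path, runs=2):
    """Run a pipeline step several times and report whether its output
    file is byte-identical across runs.

    `command` is an argv list for the step; `output_path` is the file
    it writes. Each run starts a fresh process, so no hidden state
    survives between runs.
    """
    hashes = set()
    for _ in range(runs):
        subprocess.run(command, check=True)
        hashes.add(hashlib.sha256(Path(output_path).read_bytes()).hexdigest())
    return len(hashes) == 1
```

A step that fails this check belongs at the top of your remediation list: everything downstream of a non-deterministic step inherits its irreproducibility.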
Provenance manifests: the engineering solution
A provenance manifest is a structured record of everything that contributed to a research output. It is the engineering solution to the reproducibility problem — not a policy document, but a machine-readable artifact that can be automatically generated and verified.
What a manifest contains
```json
{
  "version": "1.0",
  "created": "2026-03-13T10:30:00Z",
  "inputs": [
    {
      "file": "data/rnaseq_counts.csv",
      "sha256": "a3f2b8c9d1e4f5a6b7c8d9e0f1a2b3c4d5e6f7a8",
      "rows": 58347,
      "columns": 12
    }
  ],
  "environment": {
    "os": "Ubuntu 22.04",
    "python": "3.11.7",
    "packages": "requirements.lock"
  },
  "steps": [
    {
      "script": "scripts/normalize.py",
      "git_sha": "abc123",
      "duration_seconds": 12.4,
      "output_hash": "b4c5d6e7f8a9b0c1d2e3f4a5b6c7d8e9f0a1b2c3"
    }
  ],
  "outputs": [
    {
      "file": "figures/volcano_plot.png",
      "sha256": "c5d6e7f8a9b0c1d2e3f4a5b6c7d8e9f0a1b2c3d4",
      "referenced_in": ["results.md:L42", "results.md:L67"]
    }
  ]
}
```
This manifest serves multiple purposes:
- Verification: Anyone can check that the output hashes match, confirming the outputs haven’t been modified
- Reproduction: The environment and script versions are specified precisely enough to recreate the analysis
- Audit: The complete chain from raw data to final output is documented in a single, machine-readable file
- Grounding: The `referenced_in` field connects each artifact to specific locations in the text, enabling automated grounded text validation
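A manifest like the one above does not need to be written by hand; it can be assembled at the end of every pipeline run. A minimal standard-library sketch (field names mirror the example; the `git rev-parse` call assumes the pipeline runs inside a git checkout, and the environment dict is supplied by the caller):

```python
import hashlib
import subprocess
from datetime import datetime, timezone
from pathlib import Path


def file_sha256(path):
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def build_manifest(inputs, outputs, environment):
    """Assemble a machine-readable provenance manifest.

    `inputs` and `outputs` are lists of file paths; `environment` is a
    dict such as {"os": ..., "python": ..., "packages": ...}. The commit
    hash is read from the current git checkout (empty if unavailable).
    """
    git_sha = subprocess.run(
        ["git", "rev-parse", "HEAD"], capture_output=True, text=True
    ).stdout.strip()
    return {
        "version": "1.0",
        "created": datetime.now(timezone.utc).isoformat(),
        "git_sha": git_sha,
        "environment": environment,
        "inputs": [{"file": p, "sha256": file_sha256(p)} for p in inputs],
        "outputs": [{"file": p, "sha256": file_sha256(p)} for p in outputs],
    }
```

Because the manifest is generated, not transcribed, it cannot drift from what actually ran: the hashes come from the files on disk and the commit hash from the repository itself.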
Deterministic scripts vs interactive notebooks
Interactive notebooks are popular for exploratory analysis, and for good reason — they enable rapid iteration and visual feedback. But they introduce reproducibility hazards that are well-documented and widely experienced:
- Execution order ambiguity: Cells can be run in any order, and the state of the kernel depends on the order of execution
- Hidden state: Variables persist between cells, creating implicit dependencies that are not visible in the notebook’s source
- Environment drift: The notebook runs in whatever environment is currently active, which may differ from the environment when the analysis was first performed
| Property | Interactive notebook | Deterministic script |
|---|---|---|
| Execution order | Arbitrary (user-controlled) | Fixed (script order or pipeline DAG) |
| Hidden state | Possible (persistent kernel variables) | Impossible (each run starts fresh) |
| Reproducibility | Requires careful discipline | Default behavior |
| Provenance | Manual (if tracked at all) | Automatic (build system + manifest) |
| Collaboration | Merge conflicts common | Standard git workflows |
This does not mean notebooks should be abandoned. It means they belong in the exploration phase of research, not the production phase. When an analysis moves from exploration to publication, it should be converted to deterministic scripts with provenance tracking.
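What such a conversion produces in practice: explicit inputs and outputs, a fixed seed, and a printed output hash, so every run starts from a clean state and ends with a verifiable artifact. A sketch only (the sampling step is a stand-in for a real analysis):

```python
import argparse
import hashlib
import random
from pathlib import Path


def main():
    # Explicit inputs, explicit outputs, fixed seed: no hidden kernel state.
    parser = argparse.ArgumentParser(description="Deterministic analysis step")
    parser.add_argument("--input", required=True)
    parser.add_argument("--output", required=True)
    parser.add_argument("--seed", type=int, default=42)
    args = parser.parse_args()

    random.seed(args.seed)  # fix every source of randomness up front

    data = Path(args.input).read_text().splitlines()
    sample = sorted(random.sample(data, k=min(5, len(data))))
    Path(args.output).write_text("\n".join(sample) + "\n")

    # Record the output hash so the run can be verified without re-execution.
    out_hash = hashlib.sha256(Path(args.output).read_bytes()).hexdigest()
    print(f"{args.output}\t{out_hash}")


if __name__ == "__main__":
    main()
```

Run twice with the same inputs and seed, this script produces byte-identical output, which is exactly the property the notebook cannot guarantee.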
Grounded text validation
The final layer of the reproducibility stack addresses a subtle but critical problem: even when data processing is fully reproducible, the text that describes the results can diverge from what the data actually shows.
Grounded text validation is an automated check that verifies every claim in a document references a specific artifact:
- “Expression of TP53 was significantly upregulated (Figure 2A, p = 0.003)” — valid, references Figure 2A and a specific p-value that can be verified against the statistics JSON
- “Our results suggest a promising therapeutic target” — invalid, makes a claim without referencing any specific artifact
- “We observed differential expression across 1,247 genes” — valid only if the number 1,247 appears in the statistics JSON or source data
Grounded text validation is not about limiting what you can write. It is about ensuring that what you write can be verified.
This validation can be automated. A script can parse the text, identify claims that reference artifacts, check those references against the manifest, and flag any claim that lacks a concrete reference. The result is a document where every factual statement is traceable to its source — not because the researcher was meticulous, but because the tooling enforced it.
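A first approximation of such a checker fits in a few lines: extract the numbers a document reports and flag any that do not appear in the pipeline's statistics output. This is a sketch, not a claim grammar — real validation needs to handle figure references, units, and rounding, and `stats_values` is an assumed set of numbers extracted from the statistics JSON:

```python
import re


def find_ungrounded_numbers(text, stats_values):
    """Flag numeric claims in `text` that do not appear in `stats_values`.

    `stats_values` is a set of floats drawn from the pipeline's statistics
    output. Comma-grouped integers ("1,247") and decimals ("0.003") are
    both recognized.
    """
    claims = re.findall(r"\d[\d,]*\.?\d*", text)
    ungrounded = []
    for claim in claims:
        value = float(claim.replace(",", ""))
        if value not in stats_values:
            ungrounded.append(claim)
    return ungrounded
```

Wired into a pre-submission check, a non-empty return value blocks the build, so an unverifiable number never reaches the manuscript unnoticed.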
Building a reproducible pipeline
If your current pipeline has black-box components, addressing them does not require rebuilding everything at once. Start with the highest-risk steps: those that produce outputs appearing in publications or regulatory submissions.
For each step:
- Replace hosted API calls with self-hosted alternatives where possible
- Pin all dependencies (language version, library versions, system packages)
- Add provenance tracking (input hashes, output hashes, timestamps, git SHAs)
- Write validation tests that check output properties
- Convert notebooks to scripts for production runs
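The validation-test step can be illustrated with a structural check on a tabular output: confirm the header, the row count, and value ranges without re-running the analysis. A sketch (the schema and the non-negativity rule are examples for a counts table, not a general validator):

```python
import csv


def validate_counts_table(path, expected_columns, min_rows=1):
    """Check structural properties of a pipeline output CSV.

    Verifies the header matches `expected_columns` (the schema downstream
    steps assume), that at least `min_rows` data rows exist, and that all
    count values are non-negative. Raises AssertionError on any failure.
    """
    with open(path, newline="") as fh:
        reader = csv.reader(fh)
        header = next(reader)
        assert header == expected_columns, f"unexpected header: {header}"
        rows = list(reader)
    assert len(rows) >= min_rows, f"too few rows: {len(rows)}"
    for row in rows:
        for value in row[1:]:  # first column is an identifier, not a count
            assert float(value) >= 0, f"negative count: {value}"
    return True
```

Checks like this are cheap to run on every pipeline execution, which is what makes the "verifiable without re-running" property of the audit practical.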
The tools and practices for reproducible computation are mature. Containers, lockfiles, build systems, and provenance manifests are well-understood engineering solutions. The remaining challenge is adoption — choosing tools that make reproducibility the default rather than an optional add-on.
Hordago Labs builds reproducible, evidence-first pipelines for biological research. Every output includes a provenance manifest with cryptographic verification. See our platform or read about why open-source matters for scientific reproducibility.