Reproducible vs Black Box
What reproducibility actually requires in AI-assisted science: provenance manifests, deterministic scripts, cryptographic hashing, and grounded text validation.
Reproducibility is an engineering problem, not a policy problem
Every scientific journal, funding agency, and regulatory body now emphasizes reproducibility. Policies have been written. Guidelines have been published. And yet, the reproducibility crisis persists — not because researchers lack good intentions, but because most computational tools make reproducibility possible in theory and impractical in reality.
The gap between a reproducible analysis and a black-box analysis is not about whether you could reproduce the work. It is about whether your tools make reproduction the default — or an afterthought requiring heroic effort.
A reproducible workflow is one where the effort to reproduce is less than the effort to doubt.
What reproducibility actually requires
Reproducibility in computational research is not a single property. It is a stack of requirements, each building on the last. Remove any layer and the whole structure becomes unreliable.
The reproducibility stack
| Layer | Requirement | What it means |
|---|---|---|
| Data | Immutable inputs with integrity verification | Raw data files with SHA-256 checksums; any modification creates a new version |
| Environment | Exact computational environment specification | OS, language version, library versions, system dependencies — all pinned |
| Code | Version-controlled transformation scripts | Git-tracked code with meaningful commit history; no “magic” notebooks |
| Pipeline | Deterministic execution order | A build system or workflow manager that runs steps in the correct order |
| Provenance | Complete lineage tracking | A manifest recording every input, transformation, output, timestamp, and hash |
| Validation | Automated output verification | Tests that confirm outputs match expected properties (dimensions, ranges, types) |
| Text | Grounded claims | Every sentence in the output references a specific artifact in the manifest |
Most tools address one or two layers. A version-controlled notebook handles code but not environment. A container handles environment but not provenance. A writing assistant handles text but not grounding.
Reproducibility requires all seven layers working together.
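The data layer, for example, can be enforced with a few lines of standard-library code. A sketch (file paths are whatever your pipeline uses; the file is streamed through SHA-256 so large datasets never load fully into memory):

```python
import hashlib
from pathlib import Path


def sha256_file(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_inputs(checksums):
    """Compare recorded checksums against files on disk.

    `checksums` maps path -> expected hex digest; returns a dict of
    mismatches as {path: (expected, actual)} — empty means verified.
    """
    return {
        path: (expected, actual)
        for path, expected in checksums.items()
        if (actual := sha256_file(path)) != expected
    }
```

Any modification to a raw input changes its digest, so "immutable inputs with integrity verification" reduces to checking that this dict comes back empty before each run.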
The black-box spectrum
“Black box” is not binary. Tools fall on a spectrum from fully transparent to completely opaque, and many tools that appear transparent have opaque components buried in their pipelines.
Recognizing black-box behavior
A tool exhibits black-box behavior when any of the following are true:
- Non-deterministic outputs: Running the same analysis twice with the same inputs produces different results (common with AI-generated text, stochastic algorithms without fixed seeds, or API calls to services that update their models)
- Missing provenance: You cannot determine what version of the code, what version of the model, or what configuration produced a specific output
- Opaque transformations: A step in the pipeline transforms data in a way you cannot inspect or verify (common with hosted APIs that accept data and return results)
- Implicit dependencies: The analysis depends on external state — a database that has been updated, a model that has been retrained, a service that has changed its behavior
If you cannot answer “what exactly produced this output?” with a specific commit hash, input hash, and environment specification, you are operating a black box.
The black-box audit
For each step in your computational pipeline, answer these questions:
- Is the output deterministic? Given identical inputs and environment, will this step always produce byte-identical output?
- Is the transformation inspectable? Can you read the source code that performs this step?
- Is the environment reproducible? Can you recreate the exact computational environment where this step ran?
- Is the provenance recorded? Is there a machine-readable record of inputs, outputs, versions, and timestamps?
- Is the output verifiable? Can you confirm that the output has expected properties without re-running the analysis?
Any “no” answer identifies a reproducibility gap.
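The first audit question lends itself to a mechanical check: run the step twice from a clean start and compare output hashes. A sketch, under the assumption that the step is a command-line program that writes a single output file (both arguments are caller-supplied placeholders, not a fixed interface):

```python
import hashlib
import subprocess
from pathlib import Path


def is_deterministic(command, output_path, runs=2):
    """Run a pipeline step several times and report whether its output
    file is byte-identical across runs.

    `command` is an argv list for the step; `output_path` is the file
    it writes. Each run starts a fresh process, so no hidden state
    survives between runs.
    """
    hashes = set()
    for _ in range(runs):
        subprocess.run(command, check=True)
        hashes.add(hashlib.sha256(Path(output_path).read_bytes()).hexdigest())
    return len(hashes) == 1
```

A step that fails this check belongs at the top of your remediation list: everything downstream of a non-deterministic step inherits its irreproducibility.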
Provenance manifests: the engineering solution
A provenance manifest is a structured record of everything that contributed to a research output. It is the engineering solution to the reproducibility problem — not a policy document, but a machine-readable artifact that can be automatically generated and verified.
What a manifest contains
```json
{
  "version": "1.0",
  "created": "2026-03-13T10:30:00Z",
  "inputs": [
    {
      "file": "data/rnaseq_counts.csv",
      "sha256": "a3f2b8c9d1e4f5a6b7c8d9e0f1a2b3c4d5e6f7a8",
      "rows": 58347,
      "columns": 12
    }
  ],
  "environment": {
    "os": "Ubuntu 22.04",
    "python": "3.11.7",
    "packages": "requirements.lock"
  },
  "steps": [
    {
      "script": "scripts/normalize.py",
      "git_sha": "abc123",
      "duration_seconds": 12.4,
      "output_hash": "b4c5d6e7f8a9b0c1d2e3f4a5b6c7d8e9f0a1b2c3"
    }
  ],
  "outputs": [
    {
      "file": "figures/volcano_plot.png",
      "sha256": "c5d6e7f8a9b0c1d2e3f4a5b6c7d8e9f0a1b2c3d4",
      "referenced_in": ["results.md:L42", "results.md:L67"]
    }
  ]
}
```
This manifest serves multiple purposes:
- Verification: Anyone can check that the output hashes match, confirming the outputs haven’t been modified
- Reproduction: The environment and script versions are specified precisely enough to recreate the analysis
- Audit: The complete chain from raw data to final output is documented in a single, machine-readable file
- Grounding: The `referenced_in` field connects each artifact to specific locations in the text, enabling automated grounded text validation
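A manifest like the one above does not need to be written by hand; it can be assembled at the end of every pipeline run. A minimal standard-library sketch (field names mirror the example; the `git rev-parse` call assumes the pipeline runs inside a git checkout, and the environment dict is supplied by the caller):

```python
import hashlib
import subprocess
from datetime import datetime, timezone
from pathlib import Path


def file_sha256(path):
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def build_manifest(inputs, outputs, environment):
    """Assemble a machine-readable provenance manifest.

    `inputs` and `outputs` are lists of file paths; `environment` is a
    dict such as {"os": ..., "python": ..., "packages": ...}. The commit
    hash is read from the current git checkout (empty if unavailable).
    """
    git_sha = subprocess.run(
        ["git", "rev-parse", "HEAD"], capture_output=True, text=True
    ).stdout.strip()
    return {
        "version": "1.0",
        "created": datetime.now(timezone.utc).isoformat(),
        "git_sha": git_sha,
        "environment": environment,
        "inputs": [{"file": p, "sha256": file_sha256(p)} for p in inputs],
        "outputs": [{"file": p, "sha256": file_sha256(p)} for p in outputs],
    }
```

Because the manifest is generated, not transcribed, it cannot drift from what actually ran: the hashes come from the files on disk and the commit hash from the repository itself.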
Deterministic scripts vs interactive notebooks
Interactive notebooks are popular for exploratory analysis, and for good reason — they enable rapid iteration and visual feedback. But they introduce reproducibility hazards that are well-documented and widely experienced:
- Execution order ambiguity: Cells can be run in any order, and the state of the kernel depends on the order of execution
- Hidden state: Variables persist between cells, creating implicit dependencies that are not visible in the notebook’s source
- Environment drift: The notebook runs in whatever environment is currently active, which may differ from the environment when the analysis was first performed
| Property | Interactive notebook | Deterministic script |
|---|---|---|
| Execution order | Arbitrary (user-controlled) | Fixed (script order or pipeline DAG) |
| Hidden state | Possible (persistent kernel variables) | Impossible (each run starts fresh) |
| Reproducibility | Requires careful discipline | Default behavior |
| Provenance | Manual (if tracked at all) | Automatic (build system + manifest) |
| Collaboration | Merge conflicts common | Standard git workflows |
This does not mean notebooks should be abandoned. It means they belong in the exploration phase of research, not the production phase. When an analysis moves from exploration to publication, it should be converted to deterministic scripts with provenance tracking.
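What such a conversion produces in practice: explicit inputs and outputs, a fixed seed, and a printed output hash, so every run starts from a clean state and ends with a verifiable artifact. A sketch only (the sampling step is a stand-in for a real analysis):

```python
import argparse
import hashlib
import random
from pathlib import Path


def main():
    # Explicit inputs, explicit outputs, fixed seed: no hidden kernel state.
    parser = argparse.ArgumentParser(description="Deterministic analysis step")
    parser.add_argument("--input", required=True)
    parser.add_argument("--output", required=True)
    parser.add_argument("--seed", type=int, default=42)
    args = parser.parse_args()

    random.seed(args.seed)  # fix every source of randomness up front

    data = Path(args.input).read_text().splitlines()
    sample = sorted(random.sample(data, k=min(5, len(data))))
    Path(args.output).write_text("\n".join(sample) + "\n")

    # Record the output hash so the run can be verified without re-execution.
    out_hash = hashlib.sha256(Path(args.output).read_bytes()).hexdigest()
    print(f"{args.output}\t{out_hash}")


if __name__ == "__main__":
    main()
```

Run twice with the same inputs and seed, this script produces byte-identical output, which is exactly the property the notebook cannot guarantee.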
Grounded text validation
The final layer of the reproducibility stack addresses a subtle but critical problem: even when data processing is fully reproducible, the text that describes the results can diverge from what the data actually shows.
Grounded text validation is an automated check that verifies every claim in a document references a specific artifact:
- “Expression of TP53 was significantly upregulated (Figure 2A, p = 0.003)” — valid, references Figure 2A and a specific p-value that can be verified against the statistics JSON
- “Our results suggest a promising therapeutic target” — invalid, makes a claim without referencing any specific artifact
- “We observed differential expression across 1,247 genes” — valid only if the number 1,247 appears in the statistics JSON or source data
Grounded text validation is not about limiting what you can write. It is about ensuring that what you write can be verified.
This validation can be automated. A script can parse the text, identify claims that reference artifacts, check those references against the manifest, and flag any claim that lacks a concrete reference. The result is a document where every factual statement is traceable to its source — not because the researcher was meticulous, but because the tooling enforced it.
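A first approximation of such a checker fits in a few lines: extract the numbers a document reports and flag any that do not appear in the pipeline's statistics output. This is a sketch, not a claim grammar — real validation needs to handle figure references, units, and rounding, and `stats_values` is an assumed set of numbers extracted from the statistics JSON:

```python
import re


def find_ungrounded_numbers(text, stats_values):
    """Flag numeric claims in `text` that do not appear in `stats_values`.

    `stats_values` is a set of floats drawn from the pipeline's statistics
    output. Comma-grouped integers ("1,247") and decimals ("0.003") are
    both recognized.
    """
    claims = re.findall(r"\d[\d,]*\.?\d*", text)
    ungrounded = []
    for claim in claims:
        value = float(claim.replace(",", ""))
        if value not in stats_values:
            ungrounded.append(claim)
    return ungrounded
```

Wired into a pre-submission check, a non-empty return value blocks the build, so an unverifiable number never reaches the manuscript unnoticed.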
Building a reproducible pipeline
If your current pipeline has black-box components, addressing them does not require rebuilding everything at once. Start with the highest-risk steps: those that produce outputs appearing in publications or regulatory submissions.
For each step:
- Replace hosted API calls with self-hosted alternatives where possible
- Pin all dependencies (language version, library versions, system packages)
- Add provenance tracking (input hashes, output hashes, timestamps, git SHAs)
- Write validation tests that check output properties
- Convert notebooks to scripts for production runs
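The validation-test step can be illustrated with a structural check on a tabular output: confirm the header, the row count, and value ranges without re-running the analysis. A sketch (the schema and the non-negativity rule are examples for a counts table, not a general validator):

```python
import csv


def validate_counts_table(path, expected_columns, min_rows=1):
    """Check structural properties of a pipeline output CSV.

    Verifies the header matches `expected_columns` (the schema downstream
    steps assume), that at least `min_rows` data rows exist, and that all
    count values are non-negative. Raises AssertionError on any failure.
    """
    with open(path, newline="") as fh:
        reader = csv.reader(fh)
        header = next(reader)
        assert header == expected_columns, f"unexpected header: {header}"
        rows = list(reader)
    assert len(rows) >= min_rows, f"too few rows: {len(rows)}"
    for row in rows:
        for value in row[1:]:  # first column is an identifier, not a count
            assert float(value) >= 0, f"negative count: {value}"
    return True
```

Checks like this are cheap to run on every pipeline execution, which is what makes the "verifiable without re-running" property of the audit practical.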
The tools and practices for reproducible computation are mature. Containers, lockfiles, build systems, and provenance manifests are well-understood engineering solutions. The remaining challenge is adoption — choosing tools that make reproducibility the default rather than an optional add-on.
Hordago Labs builds reproducible, evidence-first pipelines for biological research. Every output includes a provenance manifest with cryptographic verification. See our platform or read about why open-source matters for scientific reproducibility.