
Evidence-First vs Agentic Authorship

What each approach means for scientific research, why agentic authorship fails at the boundaries of truth, and how evidence-locked workflows prevent hallucination.

Jeff Jaureguy

The two architectures of AI-assisted science

Every AI tool for scientific research makes a fundamental architectural choice, whether its builders acknowledge it or not. That choice determines whether the tool amplifies scientific rigor or quietly undermines it.

On one side sits agentic authorship: the AI generates text, draws conclusions, and synthesizes narratives from its training data. The researcher reviews the output and decides whether it sounds right. On the other side sits evidence-first compilation: the AI assists with code generation, data transformation, and caption drafting, but every claim in the final output must reference a concrete artifact — a figure panel, a statistical test, a citation with a DOI.

The difference is not about capability. It is about where truth comes from.

This distinction matters more than model size, prompt engineering technique, or any other technical consideration. It determines whether errors in your research are visible or subtle — and in science, subtle errors are the dangerous ones.

What agentic authorship actually does

Agentic authorship treats the language model as a co-author. You provide a dataset and an intent — “write the results section for this RNA-seq experiment” — and the model produces polished scientific prose. The output reads well. The citations look plausible. The statistical language follows convention.

The problem is that language models are trained to produce plausible text, not true text. When a model writes “our analysis revealed a statistically significant upregulation of TP53 (p < 0.01),” it may be correct — or it may have confabulated the p-value, the gene name, or the direction of regulation. The prose sounds authoritative regardless.

Failure modes of agentic output

| Failure mode | Description | Detection difficulty |
| --- | --- | --- |
| Confabulated statistics | P-values, fold changes, or sample sizes that don’t match the actual data | Hard — requires manual cross-reference with source data |
| Phantom citations | References that look real but don’t exist, or exist but don’t support the claim | Hard — requires checking each DOI individually |
| Directional errors | Claiming upregulation when data shows downregulation, or vice versa | Moderate — requires comparing prose against figures |
| Inappropriate methods | Describing a statistical test that wasn’t actually run on the data | Hard — requires deep methodological knowledge |
| Narrative drift | Gradually shifting the interpretation away from what the data actually shows | Very hard — the prose reads convincingly |

Each of these failure modes shares a characteristic: the output looks correct to a casual reader. A reviewer skimming the manuscript may not catch them. Even the original researcher, who knows the data well, may miss errors when reading fluent, confident prose.

The most dangerous property of agentic authorship is that its errors look exactly like its successes.

What evidence-first compilation requires

Evidence-first compilation inverts the relationship between AI and data. Instead of asking the model to write about data, the system compiles outputs from data using deterministic pipelines. The AI’s role is constrained to specific, verifiable tasks:

  • Proposing specifications from researcher intent (e.g., turning “show me how TP53 expression varies across treatment groups” into a figure specification)
  • Generating transformation code that processes raw data into analysis-ready formats
  • Drafting captions from computed artifacts — but only by referencing specific figure panels, statistics, and data points
  • Repairing validation errors when a pipeline step fails
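As a rough sketch, the first of these tasks — turning researcher intent into a figure specification — amounts to producing a small structured object rather than prose. The schema and field names below are illustrative assumptions, not a real tool’s API:

```python
from dataclasses import dataclass

# Hypothetical figure specification. The field names are illustrative
# assumptions for this sketch, not a documented schema.
@dataclass
class FigureSpec:
    panel_id: str    # e.g. "2A" -- the panel a caption may later cite
    plot_type: str   # "boxplot", "scatter", ...
    gene: str        # variable of interest
    group_by: str    # experimental factor to compare across
    stat_test: str   # statistical test to run and report

# "Show me how TP53 expression varies across treatment groups" becomes:
spec = FigureSpec(
    panel_id="2A",
    plot_type="boxplot",
    gene="TP53",
    group_by="treatment_group",
    stat_test="welch_t_test",
)
print(spec.gene, spec.panel_id)
```

Because the model emits a spec instead of a conclusion, a deterministic pipeline — not the model — decides what the data actually shows.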

The critical constraint is grounded text validation: every sentence in the final output must reference a concrete artifact. “TP53 expression was significantly upregulated in the treatment group (Figure 2A, p = 0.003, Welch’s t-test)” is a valid statement because it points to a specific figure panel and a specific statistical result that can be independently verified.

A sentence like “Our results suggest a role for TP53 in treatment response” would fail validation because it makes a claim without referencing a specific artifact.
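A minimal sketch of such a validation gate, assuming a manifest that lists known artifact identifiers (the regex and manifest layout here are simplifying assumptions, not a real implementation):

```python
import re

# Sketch: a sentence passes grounded-text validation only if it cites
# at least one artifact present in the build manifest. Real systems
# would also check statistics and DOIs; this checks figure panels only.
ARTIFACT_PATTERN = re.compile(r"Figure \d+[A-Z]?")

def is_grounded(sentence: str, artifacts: set) -> bool:
    """Return True if the sentence references a known artifact."""
    refs = ARTIFACT_PATTERN.findall(sentence)
    return any(ref in artifacts for ref in refs)

manifest_artifacts = {"Figure 2A"}  # illustrative manifest contents

valid = is_grounded(
    "TP53 expression was significantly upregulated (Figure 2A, p = 0.003).",
    manifest_artifacts,
)
invalid = is_grounded(
    "Our results suggest a role for TP53 in treatment response.",
    manifest_artifacts,
)
print(valid, invalid)  # True False
```

The second sentence fails not because it is false, but because nothing in it can be checked — which is exactly the property the gate enforces.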

How evidence-locked workflows prevent hallucination

The mechanism is straightforward:

  1. Data enters the system as raw files (CSVs, FASTQ, imaging data)
  2. Deterministic scripts transform data into build artifacts: figures, source data tables, statistics JSON files
  3. Provenance manifests track every input, transformation, and output with cryptographic hashes (SHA-256)
  4. Caption generation is constrained to reference only artifacts that exist in the manifest
  5. Validation gates reject any output where a claim cannot be traced to an artifact

The language model never invents data. It never summarizes results from memory. It operates on concrete artifacts that exist in the file system and can be independently verified.
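Step 3 above — pinning every input and output with a cryptographic hash — can be sketched in a few lines. The manifest layout and the script name are illustrative assumptions, not a standardized format:

```python
import hashlib
import json
import tempfile
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Hash a file's bytes so the manifest pins its exact content."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

# Demo with a throwaway file standing in for raw data.
with tempfile.TemporaryDirectory() as d:
    raw = Path(d) / "counts.csv"
    raw.write_text("gene,count\nTP53,1024\n")
    entry = {
        "input": {"path": raw.name, "sha256": sha256_of(raw)},
        "tool": "normalize_counts.py",  # hypothetical pipeline step
    }
    print(json.dumps(entry, indent=2))
```

If the raw file changes by even one byte, the hash changes, and any downstream claim that cites the stale artifact fails to verify.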

| Property | Agentic authorship | Evidence-first compilation |
| --- | --- | --- |
| Source of claims | Model’s training data + prompt context | Computed artifacts with provenance |
| Error visibility | Errors look like correct output | Errors trigger validation failures |
| Reproducibility | Depends on prompt, temperature, model version | Deterministic — same inputs produce same outputs |
| Audit trail | Conversation logs (if saved) | Manifest with hashes, git commits, tool versions |
| Reviewer burden | Must verify every claim manually | Can verify by checking manifest against output |

Why this matters for your lab

If you are publishing research, submitting to regulatory bodies, or building evidence packages for clinical decisions, the distinction between these two approaches is not academic. Regulatory submissions require audit trails. Peer reviewers increasingly expect reproducible analyses. Funding agencies — including the NIH — are establishing requirements for computational reproducibility.

An agentic authorship tool can help you write faster. An evidence-first tool can help you write correctly — and prove it.

Speed without provenance is a liability. Every hour saved writing is lost tenfold if a reviewer finds a confabulated statistic.

Choosing the right approach for your work

Not every task requires evidence-first rigor. Drafting a grant application’s specific aims, brainstorming experimental designs, or summarizing literature for internal discussions — these are appropriate uses for agentic AI assistance, where the output will be heavily edited by domain experts before it matters.

But for any output that will be published, submitted, or used to make decisions, evidence-first compilation is not optional. It is the minimum standard for trustworthy AI-assisted science.

The question is not whether AI should be part of scientific workflows. It should. The question is whether the AI’s contribution can be verified — and whether your tools make verification easy or impossible.


Hordago Labs builds evidence-first tools for biological research. Our workflows produce auditable, reproducible outputs with full provenance tracking. Learn about our approach to reproducibility.