Evidence-First vs Agentic Authorship
What each approach means for scientific research, why agentic authorship fails at the boundaries of truth, and how evidence-locked workflows prevent hallucination.
The two architectures of AI-assisted science
Every AI tool for scientific research makes a fundamental architectural choice, whether its builders acknowledge it or not. That choice determines whether the tool amplifies scientific rigor or quietly undermines it.
On one side sits agentic authorship: the AI generates text, draws conclusions, and synthesizes narratives from its training data. The researcher reviews the output and decides whether it sounds right. On the other side sits evidence-first compilation: the AI assists with code generation, data transformation, and caption drafting, but every claim in the final output must reference a concrete artifact — a figure panel, a statistical test, a citation with a DOI.
The difference is not about capability. It is about where truth comes from.
This distinction matters more than model size, prompt engineering technique, or any other technical consideration. It determines whether errors in your research are visible or subtle — and in science, subtle errors are the dangerous ones.
What agentic authorship actually does
Agentic authorship treats the language model as a co-author. You provide a dataset and an intent — “write the results section for this RNA-seq experiment” — and the model produces polished scientific prose. The output reads well. The citations look plausible. The statistical language follows convention.
The problem is that language models are trained to produce plausible text, not true text. When a model writes “our analysis revealed a statistically significant upregulation of TP53 (p < 0.01),” it may be correct — or it may have confabulated the p-value, the gene name, or the direction of regulation. The prose sounds authoritative regardless.
Failure modes of agentic output
| Failure mode | Description | Detection difficulty |
|---|---|---|
| Confabulated statistics | P-values, fold changes, or sample sizes that don’t match the actual data | Hard — requires manual cross-reference with source data |
| Phantom citations | References that look real but don’t exist, or exist but don’t support the claim | Hard — requires checking each DOI individually |
| Directional errors | Claiming upregulation when data shows downregulation, or vice versa | Moderate — requires comparing prose against figures |
| Inappropriate methods | Describing a statistical test that wasn’t actually run on the data | Hard — requires deep methodological knowledge |
| Narrative drift | Gradually shifting the interpretation away from what the data actually shows | Very hard — the prose reads convincingly |
Each of these failure modes shares a characteristic: the output looks correct to a casual reader. A reviewer skimming the manuscript may not catch them. Even the original researcher, who knows the data well, may miss errors when reading fluent, confident prose.
The most dangerous property of agentic authorship is that its errors look exactly like its successes.
What evidence-first compilation requires
Evidence-first compilation inverts the relationship between AI and data. Instead of asking the model to write about data, the system compiles outputs from data using deterministic pipelines. The AI’s role is constrained to specific, verifiable tasks:
- Proposing specifications from researcher intent (e.g., turning “show me how TP53 expression varies across treatment groups” into a figure specification)
- Generating transformation code that processes raw data into analysis-ready formats
- Drafting captions from computed artifacts — but only by referencing specific figure panels, statistics, and data points
- Repairing validation errors when a pipeline step fails
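To make the first of these tasks concrete, here is a minimal sketch of what a proposed figure specification might look like. The schema and field names are hypothetical, invented for illustration: the point is that the model emits a small declarative object, and deterministic plotting code, not the model, renders the figure from it.

```python
from dataclasses import dataclass

@dataclass
class FigureSpec:
    # Hypothetical specification schema. The model proposes this structured
    # object from researcher intent; a deterministic pipeline executes it.
    dataset: str     # input file the plot is computed from
    gene: str        # feature of interest
    groupby: str     # column defining the comparison groups
    plot: str        # plot type to render
    stat_test: str   # statistical test the pipeline will actually run

# "Show me how TP53 expression varies across treatment groups" might become:
spec = FigureSpec(
    dataset="expression_counts.csv",
    gene="TP53",
    groupby="treatment_group",
    plot="boxplot",
    stat_test="welch_t",
)
```

Because the specification is data rather than prose, it can be validated, versioned, and re-executed; the model never touches the numbers.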
The critical constraint is grounded text validation: every sentence in the final output must reference a concrete artifact. “TP53 expression was significantly upregulated in the treatment group (Figure 2A, p = 0.003, Welch’s t-test)” is a valid statement because it points to a specific figure panel and a specific statistical result that can be independently verified.
A sentence like “Our results suggest a role for TP53 in treatment response” would fail validation because it makes a claim without referencing a specific artifact.
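A grounded-text validator of this kind can be surprisingly simple. The sketch below is illustrative only: the reference patterns and the set of known panels are assumptions standing in for a real manifest, but it shows how the two example sentences above would pass and fail.

```python
import re

# Illustrative reference patterns: a figure-panel citation and a p-value.
PANEL_RE = re.compile(r"Figure\s+\d+[A-Z]")
STAT_RE = re.compile(r"p\s*=\s*0?\.\d+")

def validate_caption(caption: str, known_panels: set) -> list:
    """Return validation errors; an empty list means every claim is grounded."""
    errors = []
    # Split on sentence boundaries (period followed by whitespace),
    # so decimals like "p = 0.003" stay intact.
    for sentence in re.split(r"\.\s+", caption.strip()):
        sentence = sentence.strip(". ")
        if not sentence:
            continue
        panels = PANEL_RE.findall(sentence)
        if not panels and not STAT_RE.search(sentence):
            errors.append("ungrounded claim: " + sentence)
        errors.extend("unknown artifact: " + p
                      for p in panels if p not in known_panels)
    return errors

manifest_panels = {"Figure 2A"}
grounded = validate_caption(
    "TP53 expression was significantly upregulated in the treatment group "
    "(Figure 2A, p = 0.003, Welch's t-test).",
    manifest_panels,
)
ungrounded = validate_caption(
    "Our results suggest a role for TP53 in treatment response.",
    manifest_panels,
)
```

The first caption passes because it cites a panel present in the manifest; the second is rejected because it asserts a conclusion with no artifact behind it.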
How evidence-locked workflows prevent hallucination
The mechanism is straightforward:
1. Data enters the system as raw files (CSVs, FASTQ, imaging data).
2. Deterministic scripts transform the data into build artifacts: figures, source data tables, statistics JSON files.
3. Provenance manifests track every input, transformation, and output with cryptographic hashes (SHA-256).
4. Caption generation is constrained to reference only artifacts that exist in the manifest.
5. Validation gates reject any output in which a claim cannot be traced to an artifact.
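A provenance manifest is less exotic than it sounds. The sketch below, using only Python's standard library, records one pipeline step with SHA-256 hashes of every file it reads and writes; the file names, step name, and manifest shape are illustrative assumptions, not any tool's actual schema.

```python
import hashlib
import json
import tempfile
from pathlib import Path

def sha256_file(path: Path) -> str:
    """Hash a file in chunks so large FASTQ/CSV inputs never load into memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest()

def record_step(manifest: dict, step: str, inputs: list, outputs: list) -> None:
    """Append one pipeline step, hashing every file it touched."""
    manifest["steps"].append({
        "step": step,
        "inputs": {p.name: sha256_file(p) for p in inputs},
        "outputs": {p.name: sha256_file(p) for p in outputs},
    })

# Demo with throwaway files standing in for raw data and a computed figure.
workdir = Path(tempfile.mkdtemp())
raw = workdir / "counts.csv"
raw.write_text("gene,ctrl,treated\nTP53,10,42\n")
fig = workdir / "figure_2a.json"
fig.write_text(json.dumps({"panel": "2A", "p_value": 0.003}))

manifest = {"steps": []}
record_step(manifest, "plot_expression", inputs=[raw], outputs=[fig])
```

Because the hashes are recomputed from file contents, rerunning the pipeline on identical inputs yields an identical manifest; any silent change to the data surfaces as a hash mismatch.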
The language model never invents data. It never summarizes results from memory. It operates on concrete artifacts that exist in the file system and can be independently verified.
| Property | Agentic authorship | Evidence-first compilation |
|---|---|---|
| Source of claims | Model’s training data + prompt context | Computed artifacts with provenance |
| Error visibility | Errors look like correct output | Errors trigger validation failures |
| Reproducibility | Depends on prompt, temperature, model version | Deterministic — same inputs produce same outputs |
| Audit trail | Conversation logs (if saved) | Manifest with hashes, git commits, tool versions |
| Reviewer burden | Must verify every claim manually | Can verify by checking manifest against output |
Why this matters for your lab
If you are publishing research, submitting to regulatory bodies, or building evidence packages for clinical decisions, the distinction between these two approaches is not academic. Regulatory submissions require audit trails. Peer reviewers increasingly expect reproducible analyses. Funding agencies — including the NIH — are establishing requirements for computational reproducibility.
An agentic authorship tool can help you write faster. An evidence-first tool can help you write correctly — and prove it.
Speed without provenance is a liability. Every hour saved writing is lost tenfold if a reviewer finds a confabulated statistic.
Choosing the right approach for your work
Not every task requires evidence-first rigor. Drafting a grant application’s specific aims, brainstorming experimental designs, or summarizing literature for internal discussions — these are appropriate uses for agentic AI assistance, where the output will be heavily edited by domain experts before it matters.
But for any output that will be published, submitted, or used to make decisions, evidence-first compilation is not optional. It is the minimum standard for trustworthy AI-assisted science.
The question is not whether AI should be part of scientific workflows. It should. The question is whether the AI’s contribution can be verified — and whether your tools make verification easy or impossible.
Hordago Labs builds evidence-first tools for biological research. Our workflows produce auditable, reproducible outputs with full provenance tracking. Learn about our approach to reproducibility.