Hot take: I trust state-of-the-art agentic AI more than an average scientist.

However, science is often not about whom we trust but rather why we trust them.

That’s why in my academic research, I’ve been focusing on keeping AI use reproducible where possible, or at least auditable where not. Things get mind-bending when some combination of the two is in play and neither fits well conceptually. Notably, this is exactly the case with AI coding assistance and writing assistance – both hot tasks in academia. For that mixed scenario, I show that a (dialectical) realist interpretation helps build trust in AI without breaking the narrative.

Reproducible AI

Reproducibility is possible when we use AI (large language models or agentic products) as part of an engineering solution, that is, when AI does a particular thing, we know how well it does it, and we can scale the reliability (with thanks to Rinat Abdullin and his community for sharing this talk by Sayash Kapoor). This use of AI in academic research aligns with the so-called quantitative paradigm, most commonly associated with postpositivism yet not limited to it.

A concrete example of this is my collaboration with Martin Ho et al.: Conceptual framework for using large language models for title and abstract screening in evidence syntheses, accepted as an oral presentation at the AI for Public Health (AI4PH) symposium held on July 20–21, 2026, in Montréal. In this work, we illustrate the application of reliability engineering principles to the (semi-)automated selection of relevant academic papers for inclusion in systematic or scoping reviews in pharmacoepidemiology. We’ll publish this when it is ready.
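To make “scale the reliability” a bit more tangible, here is a minimal sketch in Python. It is not taken from the framework above. It only illustrates one textbook reliability engineering idea, redundancy with majority voting, under the strong and hypothetical assumption that repeated AI screening decisions are independent and each is correct with the same probability p. The function name and the value p = 0.9 are made up for illustration.

```python
from math import comb


def majority_vote_accuracy(p: float, n: int) -> float:
    """Probability that a majority of n independent votes is correct,
    when each single vote is correct with probability p.

    Assumption (hypothetical): votes are independent and identically
    reliable, which repeated LLM calls may well not be in practice.
    """
    if n % 2 == 0:
        raise ValueError("use an odd number of votes to avoid ties")
    k_min = n // 2 + 1  # smallest number of correct votes that forms a majority
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))


# Illustrative only: how accuracy would grow with redundancy if p = 0.9.
for n in (1, 3, 5):
    print(n, round(majority_vote_accuracy(0.9, n), 4))
```

Under these assumptions, three votes lift a 0.9 single-call accuracy to 0.972. In real use the independence assumption would itself need to be checked empirically, since errors of the same model on the same abstract tend to be correlated.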

Auditable AI

Reproducibility is not feasible – or necessary – when AI is used, as it increasingly is, to satisfy exploratory information needs, as seen from within the qualitative paradigm. In that case, recording an audit trail is one of the few things that can be done to establish trustworthiness.

An example of this is my recent preprint, Performance is necessary but not sufficient to make sense of large language models in evidence synthesis: a qualitative methodological study (doi:10.5281/zenodo.20250840), which I submitted as a course assignment for my comprehensive doctoral exam. There I used agentic AI to track down the most appropriate qualitative data analysis method for my study of how people made sense of systematic review automation over time. I reported full transcripts of all the AI interactions in an open data supplement (cross-pinned on the InterPlanetary File System for data redundancy and integrity).

Realist AI

In practice, we people – transcendental mischief-makers – often use both quantitative and qualitative approaches without stopping to ask which is which. The holistic duality of this lies in the very fact that both approaches exist and both are being used (not infrequently by the same people or within the same project).

As a dialectical realist, I cannot be bothered too much with drawing the line between the reproducibility and auditability approaches above, except where it helps for didactic and sensemaking purposes. In practice, I go ahead and use both as applicable:

✅ AI process repro

When possible and when feasible, I try to keep things fully reproducible, like in Martin’s framework above.

In most cases, though, keeping the AI process itself reproducible is an operational pain due to the complexities of LLM inference, such as its fragility across machines, as well as the additional complexities that come with agentic environments, as illustrated in Sayash Kapoor’s talk above.

❌ AI process repro → ✅ AI artifact repro

Even when we’re struggling to capture the dynamic process, we still like to scrutinize its static outputs. This comes across loud and clear when I report lockfiles, regression test suites, and fixtures in my open source code repositories. These keep my software code – albeit largely AI-generated – runnable across machines and, through the regression testing, offer some reassurance that it works as intended.

An example of such “static scrutiny” is the Draw RDF plugin for the popular online diagramming software Draw.io that we developed together with the GLAM Incubator and a team at the Archives of Ontario – the work we presented at the 50th Anniversary Conference of the Association of Canadian Archivists on June 10, 2025, in Ottawa. This is a multi-layer software package written in two programming languages (Python and TypeScript) and executed directly in a web browser (i.e., under WebAssembly). In its version 0.1.0, I’d say it is rather rough and not something I’d push into production, but it has its use case and was tested on over forty manually crafted inputs assembled with the great team’s support.

While neither I nor anyone else could possibly reproduce the AI coding process there, the results of this process are publicly auditable and subject to adversarial tests. Because process-level reproducibility was genuinely impractical on this project, as I was doing it part-time and running short on time, the auditable code and tests seem like an acceptable compromise.

Another example comes from the Research Integrity Project – Exploring diversity in Clarivate’s Highly Cited Researchers list, presented by Andrea Tricco on May 5, 2026, at the 9th World Conference on Research Integrity in Vancouver. I’ve been developing a reproducible data pipeline for this project, and the code is public.

Likewise, lockfiles, tests, the repository commit history, and pertinent handwritten documentation serve as an auditable trail to build trust in the pipeline and, by extension, the data we consume from it.
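The idea of fixture-based regression testing mentioned above can be sketched in a few lines of Python. This is a toy illustration, not code from either repository: the function `normalize_label`, the fixture file names, and the JSON layout are all hypothetical. The point is only that each manually crafted input is stored next to its expected output, so that anyone can rerun the comparison regardless of how the code was written.

```python
import json
import tempfile
from pathlib import Path


def normalize_label(text: str) -> str:
    """Hypothetical function under test: collapse whitespace and lowercase."""
    return " ".join(text.split()).lower()


def run_regression(fixture_dir: Path) -> list:
    """Return the names of fixtures whose actual output differs from the
    expected output stored alongside the input."""
    failures = []
    for path in sorted(fixture_dir.glob("*.json")):
        case = json.loads(path.read_text(encoding="utf-8"))
        if normalize_label(case["input"]) != case["expected"]:
            failures.append(path.name)
    return failures


# Demo with two made-up fixtures written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    fixture_dir = Path(tmp)
    fixtures = {
        "case_01.json": {"input": "  Archives  of Ontario ", "expected": "archives of ontario"},
        "case_02.json": {"input": "Draw\tRDF", "expected": "draw rdf"},
    }
    for name, case in fixtures.items():
        (fixture_dir / name).write_text(json.dumps(case), encoding="utf-8")
    failures = run_regression(fixture_dir)

print(failures)  # an empty list means every fixture still passes
```

In a real repository the fixtures would live under version control, and a lockfile would pin the dependency versions the tests run against, so a passing run on one machine is more likely to carry over to another.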

❌ AI process repro → ❌ AI artifact repro → ✅ AI audit trail

Let us consider the exploratory information needs use case from above, but within a realist paradigm. Notably, not only is the AI process not reproducible, but the outputs are also single-use artifacts of sorts (i.e., the AI’s answers and leads in response to the user’s query), as opposed to the reusable artifacts above (i.e., software code or data). Because this, again, is a good rationale for not offering reproducibility further upstream, a pivot is warranted. The pivot is the audit trail.

Examples of that, in addition to the chat transcripts in my comprehensive exam paper above, include the agentic rollouts, chats, and reports that document the AI coding process on both the Draw RDF plugin and the research integrity project. The aim is simply to document the entire process as fully as possible so that interested parties can at least see what exact model or tool was used and when, what went in, how the context built up, and what was finally output – and compare that with the final deliverable.
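For readers who think in code, here is a minimal sketch of what one such audit trail could look like in Python. It is not the format I actually use: the field names, the model name `hypothetical-model-v1`, and the hash chaining are assumptions added for illustration. Each entry records which model was used and when, what went in, and what came out, and it also stores the hash of the previous entry so that later edits to the record become detectable.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Optional

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(body: dict) -> str:
    """SHA-256 over a canonical (sorted-key) JSON encoding of the entry body."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def make_entry(prev_hash: str, model: str, prompt: str, output: str,
               timestamp: Optional[str] = None) -> dict:
    """Build one audit entry linked to the previous one by its hash."""
    body = {
        "timestamp": timestamp or datetime.now(timezone.utc).isoformat(),
        "model": model,          # what exact model or tool was used
        "prompt": prompt,        # what went in
        "output": output,        # what came out
        "prev_hash": prev_hash,  # link to the preceding entry
    }
    return {**body, "hash": _digest(body)}


def verify(trail: list) -> bool:
    """Check that no entry was altered and that the chain is unbroken."""
    prev = GENESIS
    for entry in trail:
        body = {k: v for k, v in entry.items() if k != "hash"}
        if entry["prev_hash"] != prev or entry["hash"] != _digest(body):
            return False
        prev = entry["hash"]
    return True


# Demo with made-up interactions.
trail = []
prev = GENESIS
for prompt, output in [("first question", "first answer"), ("follow-up", "second answer")]:
    entry = make_entry(prev, "hypothetical-model-v1", prompt, output)
    trail.append(entry)
    prev = entry["hash"]

print(verify(trail))  # True
trail[0]["output"] = "edited after the fact"
print(verify(trail))  # False
```

A chain like this does not make the AI process reproducible. It only makes the record of that process harder to alter quietly, which is the same kind of integrity guarantee that content-addressed storage aims for.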

This, by the way, partially resolves concerns around “what was AI-generated” vs. what the human author contributed as long as we’re talking about academic research.

“No epistemic respect for bullshit machines or LLMs”?*

If none of this is going to earn your trust, then tell me what is!


Written by Pavel Zhelnov on May 25, 2026. Last revised May 25, 2026.

* A paper by Moti Mizrahi (2025).