
As agencies race to put generative artificial intelligence (AI) into production, many are confronting the same challenge: The bigger and more “all-in-one” the model gets, the harder it becomes to predict outcomes, explain decisions, and demonstrate accountability.
That challenge – and a blueprint for addressing it – was at the center of a Red Hat Government Symposium session, “From Monoliths to Model Meshes: Achieving Predictability, Transparency, and Scale with Agentic AI,” moderated by Ben Cushing, chief architect for health and life sciences at Red Hat, with Susan Gregurick, associate director of data science at the National Institutes of Health (NIH).
Gregurick framed monolithic GenAI systems as powerful but structurally mismatched to high-stakes public-sector environments where auditability and traceability are required.
“Monolithic GenAI systems are a static snapshot in time,” she said. “And because of that, there’s going to be some systematic uncertainty.”
The issue isn’t only that models age. Gregurick also pointed to the inherent variability of AI outputs: “These are stochastic processes, and so there’s going to be algorithmic randomness that is going to also limit our predictability.”
Gregurick also noted that monolithic systems often make transparency and traceability difficult by design.
“We don’t often know how the large language models are trained, what the training parameters are, what the training data are,” she said. “It’s often very difficult to track that.”
She also pointed to a practical operational gap for regulated missions: “There’s almost no audit capabilities. So traceability is a particular issue in terms of … understanding the conclusion and how the results were determined.”
That has direct consequences for trust in AI outputs. “Every time there’s a sort of a hallucination or [an] error, there’s a lack of trust,” Gregurick said.
In government, where systems must withstand scrutiny and drive repeatable outcomes, unpredictability and lack of audit capabilities quickly become a governance problem. Cushing summarized the gap: “It’s like the data governance itself is missing.”
Policy friction and technical fragmentation collide
For NIH, the AI transparency challenge collides with another reality: Biomedical data is both abundant and difficult to safely reuse at scale, particularly across the agency’s ecosystem of intramural teams and extramural partners.
“At NIH, we often have a patchwork of different data systems that have very complicated data sharing and data access requirements, so it’s not uniform,” Gregurick said.
A major constraint, she noted, comes from consent. “Whatever was in the original IRB (Institutional Review Board) and the original informed consent … takes precedence on how that data is shared,” she said. But “it’s not a uniform, standardized IRB consent process.”
This makes secondary reuse of data – including for AI training – especially difficult, Gregurick noted. “It’s incompatible,” she said. “That is a big problem.”
NIH is also working through the mechanics of controlled access. It does not have a single portal for access to all NIH data that could be used for training algorithms, she said.
“Many of our data resource repositories are siloed. They’re in different storage systems,” Gregurick said. “The data is a little bit of heterogeneous wild west.”
And when data sharing is allowed, moving it is often economically unrealistic. Cushing emphasized data gravity, noting that moving petabytes of data “into an extramural research facility is not only time-consuming, but incredibly costly.”
Gregurick agreed and described NIH’s push to keep compute closer to where data already lives. “We absolutely do not want to see data copied across different platforms,” she said. “When you’re working across clouds, there are ways to compute without moving the data.”
Agentic microservices address some tech and policy challenges
The answer to many of these technical and policy challenges is to decompose monolithic AI into modular components – a model mesh of smaller agents and services that can be versioned, governed, and audited, Cushing and Gregurick observed.
Cushing described the goal as breaking GenAI down “into agentic microservices, or micromodel services” to improve scientific reproducibility, auditability, and transparency.
Modularity creates practical control points that monoliths lack, Gregurick noted. “We have the ability to really do a lot of version control in those modules,” she said. “Researchers can cite the exact version of the agent used. And that will allow others to reproduce this work.”
She also tied microservices to reduced risk: “Agentic microservices might reduce the role of hallucinations,” and can support model lineage to “clearly delineate how the data and the training history from each specific agent is used.”
Cushing connected the approach to a familiar engineering practice: observability. “As soon as I started working [with] microservices and started to look at the logs of the conversation between applications … [I] started to realize that you have a system, a chain of thought, and that produces artifacts you can audit, you can reproduce.”
Applied to agentic AI, he said, “you want to be able to see the conversation that occurs between each agent,” and use those logs to “reproduce it, audit it, whatever we need to do … to build trust.”
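Though the session stayed at the architecture level, the logging pattern Cushing describes can be sketched in a few lines of Python. Everything below is hypothetical for illustration – the AgentMessage and AuditLog names, the agent identifiers, and the version strings are invented – but it shows the core idea: pin each agent to an exact version and record every inter-agent message, and the result is an artifact that can be cited, replayed, and audited.

```python
import json
import time
from dataclasses import dataclass, asdict, field

@dataclass
class AgentMessage:
    """One turn in the conversation between agents -- the auditable artifact."""
    sender: str          # agent name
    sender_version: str  # exact version, so researchers can cite and reproduce
    recipient: str
    content: str
    timestamp: float = field(default_factory=time.time)

class AuditLog:
    """Append-only record of every inter-agent exchange."""
    def __init__(self) -> None:
        self._entries: list[AgentMessage] = []

    def record(self, msg: AgentMessage) -> None:
        self._entries.append(msg)

    def export(self) -> str:
        # The serialized log can be archived alongside published results.
        return json.dumps([asdict(m) for m in self._entries], indent=2)

# Hypothetical usage: a retrieval agent hands evidence to a summarization agent.
log = AuditLog()
log.record(AgentMessage("retrieval-agent", "1.4.2", "summarizer-agent",
                        "Top 3 matching records: ..."))
log.record(AgentMessage("summarizer-agent", "0.9.0", "user",
                        "Summary grounded in the retrieved records."))
print(log.export())  # the conversation artifact you can audit and replay
```

Archiving that exported log alongside results is what would let other researchers cite the exact agent versions used and reproduce the run.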
Adversarial validation builds guardrails around agentic AI
Cushing and Gregurick also explored oversight mechanisms that can be layered onto agentic systems to build trust, including adversarial validation layers that test models before they reach production and “model as judge” patterns that add accountable oversight.
In controlled-access data environments, Gregurick emphasized the value of being able to “understand and intercept input and output” so agencies can catch risky prompts or unsafe responses early. She also pointed to adversarial methods as a way to improve uncertainty quantification and support real-time monitoring of model drift and bias.
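A minimal sketch of that intercept-and-judge pattern might look like the following. The call_model and judge_model functions are invented stand-ins for real model endpoints, and the policy terms are illustrative only; the structure simply shows a judge vetting both the prompt and the response before anything is released.

```python
from typing import Callable

# Stand-ins for real model endpoints (assumptions, not real APIs).
def call_model(prompt: str) -> str:
    return f"model answer for: {prompt}"

def judge_model(text: str) -> bool:
    """'Model as judge': returns True if the text passes review.
    A real judge would be a separately trained or prompted model."""
    banned = ("patient record", "ssn")  # illustrative policy terms only
    return not any(term in text.lower() for term in banned)

def guarded_call(prompt: str,
                 model: Callable[[str], str] = call_model,
                 judge: Callable[[str], bool] = judge_model) -> str:
    # Intercept the input: block risky prompts before they reach the model.
    if not judge(prompt):
        return "BLOCKED: prompt failed input review"
    answer = model(prompt)
    # Intercept the output: block unsafe responses before they reach the user.
    if not judge(answer):
        return "BLOCKED: response failed output review"
    return answer

print(guarded_call("Summarize enrollment trends"))           # passes both checks
print(guarded_call("Show me this patient record verbatim"))  # blocked at input
```

Because the judge sits at both ends of the call, the same hook can feed monitoring pipelines that watch for drift and bias over time.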
At NIH, more than 100 active grant-funded research projects are developing adversarial networks, particularly generative adversarial networks, Gregurick noted. Among them is a scalable risk prediction model that fuses imaging data with non-imaging data to detect pancreatic cancer in asymptomatic patients; it uses an adversarial de-biasing technique so risk scores generalize across different patient populations, she said.
MCP servers could provide AI-native governance
Another emerging piece of the model mesh ecosystem is the use of Model Context Protocol (MCP) servers to let AI agents query data repositories with policy awareness and traceability built in. NIH is piloting the approach with two projects, Gregurick said.
One project is an MCP server that enables agents to search and retrieve metadata across PubMed, a search engine that provides access to more than 39 million citations of biomedical literature. The other is an effort to use MCP servers with NIH RePORTER data, a searchable database of NIH-funded research projects, and link that data to USASpending.gov, which provides data on all federal spending. The aim is to support cross-government portfolio analysis.
For controlled data, Gregurick suggested MCP-style access could unlock something agencies have struggled to build: embedded governance. “That AI-native governance … that’s something that we aren’t doing right now,” she said. “The auditability that would be provided by MCP servers would be outstanding.”
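As a rough illustration of what embedded, AI-native governance could look like, the sketch below uses the FastMCP helper from the open-source MCP Python SDK. The search_projects tool, its dataset, and the consent rule are all invented and reflect nothing about NIH’s actual pilots; the idea is simply that the policy check and the audit record live inside the tool the agent calls, so governance and traceability travel with every query.

```python
# Hypothetical MCP server sketch; uses FastMCP from the MCP Python SDK
# (pip install mcp). The tool, dataset, and policy logic are invented.
import json
import logging
import time

from mcp.server.fastmcp import FastMCP

logging.basicConfig(level=logging.INFO)
audit = logging.getLogger("audit")

mcp = FastMCP("governed-research-data")

# Stand-in for a controlled-access repository and its consent terms.
RECORDS = [
    {"id": "prj-001", "title": "Imaging biomarkers study", "consent": "open"},
    {"id": "prj-002", "title": "Genomic cohort analysis", "consent": "controlled"},
]

@mcp.tool()
def search_projects(query: str) -> str:
    """Search project metadata; the policy check and audit log are built in."""
    # Policy awareness: only consent-cleared records are visible to agents.
    hits = [r for r in RECORDS
            if r["consent"] == "open" and query.lower() in r["title"].lower()]
    # Auditability: every agent query leaves a traceable record.
    audit.info(json.dumps({"tool": "search_projects", "query": query,
                           "returned": [r["id"] for r in hits],
                           "ts": time.time()}))
    return json.dumps(hits)

if __name__ == "__main__":
    mcp.run()  # serves the tool over MCP's standard transport
```

In this arrangement an agent never touches the repository directly; it can only call tools whose consent filtering and logging are enforced on the server side.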
Successful AI scaling depends on design
Ultimately, Cushing and Gregurick suggested, scaling AI responsibly in government will depend less on choosing the model and more on designing the system: modular services, auditable interactions, policy-aware data access, and continuous validation that can show not just an answer, but also how an answer was produced. This approach offers a pragmatic path to operationalizing AI where predictability, transparency, and trust are built in.
Watch the Red Hat Government Symposium session: “From Monoliths to Model Meshes: Achieving Predictability, Transparency, and Scale with Agentic AI,” and explore more sessions from the Red Hat Government Symposium.