In this second installment of our two-part story on the nearly completed six-year cloud transition at the Justice Department’s Bureau of Alcohol, Tobacco, Firearms and Explosives (ATF), Chief Technology Officer Mason McDaniel talks about the funding aspects and mission payoffs of this remarkable journey.
He also offers an enduring set of lessons for other organizations embarking on similar paths of technology transformation, including the need to undertake major course corrections when necessary, and the value of top agency leadership support when things don't go as planned. (Part one of the interview was published on June 28.)
Payoff in Performance
MeriTalk: Let’s talk a little bit more about deployment speeds with applications that are now cloud-based. They seem really impressive.
McDaniel: When Roger and I got there in November 2015, for the most part application updates were not happening. We did not have application maintenance staff to do them.
Then we started getting some budget to rebuild application O&M capabilities, but even then it was just a couple of people per application family. It was relatively minimal. The highest regular cadence we ever reached was monthly deployments to production, and for many applications it was one deployment per quarter or less, so it was extremely slow.
MeriTalk: Versus the performance you are getting now, which can approach a couple dozen deployments a week?
McDaniel: Here’s a great example to contrast the two. We went live with our licensing portfolio on Dec. 23, 2021, which is the worst time of year ever to do a deployment like this, but that’s when we had to do it.
The first business day after go-live, our performance was awful. So we decided to take a 30-minute outage over lunch to double the capacity of our database servers. We took the systems offline, and within 30 minutes we had completely rebuilt the database servers on much larger hardware, restarted the systems, and were up and running on the new systems. I was upset we had to be down for 30 minutes, because we are designed for zero-downtime deployments. But that was our first business day, we were still working out the kinks, and it only took us 30 minutes to fix.
If we had been in our legacy environment, though, and found that we were under capacity on our database servers, that would have been literally a six-month endeavor to procure the hardware, get it shipped and delivered, get it installed, racked, cabled, set up, configured, get the approvals to deploy, and all that.
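McDaniel does not say which cloud services or tooling ATF used for that lunchtime rebuild, but in a managed cloud database service this kind of capacity change can be a single scripted call rather than a hardware procurement. A purely illustrative sketch, assuming an AWS RDS instance managed with boto3 (the instance identifier and instance class below are hypothetical, not ATF's configuration):

```python
"""Illustrative sketch only: resizing a managed database instance in place.

Assumes an AWS RDS instance managed with boto3; the interview does not say
which cloud provider, database service, or tooling ATF actually uses.
"""
import boto3

rds = boto3.client("rds")

DB_ID = "licensing-db"          # hypothetical instance identifier
LARGER_CLASS = "db.r5.4xlarge"  # hypothetical "much larger hardware"

# Request the resize; ApplyImmediately trades a short outage for speed,
# roughly the 30-minute lunchtime window described above.
rds.modify_db_instance(
    DBInstanceIdentifier=DB_ID,
    DBInstanceClass=LARGER_CLASS,
    ApplyImmediately=True,
)

# Block until the instance reports itself available again on the new hardware.
rds.get_waiter("db_instance_available").wait(DBInstanceIdentifier=DB_ID)
print(f"{DB_ID} is back online as {LARGER_CLASS}")
```

The point of the sketch is the contrast McDaniel draws next: in the cloud the "hardware" change is a parameter in a script, while on-prem it is a procurement.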
Funding the Transformation
MeriTalk: What is ATF’s annual tech spend?
McDaniel: About $100 million, give or take.
MeriTalk: How did you approach funding for the cloud move? Did you have to ask for a big bump up, or work within the budget?
McDaniel: First, we did not sell cloud to ATF leadership as a way to save money. It is true that cloud can be cheaper than on-premise systems if it is done right, but we had been underinvesting in IT, to the point of not having meaningful disaster recovery abilities or even redundant hardware. Even with lower costs in the cloud, doing IT right was going to cost us more than we had been spending. So, we were up front about that. It all started with honesty and providing data to back up our reasoning.
I cannot emphasize enough how incredibly supportive ATF’s executive leadership has been throughout. They gathered funding from our IT budget and across other areas of ATF to fund the initial work out of hide. ATF’s Enforcement Programs and Services directorate had the most to benefit from the work, and they played a huge role in helping to secure and provide additional funding to keep us going.
MeriTalk: It’s an overused phrase, but it sounds like you had to build the new airplane while you were flying the old one…
McDaniel: Very much so. When we decided to go live with the licensing portfolio, we knew we would not be finished with even the minimum required functionality. Our executive leadership decided it was more important to get it live in the cloud with the benefits that offered, and they were willing to accept potentially impacting parts of ATF’s missions for a short period while we finished up those functions. So, the metaphor I used was that we took the plane off before we built the nose gear. Our goal was just to get it on the ground with everybody safe.
MeriTalk: Taking it back to the financial question again – if the agency tech budget is remaining about the same, then it sounds like ATF is getting this huge bump in capability for roughly the same dollars it would have cost to keep things on-prem?
McDaniel: It is. Instead of seeing it as updating our technologies to save money, we are updating our processes and technologies so that we get a lot more functionality and mission support out of our existing budget. It is the same amount of money but we are getting so much more out of it from being in the cloud.
Mission Outcomes
MeriTalk: And then to take it to the real bottom line – mission – tell us about the tangible improvement there.
McDaniel: The first major difference only indirectly affects the mission, but it is a major factor in our ability to make improvements there. We can respond to user needs. For years, when our users saw bugs in systems, they stopped even asking for fixes because they knew they were not going to get them. Since 2016, when we started to rebuild application maintenance teams, that improved somewhat, but was still very limited. If we were making modest monthly updates, we were doing pretty well.
Now, with the architecture and automation we put in place, we can make production deployments mid-day with zero downtime. We can deploy small changes continuously, sometimes multiple deployments per day, completely transparent to users. In the first week after go-live, we made 21 production deployments. In the six months since then, we have made production deployments on 91 percent of all workdays, averaging 8 features deployed per day.
That has let us change from a mindset of “keeping the lights on” for production systems to a mindset of continuous improvement. We see these licensing capabilities not as static systems, but as a service that IT provides to our customers and will be continuously improving. From an IT perspective, that will also keep us from digging ourselves into the same technical debt hole that we found ourselves in.
MeriTalk: For the outside observer, can you explain how that improved technology then leads to improving on things central to the agency’s mission, with licensing, or tracing, or other aspects?
McDaniel: The improvements we’ve seen have been profound. Our legacy system could only handle about 100 simultaneous users before it slowed to a standstill and crashed. Last month we saw 4,500 simultaneous users. That’s more than a 4,000 percent increase in user capacity. We have seen a 16 percent increase in the total number of industry-submitted electronic forms (eForms) the ATF has been able to process this year compared to last year.
The difference was even larger for a new eForm we added with our cloud migration, called the Form 4. That form involves highly regulated National Firearms Act weapons, and they had to be submitted on paper before our cloud migration. Since we went live, our pace of processing existing paper forms has increased by over 300 percent, but adding in the new electronic Form 4s, our total Form 4 processing rates have increased by nearly 400 percent compared to last year.
MeriTalk: Turning it back to some of the major Federal IT policy steps over the past year – zero trust policy and others – is it much easier for a Federal agency that is close to moving to all-cloud to start working to implement those?
McDaniel: I’ll say it’s easier if you’re cloud-based and done right. Back to the earlier point – if you just lift and shift the crap you have today into the cloud, you’re going to just have crap in the cloud. That won’t help you respond to new policy.
But by building it out so everything we do is through automated deployments and we can do these quick deployment cycles, yes, that’s the kind of cloud environment (with the automation and governance policies around it) that enables us to respond much more quickly to those kinds of policy things.
That brings up one more critical point – you can build out automation technologies and great cloud-based technologies, but if you don’t update the governance processes to allow you to use them, then it doesn’t help. It doesn’t matter if you can build a virtual machine in five minutes if it takes you two months to get it through your change control process so you can actually use it. We rewrote our IT governance from the ground up around the concepts of continuous integration and continuous delivery (CI/CD) and automated deployment pipelines, so we can make these rapid changes while still tracking our deployments and managing our configurations as we need to.
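McDaniel does not describe ATF's specific governance tooling, but one common way to keep governance in step with automated pipelines is to have the pipeline itself produce the change record at deployment time, rather than routing each release through a separate board. A minimal, hypothetical sketch (the function, field names, and GIT_COMMIT environment variable are illustrative, not ATF's process):

```python
"""Illustrative sketch only: letting the CI/CD pipeline generate the change record.

The function, field names, and GIT_COMMIT environment variable are hypothetical;
the interview describes the governance principle (every automated deployment is
tracked), not ATF's actual tooling.
"""
import json
import os
from datetime import datetime, timezone
from pathlib import Path


def record_deployment(service: str, version: str, approved_by: str) -> Path:
    """Write an auditable change record as the final step of an automated deployment."""
    timestamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    record = {
        "service": service,
        "version": version,
        "approved_by": approved_by,  # approval is captured in the pipeline, not in a separate meeting
        "commit": os.environ.get("GIT_COMMIT", "unknown"),  # set by the CI system in this sketch
        "deployed_at": timestamp,
    }
    path = Path("change-records") / f"{service}-{timestamp}.json"
    path.parent.mkdir(exist_ok=True)
    path.write_text(json.dumps(record, indent=2))
    return path


if __name__ == "__main__":
    print(record_deployment("licensing-portal", "2022.06.1", "release-manager"))
```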
MeriTalk: ATF sounds very far along on its cloud adoption journey. Having gone that far, however, is there anything that you are not going to put into the cloud for any reason? Anything that relates to ATF’s mission, perhaps?
McDaniel: For mission reasons? No. Anything can be made insecure anywhere, but if you do cloud right there is no government data center that can be as secure as a good cloud environment. Security is no longer a limiting factor.
We prefer to have our applications and data in the cloud and well-locked down there. Part of that is because of the ease with which we can update things and the ease with which we can avoid configuration creep. When you are sitting there patching the same systems over and over and over, it’s very difficult to accurately track every single one of those configuration changes to be sure you’re not making mistakes or getting out of sync, which can introduce security vulnerabilities.
Our new approach is that every time we update a system, we deploy a completely new system, test it, switch users onto that and keep the old one just long enough to make sure the new one is working well. Then we delete the old one. Every deployment is automatically built to a known good state, so we always know exactly what our configuration is. That helps us to be more secure.
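The pattern McDaniel describes is often called blue/green or immutable deployment. A minimal sketch of the control flow, with hypothetical helper callables standing in for whatever provisioning, health-check, and traffic-routing tools an organization actually uses:

```python
"""Illustrative sketch only: the rebuild-and-switch rotation described above.

The helper callables are hypothetical stand-ins for whatever infrastructure-as-code,
health-check, and traffic-routing tools an organization actually uses.
"""
from typing import Callable


def rotate_environments(
    build_new_environment: Callable[[], str],   # builds a complete new stack from a known good definition
    is_healthy: Callable[[str], bool],          # smoke tests / health checks against the new stack
    switch_traffic_to: Callable[[str], None],   # repoints users (e.g., a load balancer) to the new stack
    tear_down: Callable[[str], None],           # deletes a stack that is no longer needed
    current_environment: str,
) -> str:
    """Deploy a completely new environment, cut users over, then retire the old one."""
    new_env = build_new_environment()
    if not is_healthy(new_env):
        tear_down(new_env)  # never switch users to a stack that failed its checks; the old one is untouched
        raise RuntimeError("new environment failed health checks")
    switch_traffic_to(new_env)      # zero-downtime cutover
    tear_down(current_environment)  # in practice, kept just long enough to confirm the new stack is stable
    return new_env
```

Because every environment is rebuilt from the same definition rather than patched in place, the configuration drift McDaniel mentions never has a chance to accumulate.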
Cloud Exceptions Declining
MeriTalk: In the really big picture, is there anything for which cloud is not appropriate right now, versus on-prem?
McDaniel: There’s one type of system that we still plan on keeping some sort of local on-prem infrastructure for, and that is high volume scanning where we are putting out a huge volume of image data from physical sources. It could be images or videos or anything that has huge multimedia data volumes. We cannot stream effectively from those high-volume scanner systems directly into our cloud infrastructure. So we are going to have some local caching where it sends the data to an on-premise storage system, and then we’ll sync to the cloud. That is one of the few areas that I would say that’s the case.
Until recently, I would have added high-performance compute (HPC) clusters. We’ve run those to do a lot of our advanced simulations and data analysis, because we could get performance on-prem that we could not get in the cloud. But we’ve done some proofs of concept recently and have found that is no longer the case.
Now we can stand up 2,000 large processors for a particular simulation that runs for 20-30 minutes, compared to hours or days on our HPC clusters, then shut the servers down and not pay for them when they’re not being used. We even found that we could complete extremely complex simulations and analyses in the cloud that would crash our on-premise HPC compute clusters.
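McDaniel does not name the services behind these proofs of concept, but the economics he describes come from treating the compute fleet as ephemeral: create it for the job, run the job, then destroy it. A purely illustrative sketch assuming AWS EC2 via boto3 (the image ID, instance type, and fleet size are hypothetical):

```python
"""Illustrative sketch only: an ephemeral simulation fleet that exists just long
enough to run one job. Assumes AWS EC2 via boto3; the interview does not name
ATF's cloud provider, instance types, or job tooling.
"""
import boto3

ec2 = boto3.client("ec2")

# Launch the fleet only when a simulation is queued (all sizes are hypothetical).
resp = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # hypothetical image pre-loaded with the simulation software
    InstanceType="c5.24xlarge",       # hypothetical compute-optimized size (96 vCPUs each)
    MinCount=20,
    MaxCount=20,                      # 20 x 96 vCPUs is roughly the "2,000 large processors" scale
)
instance_ids = [i["InstanceId"] for i in resp["Instances"]]

# ... run the 20-30 minute simulation across the fleet ...

# Shut everything down as soon as the job finishes, so nothing is billed while idle.
ec2.terminate_instances(InstanceIds=instance_ids)
```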
Course Correcting, and Communications
MeriTalk: Lastly, any advice to share on how other agencies can avoid any pitfalls that ATF discovered along the way?
McDaniel: We learned a lot of lessons the hard way. I’m passionate about wanting to talk about our mistakes and lessons learned, to help others avoid them. At least make your own new mistakes. Don’t repeat ours. About a year of our migration time was thrown away with us going down the wrong path. One of the fundamental mistakes we made was assuming that our legacy systems actually worked the way we thought they did.
Back to my comments about paying off our technical debt first, and building existing functionality while redesigning the architectures around automation, so we would be able to rapidly evolve our business processes afterward. That general approach was fantastic and made our success possible, but how we went about it at first was a disaster. Since we planned to just rebuild our existing functionality and business processes on a modern framework, we thought we could use our legacy source code as the map for what our existing systems did. We would rewrite that code using modern languages, architectures, and methods. Effectively, the legacy source code was our set of requirements.
But the quality of that code was so much lower than we had anticipated. In one of our systems, 80 percent of the source code didn't actually work. But when we handed it to our development teams, we did not know that. Moreover, we couldn't tell them which 20 percent worked and which didn't.
So trying to use our existing source code or existing systems as a roadmap utterly failed because the quality wasn’t good enough to be reliable. That led us to the second lesson, which is the most embarrassing for me personally. I knew that one of the core agile tenets is end-user stakeholder involvement. I still made the mistake of thinking that we had all the source code, so we knew what the systems did – we did not have to take up the valuable time of our mission experts beyond some scattered demos and questions. We did most of the development and testing internally within our IT group before bringing it back to the users.
That was a disaster. And as long as I had been a proponent of agile development, I knew better. You may be going through agile rituals and processes, but if you develop something for a year before showing the results to your customers like we did, that is still waterfall. That is not how you are going to succeed. I had rationalized that it was different because we had the source code as a roadmap, but I was wrong. Developing without very close end user engagement is what led to that throw-away year. Do not do that.
After that setback, we went back to Agile 101. Requirements had to come from our subject matter experts. We needed true Product Owners. That was a hard request, because the subject matter experts you want to lead this kind of development are the ones who are crucial to your day-to-day mission operations. Again, though, the support we received from ATF executive leadership was fantastic. We made the request, and they supported us. They committed five or six full-time subject matter experts to work with our developers every day on just these modernization efforts. And they truly picked their best.
That was what turned it around – having those experts in the development meetings verifying that every single user story in our backlog was written accurately with the correct acceptance criteria. Having them in there every single day, writing or verifying user stories, testing the new functionality as it was being written to make sure it worked the way they needed it to work. Giving real-time feedback when things did not work right instead of waiting for some scheduled testing phase.
The mission experts were what turned it around. There is no way we could have succeeded without them being in there with our developers day in and day out.
MeriTalk: In a long project with some ups and downs, what was the view from the top of the agency?
McDaniel: Maintaining executive support was crucial, especially since our dates kept slipping. By any traditional measure this program would have been considered a failure. We missed every date and budget goal that we had, and we went live without the required functionality working.
But when I said that to ATF leadership, our Acting Director and the executive leadership of our customer organizations emphasized that they all consider this an unqualified success. There was one reason for that. At every point along the way, through the ups and downs, they made every decision with full knowledge of reality.
Roger and I were transparent at every step. As soon as we had concerns, we briefed ATF leadership. When we discovered that the source code was not reliable, and that we had been going down the wrong path, we laid it out to them. As soon as we discovered any significant issue, we briefed it up to our Acting Director, along with how we got there, how we identified the issue, and what we were doing to correct course. At each point we clearly described what was known, and did our best to analyze and quantify potential impacts of the many unknowns that remained.
Because of that transparency, they never got blindsided by something. They learned as we learned. We shared both our successes and our failures. Years ago, at the FBI, I was in the room with FBI Director Robert Mueller when he said, “Decisions are only as good as the information on which they are based.” I’ve never forgotten that. I am a technologist, but ultimately my job was to give ATF leadership the best information I could, so they could make informed decisions about whether to continue supporting our work. We maintained their trust. That was crucial. They remained behind us through the tough times, and we all succeeded together.