Date: 2025-01-06

The Department of Defense (DoD) Chief Digital and Artificial Intelligence Office (CDAO) has concluded a red teaming pilot that identified more than 800 potential vulnerabilities and biases in the use of large language models (LLMs) to enhance military medical services.

The Crowdsourced AI Red-Teaming (CAIRT) Assurance Program pilot focused on the use of LLM chatbots in the context of military medicine. According to the Jan. 2 announcement, the CAIRT program supports the DoD in generating grassroots, crowdsourced approaches to AI Assurance and AI Risk Mitigation.

The red teaming pilot aimed to identify potential system vulnerabilities and weaknesses in the use of emerging tools for clinical note summarization and a medical advisory chatbot.

The pilot, conducted by Humane Intelligence in collaboration with the Program Executive Office, Defense Healthcare Management Systems, and the Defense Health Agency (DHA), involved over 200 participants, including clinicians and analysts from DHA, Uniformed Services University, and the Services. The exercise compared three popular LLMs.

According to the department, the exercise uncovered “over 800 findings of potential vulnerabilities and biases related to employing these capabilities in these prospective use cases.”

“Since applying GenAI for such purposes within the DoD is in earlier stages of piloting and experimentation, this program acts as an essential pathfinder for generating a mass of testing data, surfacing areas for consideration, and validating mitigation options that will shape future research, development, and assurance of GenAI systems that may be deployed in the future,” said Matthew Johnson, CDAO’s lead for this initiative.

The department further explained that this exercise will result in “repeatable and scalable output” with the development of benchmark datasets to evaluate future vendors and tools for “alignment with performance expectations.”

Additionally, these results will help shape future DoD policies and best practices for responsible use of Generative AI (GenAI), ultimately improving military medical care.