The Department of Commerce’s National Institute of Standards and Technology (NIST) launched a new pilot program Tuesday that aims to assess the societal risks and impacts of AI systems.

The initial evaluation of Assessing Risks and Impacts of AI (ARIA 0.1) will focus on risks and impacts associated with generative AI – specifically, large language models (LLMs).

ARIA’s evaluation period will take place this summer and fall and will support three evaluation levels: model testing to confirm claimed capabilities; red teaming to stress test applications; and field testing to investigate how people engage with AI in regular use.

“In order to fully understand the impacts AI is having and will have on our society, we need to test how AI functions in realistic scenarios – and that’s exactly what we’re doing with this program,” Commerce Secretary Gina Raimondo said in a May 28 statement. “With the ARIA program, and other efforts to support Commerce’s responsibilities under President Biden’s Executive Order on AI, NIST and the U.S. AI Safety Institute are pulling every lever when it comes to mitigating the risks and maximizing the benefits of AI.”

ARIA will address gaps in societal impact assessments by expanding the scope of study to include people, and how they adapt to AI technology in “quasi-real world conditions,” NIST wrote in the ARIA pilot plan. “Current approaches do not adequately cover AI’s systemic impacts or consider how people interact with AI technology and act upon AI generated information. This isolation from real world contexts makes it difficult to anticipate and estimate real world failures.”

ARIA will provide insights about the applicability of testing approaches for evaluating specific risks, and the effectiveness of AI guardrails and risk mitigations, the agency said.

“Measuring impacts is about more than how well a model functions in a laboratory setting,” said Reva Schwartz, NIST IT Lab’s ARIA program lead. “ARIA will consider AI beyond the model and assess systems in context, including what happens when people interact with AI technology in realistic settings under regular use. This gives a broader, more holistic view of the net effects of these technologies.”

ARIA 0.1 to Test Three Separate Scenarios on LLMs

The first iteration of the program will pilot three separate scenarios to “enable more comprehensive evaluation of generative AI risks.”

The first scenario, “TV Spoilers,” will test controlled access to privileged information. Developers will build LLMs that demonstrate TV series expertise and do not disclose follow-on episode or season content. Safe LLMs will safeguard privileged information and not impede normal information flow, NIST said.

The second scenario, “Meal Planner,” will test an LLM’s ability to personalize content for different populations. Developers will build LLMs that synthesize and tailor content to different audiences with specific diets, food preference, or sensitivities. Safe LLMs will meet and support audience member requirements.

The final scenario, “Pathfinder,” will test an LLM’s capability to synthesize factual content. Developers will build LLMs that synthesize factual geographic, landmark, and related locale information into travel recommendations. Safe LLMs will synthesize realistic and factual information about events, cultural landmarks, hotel stays, and distances between geographic places, among other things.

Interested participants will soon have the opportunity to apply to participate in the three scenarios for ARIA 0.1.

NIST noted that future iterations of ARIA may consider other types of generative AI technologies such as text-to-image models, or other forms of AI such as recommender systems or decision support tools.

The program will result in guidelines, tools, methodologies, and metrics that organizations can use for evaluating the safety of their systems as part of their governance and decision-making processes to design, develop, release, or use AI technology.

NIST also said that ARIA will inform the work of its AI Safety Institute to build the foundation for safe, secure, and trustworthy AI systems.

Read More About
Recent
More Topics
About
Cate Burgan
Cate Burgan
Cate Burgan is a MeriTalk Senior Technology Reporter covering the intersection of government and technology.
Tags