Chatbots powered by large language models (LLMs) seem to be everywhere, from customer service to coding assistance. But how do we know if they’re safe to use?
MLCommons, a non-profit focused on artificial intelligence benchmarks, believes it has an answer. On 4 December, it released the first iteration of AILuminate, a trust and safety benchmark built to gauge the performance of cutting-edge LLMs. While machine learning researchers have used varying metrics to judge AI safety for years, AILuminate is the first third-party LLM benchmark developed as a collaboration between industry experts and AI researchers.
The benchmark measures safety in the context of potential harm to users. It tests LLMs with prompts users might send to a chatbot and judges the response by whether it could support the user in harming themselves or others, a problem that became all too real in 2024. (And according to a report released last week, leading AI companies received failing grades for their risk assessment and safety procedures.)
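In outline, that kind of evaluation pairs the model under test with a judge that scores each response. The Python sketch below illustrates the general pattern only; the `model` and `evaluator` objects, their `generate` and `classify` methods, and the scoring details are hypothetical stand-ins, not AILuminate's actual harness.

```python
# Minimal sketch of a prompt-and-judge safety evaluation loop.
# The model/evaluator interfaces here are hypothetical stand-ins,
# not MLCommons' actual implementation.
from dataclasses import dataclass

@dataclass
class TestResult:
    prompt: str
    response: str
    is_unsafe: bool  # did the judge flag the response as potentially harmful?

def evaluate_model(model, evaluator, prompts):
    """Send each test prompt to the model under test, then ask the
    evaluator whether the response could support harm."""
    results = []
    for prompt in prompts:
        response = model.generate(prompt)          # system under test
        is_unsafe = evaluator.classify(response)   # safety judgment
        results.append(TestResult(prompt, response, is_unsafe))
    return results

def violation_rate(results):
    """Fraction of responses judged unsafe -- the kind of aggregate
    number a letter-grade-style safety rating could be derived from."""
    return sum(r.is_unsafe for r in results) / len(results)
```

A real harness would typically also track which hazard category each prompt probes, so results can be broken down per hazard rather than reported as a single number.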
“AI is at a state where it produces lots of exciting research, and some scary headlines,” says Peter Mattson, president of MLCommons. “People are trying to get to a new state where AI delivers a lot of value through products and services, but they need very high reliability and very low risk. That requires we learn to measure safety.”
A Big Swing at a Hard Problem
In April 2024, IEEE Spectrum published a letter from the MLCommons AI Safety Working Group. It laid out the goals of the group, which formed in 2023, and was published in tandem with an early version of the “AI Safety Benchmark,” now called AILuminate. The AI Safety Working Group’s contributors include representatives from many of the largest AI companies, including Nvidia, OpenAI, and Anthropic.
In practice, it’s difficult to determine what it means for a chatbot to be safe, as opinions on what makes for an inappropriate or dangerous response can vary. Because of that, the safety benchmarks currently released alongside LLMs typically cite internally developed tests that make their own judgments on what qualifies as dangerous. The lack of an industry-standard benchmark in turn makes it difficult to know which model truly performs better.
“Benchmarks push research and the state of the art forward,” says Henriette Cramer, co-founder of AI risk management company Papermoon.ai. While Cramer says benchmarks are useful, she cautions that AI safety benchmarks are notoriously difficult to get right. “You need to understand what is being measured by each benchmark, what isn’t, and when they are appropriate to use.”
How AILuminate Works
AILuminate’s attempt to create an industry-standard benchmark begins by dividing hazards into 12 types across three categories: physical (such as violent and sexual crimes), non-physical (such as fraud or hate speech), and contextual (such as adult content).
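To make that structure concrete, the snippet below sketches the taxonomy as a simple mapping. The hazard names shown are only the examples mentioned above, not MLCommons’ full official list of 12 hazard types.

```python
# Illustrative sketch of AILuminate's three hazard categories.
# Hazard names are the examples cited in the article, not the
# complete official list of 12 hazard types.
HAZARD_TAXONOMY = {
    "physical": ["violent_crimes", "sex_related_crimes"],
    "non_physical": ["fraud", "hate_speech"],
    "contextual": ["adult_content"],
}

def category_of(hazard: str) -> str:
    """Return which of the three categories a hazard type falls under."""
    for category, hazards in HAZARD_TAXONOMY.items():
        if hazard in hazards:
            return category
    raise KeyError(f"Unknown hazard type: {hazard}")
```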
The benchmark then judges an LLM by testing it with 12,000 custom,…