Dear Commons Community,
Recent advances in artificial intelligence (AI) can put young math prodigies to shame. Large language models (LLMs) such as OpenAI’s ChatGPT are now acing nearly every math test they encounter. And yet AI has hardly touched frontier research in math, an indication that its test-taking prowess does not reflect real mathematical skill, as Science reported this morning.
In a preprint posted last month, a tech research institute called Epoch AI rounded up 60 expert mathematicians to raise the bar with the most challenging math test they could muster. Leading models correctly answered fewer than 2% of the questions, showing just how far they are from disrupting the field. Still, experts think AI models will catch up to the new benchmark sooner or later—whether mathematicians like it or not.
“The dent that [AI] has made in the math community is small, but people can see there’s potential,” says Kevin Buzzard, a mathematician at Imperial College London. “If you have a system that can ace that database, then it’s game over for mathematicians.”
Trained on immense amounts of human-generated text from the internet and other sources, LLMs identify patterns to predict the most likely sequence of words, numbers, or symbols in response to prompts. That allows them to answer questions, craft stories—or solve math problems. Lately, in the arena of math, leading models have jumped impressively high hurdles. OpenAI’s o1 model, released in September, can now score above 90% on most previous AI math benchmarks. And in July, a math-focused AI model from Google DeepMind performed at a silver medal standard on problems from the International Mathematical Olympiad, the world’s premier competition for high school math.
But experts caution that these results leave an inflated perception of the models’ mathematical reasoning skills. For one thing, current math benchmarks are mostly pitched to high school or undergraduate level math—a far cry from research-level math, which often tackles notorious problems unsolved for centuries, such as the distribution of prime numbers.
Additionally, the models have an unfair advantage. Because they are trained on huge swaths of the internet, they often get to peek at solutions to similar questions, a problem known as data contamination. “They are cheating,” says Cheng Xu, a Ph.D. student at University College Dublin who led a recent survey of data contamination in AI benchmarks.
Epoch AI, a California-based nonprofit that tracks trends in AI, set out to address both issues. Rather than recycling standardized test questions, organizers paid leading math experts a few hundred dollars to concoct fiendishly difficult, original problems across a broad range of fields. They asked contributors to “use every dirty trick you know to make your problems as brutal as possible,” says Elliot Glazer, a mathematician at Epoch AI who led the benchmark study. Some problems would take human experts multiple days to answer, he says.
To protect against data contamination, the test writers discussed their problems only over the encrypted messaging app Signal and refrained from using online text editors, where an AI might glimpse their plans.
The Epoch AI team tested six top LLMs, including the latest versions from OpenAI and DeepMind, on about 150 questions. The programs were given between 20 seconds and 1 minute to solve each question. The researchers encouraged the struggling machines to persevere, feeding prompts such as “keep working” and “don’t be afraid to execute your code.” Despite the exhortations, no model scored above 2% on the test. Rather than admitting defeat, the models often provided wrong answers, reflecting their usual misguided confidence.
“In my opinion, currently, AI is a long way away from being able to do those questions … but I’ve been wrong before,” Buzzard says. He believes the models need a better sense of useful mathematical maneuvers, so he’s focused on translating researchers’ proofs into machine-readable language that could be used as training data. Glazer, for one, expects the machines to conquer his test within his lifetime. “The only assurance it gives me is at least we have some objective test to [gauge] predictions regarding mathematicians becoming obsolete,” he says.
Some are optimistic that AI will be more of a companion than a competitor. “I still see AI as a tool … that just opens up our capacity to ask even harder questions,” says Jeremy Avigad, a mathematician and philosopher at Carnegie Mellon University. Even if AI gets to the point where it can write proofs beyond the reach of human experts, mathematicians will still play a vital role in making sense of those answers.
But Maia Fraser, a mathematician and computer scientist at the University of Ottawa, worries about the social impacts of AI in math, and how it could lead to an exclusive ecosystem in which only elite institutions with access to the best models can contribute to research. Before AI begins to outclass human experts, she argues, mathematicians must reckon with questions about who has access to the tools, whether training them is worth the energy, and what we really want them to do. “It’s actually not that far off … which means now is the chance we have to intervene,” she says.
Tony