popularmechanics.com

A.I. May Pass 'Humanity’s Last Exam' Within the Next 9 Months, Scientists Say

Humanity’s Last Exam is the ultimate academic test for AI, which challenges the tech to answer the most difficult questions experts could come up with.

For now, the AIs tested—which are all large language models, or LLMs—remain stumped, with only 3-14 percent accuracy.

Because AI “brains” evolve so quickly, the same LLMs are expected to be at least 50 percent accurate by the end of 2025.

Artificial intelligence evolves exponentially faster than the human brain has over the several million years of our existence. Now, researchers are giving AI the ultimate test of academic knowledge with what they call Humanity’s Last Exam (HLE). It was created for large language models (LLMs)—AIs trained on immense datasets, like the infamous ChatGPT—and is designed to stump AI as thoroughly as possible, so that any model that passes must genuinely prove the breadth of its knowledge.

For the sake of transparency: the test was created and carried out by a team of experts from both the nonprofit Center for AI Safety (which works toward “reducing societal-scale risks from AI”) and the for-profit Scale AI (which partners with tech giants in the AI space to provide data used to train AI algorithms). The results of this test, described in a study uploaded to the preprint server arXiv, have not yet been peer-reviewed.


To see how capable they are, LLMs are evaluated against benchmarks—sets of questions that each cover different subjects, from math to linguistics and beyond. The researchers encouraged academics to submit the most difficult questions they could think of and compiled the exam from about 2,700 of the responses. Questions that existing AI models could answer easily were excluded, and as of now, various LLMs (including Gemini and DeepSeek) have scored anywhere from 3 to 14 percent.

But, if you ask one team of experts, they probably won’t be flunking for long. According to the study, most of the LLMs tested are expected to get to at least 50 percent accuracy by the end of 2025.

“HLE is developed globally by subject-matter experts and consists of multiple-choice and short-answer questions suitable for automated grading,” the researchers said in the study. “Each question has a known solution that is unambiguous and easily verifiable, but cannot be quickly answered via internet retrieval.”

HLE is about 41 percent math, 11 percent biology and medicine, 10 percent computer science, 9 percent physics, 9 percent humanities and social science, 6 percent chemistry, 5 percent engineering and 9 percent other topics. For an idea of what the LLMs are up against, one question presents an ancient Roman inscription and asks the AI to translate it. Another asks about the number of paired tendons supported by a certain bone in hummingbirds. There are complex math equations, questions about missing links in chemical reactions, and even some questions testing how much AI knows about itself.

Because the questions are so precise and have no gray area, answers are verified by another AI, GPT-4o, which also accepts correct answers that vary slightly in wording. It’s not unlike Jeopardy accepting a shortened version of the correct response when the extra words or letters don’t change the answer itself—saying “what is T. rex” instead of “what is Tyrannosaurus rex.”
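To make that grading step concrete, here is a toy sketch of what “accepting answers that vary slightly” can look like. This is not HLE’s actual grader—the study uses an LLM judge (GPT-4o) for that—just a minimal, hypothetical stand-in that treats two answers as equivalent when they match after normalization or when one is a recognized short form of the other, as in the “T. rex” example.

```python
# Toy illustration of tolerant answer matching (NOT HLE's real grader,
# which is an LLM judge). Two answers count as equivalent when they
# match after normalization, or when one is a known alias of the other.

def normalize(answer: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    cleaned = "".join(ch for ch in answer.lower() if ch.isalnum() or ch.isspace())
    return " ".join(cleaned.split())

def is_equivalent(given: str, expected: str, aliases: dict[str, set[str]]) -> bool:
    """True if the given answer matches the expected one closely enough."""
    g, e = normalize(given), normalize(expected)
    return g == e or g in aliases.get(e, set())

# Hypothetical alias table for the Jeopardy-style example above.
ALIASES = {"what is tyrannosaurus rex": {"what is t rex"}}

print(is_equivalent("What is T. rex?", "What is Tyrannosaurus rex", ALIASES))  # True
print(is_equivalent("What is Velociraptor?", "What is Tyrannosaurus rex", ALIASES))  # False
```

An LLM judge generalizes this idea: instead of a hand-built alias table, it decides equivalence from context, which is why it can handle answers it has never seen spelled a particular way before.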


AIs are failing spectacularly at the HLE, which was the intent—and their low scores may even flatter them, since a model can sometimes guess the right answer to a short-answer question or randomly pick the correct option on a multiple-choice one. The next phase in training will involve teaching an AI model to recognize uncertainty rather than confidently give a wrong answer. It will be prompted to not only answer a question, but also provide a measure of its confidence from zero to 100 percent.
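The confidence-elicitation setup described above can be sketched in a few lines. This is an illustrative mock-up, not the study’s actual harness: the prompt template and parsing logic below are assumptions about what “answer plus a 0–100 confidence score” might look like in practice.

```python
import re

# Hypothetical sketch of asking a model for an answer plus a confidence
# score, then parsing the reply so calibration can be measured later.

PROMPT_TEMPLATE = (
    "{question}\n\n"
    "Give your final answer, then on a new line write "
    "'Confidence: N%' where N is a whole number from 0 to 100."
)

def parse_response(text: str) -> tuple[str, int]:
    """Split a model reply into (answer, confidence percent).

    Falls back to 0 percent confidence if no score is found.
    """
    match = re.search(r"Confidence:\s*(\d{1,3})\s*%", text)
    if not match:
        return text.strip(), 0
    confidence = min(int(match.group(1)), 100)  # clamp malformed scores
    answer = text[: match.start()].strip()
    return answer, confidence

reply = "Tyrannosaurus rex\nConfidence: 85%"
print(parse_response(reply))  # ('Tyrannosaurus rex', 85)
```

Comparing these self-reported scores against actual accuracy is what reveals overconfidence: a well-calibrated model should be right about 85 percent of the time on answers it tags at 85 percent confidence.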

“While current LLMs achieve very low accuracy on HLE, recent history shows benchmarks are quickly saturated–with models dramatically progressing from near-zero to near-perfect performance in a short timeframe,” the researchers said in the study.

While the LLMs will most likely learn to figure out when they are uncertain about an answer, they aren’t about to feel guilty or inferior for it. AI hasn’t reached that level of sentience… yet.


Elizabeth Rayne is a creature who writes. Her work has appeared in Ars Technica, SYFY WIRE, Space.com, Live Science, Den of Geek, Forbidden Futures and Collective Tales. She lurks right outside New York City with her parrot, Lestat. When not writing, she can be found drawing, playing the piano or shapeshifting.
