Quality Check of the newest Large Language Models (LLMs)

11/14/2025 | Latest News

How do current AI language models compare with one another? Prof. Dr. Christian Stump investigates this by giving various language models difficult math problems to solve.

Mathematicians from all over the world contribute problems to his benchmark project. The models perform very differently: while the best one in the current benchmark can correctly solve over 40 percent of the tasks, the worst manages only 12 percent. All information is available online.

Stump began the project out of scientific curiosity: ‘I contributed some problems from my own research to certain benchmarks. I was interested in seeing which scientific questions the models are already capable of answering,’ he explains. ‘But these benchmarks were like a black box, even for the participating researchers: the quality of the benchmarks couldn’t be assessed.’

Read the full article (German)