I find it difficult to wade through all the hype about AI, along with the anecdotes about its failure to reliably answer basic questions.
Gerard Milburn kindly brought to my attention a nice paper that systematically addresses whether AI is useful as an aid (research assistant) for solving basic (but difficult) problems that condensed matter theorists care about.
CMT-Benchmark: A Benchmark for Condensed Matter Theory Built by Expert Researchers
The abstract is below.
My only comment is one of perspective. Is the cup half full or half empty? Do we emphasise the failures or the successes?
The optimists among us will claim that the success in solving even some of these difficult problems shows the power and potential of AI. It is just a matter of time before LLMs can solve most of them, and we will see dramatic increases in research productivity (i.e., reductions in the time it takes to complete a project).
The pessimists and the skeptically inclined will claim that the failures highlight the limitations of AI, particularly when training datasets are small. We are still a long way from replacing graduate students with AI bots (or, at least, from using AI to train students in the first year of their PhD).
What do you think? Should this study lead to optimism, pessimism, or just wait and see?
----------
Large language models (LLMs) have shown remarkable progress in coding and math problem-solving, but evaluation on advanced research-level problems in hard sciences remains scarce. To fill this gap, we present CMT-Benchmark, a dataset of 50 problems covering condensed matter theory (CMT) at the level of an expert researcher. Topics span analytical and computational approaches in quantum many-body physics and classical statistical mechanics. The dataset was designed and verified by a panel of expert researchers from around the world. We built the dataset through a collaborative environment that challenges the panel to write and refine problems they would want a research assistant to solve, including Hartree-Fock, exact diagonalization, quantum/variational Monte Carlo, density matrix renormalization group (DMRG), quantum/classical statistical mechanics, and model building. We evaluate LLMs by programmatically checking solutions against expert-supplied ground truth. We developed machine-grading, including symbolic handling of non-commuting operators via normal ordering. These graders generalize across tasks. Our evaluations show that frontier models struggle with all of the problems in the dataset, highlighting a gap in the physical reasoning skills of current LLMs. Notably, experts identified strategies for creating increasingly difficult problems by interacting with the LLMs and exploiting common failure modes. The best model, GPT5, solves 30% of the problems; the average across 17 models (GPT, Gemini, Claude, DeepSeek, Llama) is 11.4±2.1%. Moreover, 18 problems are solved by none of the 17 models, and 26 by at most one. These unsolved problems span Quantum Monte Carlo, Variational Monte Carlo, and DMRG. Answers sometimes violate fundamental symmetries or have unphysical scaling dimensions. We believe this benchmark will guide development toward capable AI research assistants and tutors.
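An aside on the machine grading mentioned in the abstract. The paper's graders are its own, but to illustrate what "symbolic handling of non-commuting operators via normal ordering" can look like, here is a minimal sketch (my illustration, not the benchmark's code) using SymPy's bosonic operators: bring a candidate answer and the ground truth to normal-ordered form and check that their difference vanishes.

# Minimal sketch (not the CMT-Benchmark grader): test whether two
# expressions built from non-commuting ladder operators are equal by
# normal ordering their difference.
from sympy.physics.quantum import Dagger
from sympy.physics.quantum.boson import BosonOp
from sympy.physics.quantum.operatorordering import normal_ordered_form

a = BosonOp('a')  # a single bosonic mode, with [a, Dagger(a)] = 1

# Two ways of writing the same operator: a a† a = a† a a + a
candidate = a * Dagger(a) * a
ground_truth = Dagger(a) * a * a + a

# Normal order the difference; it reduces to zero when the operators agree.
difference = normal_ordered_form((candidate - ground_truth).expand())
print(difference)  # prints 0 because the two expressions are the same operator

The benchmark's actual graders presumably handle fermionic operators and more general algebras, but the compare-after-normal-ordering idea is the same.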