Sunday, June 15, 2025

Evaluating progress of LLMs on scientific problem-solving

Programmatic and model-based evaluations

Tasks in CURIE are varied and have ground-truth annotations in mixed and heterogeneous form, e.g., as JSONs, LaTeX equations, YAML files, or free-form text. Evaluating free-form generation is challenging because answers are often descriptive, and even when a format is specified, as in most of our cases, the response to each field can take different forms. For example, materials grid points might sometimes be specified as “[p, q, r]” and at other times as “p × q × r”. Hence, in addition to programmatic evaluation metrics, such as ROUGE-L, intersection-over-union (used for BIOGR), and identity ratio (used in PDB), we propose two model-based evaluation metrics.
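To make the format-variation problem concrete, here is a minimal sketch of normalizing the two grid-point notations above before a programmatic comparison; the normalize_grid_points helper is purely illustrative and is not part of CURIE's evaluation code.

```python
import re

def normalize_grid_points(text: str):
    """Parse a grid-point spec written either as "[p, q, r]" or "p x q x r"
    into a canonical tuple of integers, so differently formatted but
    equivalent answers can be compared programmatically."""
    numbers = re.findall(r"\d+", text)
    return tuple(int(n) for n in numbers)

# Both notations reduce to the same canonical form.
assert normalize_grid_points("[12, 12, 8]") == normalize_grid_points("12 × 12 × 8")
```

Even with such normalization, many free-form answers cannot be canonicalized reliably, which is what motivates the model-based metrics below.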

(1) LMScore: Prompts an LLM asking how closely the predictions match the ground truth on a 3-point scale: “good” if the prediction has few minor errors, “okay” if there are many minor errors, and “bad” if there are major errors. We consider the weighted average of the log-likelihood scores of the tokens to produce a final confidence.
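As a rough illustration of that final aggregation step, the sketch below turns per-label token log-likelihoods into a single confidence; the example log-likelihoods and the numeric weights assigned to “good”/“okay”/“bad” are assumptions for illustration, not values defined by the benchmark.

```python
import math

# Hypothetical per-label log-likelihoods from the judge LLM for the rating
# tokens; in practice these come from the model's token-level scores.
label_logprobs = {"good": -0.2, "okay": -1.9, "bad": -3.5}

# Assumed numeric weights for the 3-point scale (the post only specifies the
# good/okay/bad ordering, not these exact values).
label_weights = {"good": 1.0, "okay": 0.5, "bad": 0.0}

def lmscore(logprobs, weights):
    """Softmax the token log-likelihoods into probabilities, then take their
    weighted average to produce a single confidence in [0, 1]."""
    probs = {label: math.exp(lp) for label, lp in logprobs.items()}
    total = sum(probs.values())
    return sum(weights[label] * p / total for label, p in probs.items())

print(f"LMScore confidence: {lmscore(label_logprobs, label_weights):.3f}")
```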

(2) LLMSim: Is used for retrieval tasks where we ask the model to exhaustively extract many details, e.g., descriptors, properties and values of materials from a research document, and to provide as output an unordered list of dictionaries or records. We use a chain-of-thought (CoT) prompt that asks the LLM to look at each ground-truth record and identify the predicted records that correctly match each field (key) and value of the ground truth. Once we match the ground-truth records with the predicted records, we can then measure precision and recall for the retrieval task, and compute the mean average precision, recall and F1 scores across all documents.
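The record matching itself is delegated to the CoT-prompted LLM; the sketch below only shows the downstream bookkeeping, computing per-document precision, recall, and F1 from hypothetical match counts and averaging them across documents. The DocMatches container and its field names are assumed for illustration.

```python
from dataclasses import dataclass

@dataclass
class DocMatches:
    """Match counts for one document: how many predicted records the judge
    LLM aligned to ground-truth records, plus the totals on each side."""
    matched: int
    num_predicted: int
    num_ground_truth: int

def retrieval_scores(doc: DocMatches):
    """Precision, recall, and F1 for a single document."""
    precision = doc.matched / doc.num_predicted if doc.num_predicted else 0.0
    recall = doc.matched / doc.num_ground_truth if doc.num_ground_truth else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

def mean_scores(docs):
    """Average precision, recall, and F1 over all documents."""
    per_doc = [retrieval_scores(d) for d in docs]
    n = len(per_doc)
    return tuple(sum(scores[i] for scores in per_doc) / n for i in range(3))

# Toy example: two documents with judge-provided match counts.
docs = [DocMatches(matched=4, num_predicted=5, num_ground_truth=6),
        DocMatches(matched=3, num_predicted=4, num_ground_truth=3)]
print("mean precision/recall/F1:", mean_scores(docs))
```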
