MaterialBENCH: Evaluating College-Level Materials Science Problem-Solving Abilities of Large Language Models
Format | Journal Article |
---|---|
Language | English |
Published | 04.09.2024 |
Summary: | A college-level benchmark dataset for large language models (LLMs) in the
materials science field, MaterialBENCH, is constructed. The dataset consists
of problem-answer pairs based on university textbooks. There are two types of
problems: free-response problems and multiple-choice problems. Multiple-choice
problems are constructed by adding three incorrect answers as choices to a
correct answer, so that LLMs can choose one of the four as a response. Most
problems are shared between the free-response and multiple-choice types,
differing only in the format of the answers. We also conduct experiments with
MaterialBENCH on LLMs, including ChatGPT-3.5, ChatGPT-4, Bard (at the time of
the experiments), and GPT-3.5 and GPT-4 accessed through the OpenAI API. The
differences and similarities in the performance of these LLMs as measured by
MaterialBENCH are analyzed and discussed. Performance differences between the
free-response and multiple-choice types in the same models, as well as the
influence of system messages on multiple-choice problems, are also studied. We
anticipate that MaterialBENCH will encourage further development of LLMs'
reasoning abilities for solving more complicated problems and eventually
contribute to materials research and discovery. |
---|---|
DOI: | 10.48550/arxiv.2409.03161 |
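
The abstract notes that GPT-3.5 and GPT-4 were queried through the OpenAI API and that the effect of system messages on multiple-choice problems was examined. The sketch below illustrates how such a query could look; the example problem, the field names, and the prompt and system-message wording are illustrative assumptions and are not taken from the paper or its dataset.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical multiple-choice item: one correct answer plus three distractors,
# mirroring the four-choice format described in the abstract.
problem = {
    "question": "Which crystal structure does aluminum adopt at room temperature?",
    "choices": [
        "(a) body-centered cubic",
        "(b) face-centered cubic",
        "(c) hexagonal close-packed",
        "(d) simple cubic",
    ],
}

# Assumed system message constraining the response format.
system_message = (
    "Answer the multiple-choice question by replying with a single letter: a, b, c, or d."
)
user_message = problem["question"] + "\n" + "\n".join(problem["choices"])

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": system_message},
        {"role": "user", "content": user_message},
    ],
    temperature=0,  # deterministic decoding for benchmark-style evaluation
)
print(response.choices[0].message.content)
```

For a free-response problem, the same call would omit the choices and the answer-format constraint, and the returned text would be graded against the reference answer rather than matched to a letter.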