Beyond Benchmarks: Evaluating Generalist Medical Artificial Intelligence With Psychometrics

Rigorous evaluation of generalist medical artificial intelligence (GMAI) is imperative to ensure their utility and safety before implementation in health care. Current evaluation strategies rely heavily on benchmarks, which can suffer from issues with data contamination and cannot explain how GMAI m...

Bibliographic Details
Published in Journal of medical Internet research Vol. 27; no. 7956; p. e70901
Main Authors Sun, Luning, Gibbons, Christopher, Hernández-Orallo, José, Wang, Xiting, Jiang, Liming, Stillwell, David, Luo, Fang, Xie, Xing
Format Journal Article
Language English
Published Canada Journal of Medical Internet Research 26.05.2025
Gunther Eysenbach MD MPH, Associate Professor
JMIR Publications
ISSN 1438-8871
1439-4456
DOI 10.2196/70901

Abstract Rigorous evaluation of generalist medical artificial intelligence (GMAI) is imperative to ensure their utility and safety before implementation in health care. Current evaluation strategies rely heavily on benchmarks, which can suffer from issues with data contamination and cannot explain how GMAI might fail (lacking explanatory power) or in what circumstances (lacking predictive power). To address these limitations, we propose a new methodology to improve the quality of GMAI evaluation using construct-oriented processes. Drawing on modern psychometric techniques, we introduce approaches to construct identification and present alternative assessment formats for different domains of professional skills, knowledge, and behaviors that are essential for safe practice. We also discuss the need for human oversight in future GMAI adoption.
Audience Academic
Author Jiang, Liming
Sun, Luning
Wang, Xiting
Hernández-Orallo, José
Luo, Fang
Gibbons, Christopher
Xie, Xing
Stillwell, David
Author_xml – sequence: 1
  givenname: Luning
  orcidid: 0000-0002-2470-4278
  surname: Sun
  fullname: Sun, Luning
– sequence: 2
  givenname: Christopher
  orcidid: 0000-0002-4732-7305
  surname: Gibbons
  fullname: Gibbons, Christopher
– sequence: 3
  givenname: José
  orcidid: 0000-0001-9746-7632
  surname: Hernández-Orallo
  fullname: Hernández-Orallo, José
– sequence: 4
  givenname: Xiting
  orcidid: 0000-0001-5768-1095
  surname: Wang
  fullname: Wang, Xiting
– sequence: 5
  givenname: Liming
  orcidid: 0000-0001-6464-2326
  surname: Jiang
  fullname: Jiang, Liming
– sequence: 6
  givenname: David
  orcidid: 0000-0003-0174-3212
  surname: Stillwell
  fullname: Stillwell, David
– sequence: 7
  givenname: Fang
  orcidid: 0000-0003-3281-9574
  surname: Luo
  fullname: Luo, Fang
– sequence: 8
  givenname: Xing
  orcidid: 0009-0009-3257-3077
  surname: Xie
  fullname: Xie, Xing
ContentType Journal Article
Copyright Luning Sun, Christopher Gibbons, José Hernández-Orallo, Xiting Wang, Liming Jiang, David Stillwell, Fang Luo, Xing Xie. Originally published in the Journal of Medical Internet Research (https://www.jmir.org).
COPYRIGHT 2025 Journal of Medical Internet Research
2025. This work is licensed under https://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Copyright © Luning Sun, Christopher Gibbons, José Hernández-Orallo, Xiting Wang, Liming Jiang, David Stillwell, Fang Luo, Xing Xie. Originally published in the Journal of Medical Internet Research (https://www.jmir.org) 2025
Discipline Medicine
Library & Information Science
EISSN 1438-8871
EndPage e70901
ExternalDocumentID oai_doaj_org_article_bf5d76370de843bc80cb9272ba5b1341
PMC12129431
A841614425
40418851
10_2196_70901
Genre Journal Article
GeographicLocations United States
United Kingdom--UK
IsDoiOpenAccess true
IsOpenAccess true
IsPeerReviewed true
IsScholarly true
Issue 7956
Keywords generalist medical artificial intelligence
human oversight
predictive power
construct-oriented evaluation
data contamination
explanatory power
psychometrics
benchmark
health care
Language English
License Luning Sun, Christopher Gibbons, José Hernández-Orallo, Xiting Wang, Liming Jiang, David Stillwell, Fang Luo, Xing Xie. Originally published in the Journal of Medical Internet Research (https://www.jmir.org).
This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research (ISSN 1438-8871), is properly cited. The complete bibliographic information, a link to the original publication on https://www.jmir.org/, as well as this copyright and license information must be included.
Notes
CG is an employee of Oracle Health Inc., serves on the Board of Directors at the International Society for Quality of Life Research, and holds stock in Oracle Corporation. XW has previously been employed at Microsoft Research and holds stock in Microsoft. LJ has previously served as an intern at Microsoft Research. XX is an employee of Microsoft Research and holds stock in Microsoft. All other authors declare no conflicts of interest.
these authors contributed equally
OpenAccessLink https://doaj.org/article/bf5d76370de843bc80cb9272ba5b1341
PMID 40418851
PublicationCentury 2000
PublicationDate 2025-05-26
PublicationDateYYYYMMDD 2025-05-26
PublicationDecade 2020
PublicationPlace Canada
PublicationPlace_xml – name: Canada
– name: Toronto
– name: Toronto, Canada
PublicationTitle Journal of medical Internet research
PublicationTitleAlternate J Med Internet Res
PublicationYear 2025
Publisher Journal of Medical Internet Research
Gunther Eysenbach MD MPH, Associate Professor
JMIR Publications
SecondaryResourceType review_article
StartPage e70901
SubjectTerms Accuracy
Applications of AI
Artificial Intelligence
Benchmarking
Benchmarks
Chatbots
Cognition & reasoning
Cognitive ability
Development and Evaluation of Research Methods, Instruments and Tools
Empathy
Health care
Humans
Large language models
Licenses
Licensing examinations
Medical practices
Methods
Personality
Physicians
Power
Program Evaluation and Review Technique
Psychometrics
Quantitative psychology
Subject specialists
Verbal communication
Viewpoint
Viewpoints and s
Work skills
Title Beyond Benchmarks: Evaluating Generalist Medical Artificial Intelligence With Psychometrics
URI https://www.ncbi.nlm.nih.gov/pubmed/40418851
https://www.proquest.com/docview/3222369523
https://www.proquest.com/docview/3212120126
https://pubmed.ncbi.nlm.nih.gov/PMC12129431
https://doaj.org/article/bf5d76370de843bc80cb9272ba5b1341
Volume 27