Comparative Evaluation of Advanced Reasoning Models for Clinical Decision Support in Urology

He, Ning; Wang, Yang; Hu, Xinhong; Zhang, Zhanhong; Bai, Yangkai; Wang, Jianfu; Zheng, Nan; Shao, Linhai

doi:10.1159/000551610

Abstract

Objective: To compare the performance of five advanced reasoning models on urology-related clinical multiple-choice questions from the MedQA dataset, and to benchmark AI performance against medical students and experienced urologists in terms of accuracy, response efficiency, and agreement patterns.
Methods: We extracted 434 urology-relevant items and evaluated five models (DeepSeek-R1, ChatGPT O4-mini, Gemini 2.5 Pro, Claude 3.7 Sonnet, and Grok 3) using a standardized prompt. Accuracy was computed against reference answers; API response times and connection failures were recorded. In addition, 20 senior medical students and 20 experienced urologists answered subsets of the same item bank using a balanced block design; group-level majority-vote answers were used as human baselines. Statistical analyses included Cochran’s Q and McNemar tests (AI-only accuracy), a logistic generalized linear mixed-effects model (GLMM) with urologists as the reference (model-adjusted accuracy), Fleiss’ κ and Cohen’s κ (agreement), and Friedman and Wilcoxon signed-rank tests (response time).
Results: Across the AI-only comparison, all models achieved high accuracy (86.9–93.3%), with DeepSeek-R1, ChatGPT O4-mini, and Gemini 2.5 Pro outperforming Claude 3.7 Sonnet and Grok 3. In the model-adjusted analysis, all five AI models showed significantly higher odds of correct answers than experienced urologists (all p < 0.001, Dunnett-adjusted), while medical students did not differ significantly from urologists. ChatGPT O4-mini had the shortest median API response time (5.03 s), whereas group-level median task completion times were 15.87 s for students and 17.57 s for urologists; Grok 3 was slowest among AI models (27.62 s). Connection failure rates were 0% for ChatGPT O4-mini, Gemini 2.5 Pro, and Claude 3.7 Sonnet; 1.6% for DeepSeek-R1; and 2.8% for Grok 3. Agreement across the five AI models and the two human majority-vote baselines was moderate-to-substantial (Fleiss’ κ = 0.685, p < 0.001).
Conclusion: Modern reasoning models achieve strong accuracy and efficiency on urology-focused benchmark questions, supporting their potential role as useful clinical assistants when implemented with appropriate human oversight. ChatGPT O4-mini’s low latency further underscores its suitability for time-sensitive workflows, while model-adjusted analyses indicate its consistently superior accuracy relative to experienced urologists within this standardized assessment format.
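
The abstract names several standard procedures (group-level majority voting, McNemar tests for paired accuracy, Fleiss’ κ for multi-rater agreement) without showing how they are computed. The following is a minimal sketch of those three calculations in plain Python, not the authors’ analysis code. The function names are hypothetical, and the exact binomial form of McNemar’s test is an assumption, since the abstract does not state which variant was used.

```python
from collections import Counter
from math import comb


def majority_vote(answers):
    """Return the most frequent answer in a group.

    Ties go to the answer that appears first (an assumption; the
    abstract does not say how ties were broken).
    """
    return Counter(answers).most_common(1)[0][0]


def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar p-value for two raters on the same items.

    correct_a, correct_b: equal-length sequences of booleans (or 0/1)
    marking whether each rater answered each item correctly.
    Only discordant pairs carry information.
    """
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if not x and y)
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence of a difference
    k = min(b, c)
    # Binomial(n, 0.5) tail probability, doubled for a two-sided test
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, p)


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for a fixed number of raters per item.

    ratings: list of items, each a list with one category label per rater.
    categories: list of all possible labels (e.g. answer options).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    # counts[i][j] = number of raters who put item i in category j
    counts = [[row.count(cat) for cat in categories] for row in ratings]
    # Observed agreement per item, then averaged over items
    p_i = [
        (sum(n * n for n in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_items
    # Chance agreement from the overall category proportions
    p_j = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(len(categories))
    ]
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)
```

In a setting like the one described, each item’s `ratings` row would hold seven labels (five model answers plus the two human majority-vote answers), and `mcnemar_exact` would be applied to each pair of models after a significant omnibus test.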


S. Karger AG, Basel


2026
