Research Articles | March 23, 2026

Comparative Evaluation of Advanced Reasoning Models for Clinical Decision Support in Urology

Subject Area: Further Areas

Ning He; Yang Wang; Xinhong Hu; Zhanhong Zhang; Yangkai Bai; Jianfu Wang; Nan Zheng; Linhai Shao *

Department of Urology, Hanzhong Central Hospital, Hanzhong, China

[email protected]

Urol Int (2026)

https://doi.org/10.1159/000551610

PubMed: 41871224

Abstract

Introduction: Advanced reasoning large language models are increasingly being explored for clinical decision support, but evidence on their performance in urology-specific tasks, and on how they compare with human respondents, remains limited. This study compared five advanced reasoning models on urology-related multiple-choice questions from the MedQA dataset and benchmarked them against medical students and experienced urologists in terms of accuracy, response efficiency, and agreement patterns.

Methods: We extracted 434 urology-relevant items and evaluated five models (DeepSeek-R1, ChatGPT O4-mini, Gemini 2.5 Pro, Claude 3.7 Sonnet, and Grok 3) using a standardized prompt. Accuracy was computed against reference answers; API response times and connection failures were recorded. In addition, 20 senior medical students and 20 experienced urologists answered subsets of the same item bank using a balanced block design; group-level majority-vote answers were used as human baselines. Statistical analyses included Cochran’s Q and McNemar tests (artificial intelligence [AI]-only accuracy), a logistic generalized linear mixed-effects model with urologists as the reference (model-adjusted accuracy), Fleiss’ kappa and Cohen’s kappa (agreement), and Friedman and Wilcoxon signed-rank tests (response time).

Results: In the AI-only comparison, all models achieved high accuracy (86.9–93.3%), with DeepSeek-R1, ChatGPT O4-mini, and Gemini 2.5 Pro outperforming Claude 3.7 Sonnet and Grok 3. In the model-adjusted analysis, all five AI models showed significantly higher odds of a correct answer than experienced urologists (all p < 0.001, Dunnett-adjusted), while medical students did not differ significantly from urologists. ChatGPT O4-mini had the shortest median API response time (5.03 s) and Grok 3 the longest among the AI models (27.62 s); group-level median task completion times were 15.87 s for students and 17.57 s for urologists. Connection failure rates were 0% for ChatGPT O4-mini, Gemini 2.5 Pro, and Claude 3.7 Sonnet; 1.6% for DeepSeek-R1; and 2.8% for Grok 3. Agreement across the five AI models and the two human majority-vote baselines was moderate-to-substantial (Fleiss’ κ = 0.685, p < 0.001).

Conclusion: Modern reasoning models achieve strong accuracy and efficiency on urology-focused benchmark questions, supporting their potential role as clinical assistants when implemented with appropriate human oversight. ChatGPT O4-mini’s low latency further underscores its suitability for time-sensitive workflows, while the model-adjusted analyses indicate that its accuracy was consistently higher than that of experienced urologists within this standardized assessment format.
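For readers unfamiliar with the statistics named in the Methods, the following is a minimal sketch of three of them: Cochran’s Q across several raters, a pairwise McNemar test, and Fleiss’ kappa. It is not the authors’ analysis code. The function names, the choice of the exact (binomial) form of McNemar’s test, and the toy data are all assumptions made for illustration; the abstract does not state which McNemar variant or which software was used.

```python
from collections import Counter
from math import comb


def cochran_q(correct):
    """Cochran's Q for an items x raters matrix of 0/1 correctness flags.

    Returns (Q, degrees_of_freedom); Q is referred to a chi-square
    distribution with k - 1 degrees of freedom.
    """
    k = len(correct[0])
    col_totals = [sum(row[j] for row in correct) for j in range(k)]
    row_totals = [sum(row) for row in correct]
    n_correct = sum(row_totals)
    numerator = (k - 1) * (k * sum(c * c for c in col_totals) - n_correct ** 2)
    denominator = k * n_correct - sum(r * r for r in row_totals)
    if denominator == 0:
        raise ValueError("Q is undefined: all raters agree on every item")
    return numerator / denominator, k - 1


def mcnemar_exact(a, b):
    """Two-sided exact McNemar p-value for two paired 0/1 vectors."""
    only_a = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    only_b = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    n = only_a + only_b  # discordant pairs are the only informative ones
    if n == 0:
        return 1.0
    tail = sum(comb(n, i) for i in range(min(only_a, only_b) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def fleiss_kappa(answers):
    """Fleiss' kappa for items each labelled by the same number of raters."""
    n_items = len(answers)
    n_raters = len(answers[0])
    category_totals = Counter()
    mean_agreement = 0.0
    for item in answers:
        counts = Counter(item)
        category_totals.update(counts)
        agreeing = sum(c * c for c in counts.values()) - n_raters
        mean_agreement += agreeing / (n_raters * (n_raters - 1))
    mean_agreement /= n_items
    total = n_items * n_raters
    chance = sum((c / total) ** 2 for c in category_totals.values())
    return (mean_agreement - chance) / (1 - chance)


# Toy data only (hypothetical, not the study's results).
# Rows are items, columns are three raters; 1 = correct, 0 = incorrect.
toy_correct = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 1],
    [1, 1, 0],
]
q_stat, dof = cochran_q(toy_correct)
p_pair = mcnemar_exact([r[0] for r in toy_correct], [r[2] for r in toy_correct])

# Chosen answer options per item for three raters (again hypothetical).
toy_answers = [["A", "A", "A"], ["B", "B", "C"], ["C", "C", "C"], ["A", "B", "B"]]
kappa = fleiss_kappa(toy_answers)

print(round(q_stat, 3), dof, round(p_pair, 3), round(kappa, 3))
```

In a design like the one described, Cochran’s Q would serve as the omnibus test across the five models and McNemar’s test as the pairwise follow-up on discordant items. Fleiss’ kappa would take the seven answer columns (five models plus the two majority-vote baselines) as raters. P-values for Q and any multiplicity correction are left out of this sketch.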

Plain Language Summary

This study compared five advanced “reasoning” artificial intelligence (AI) models on 434 urology-related multiple-choice questions from the public MedQA dataset (similar to exam-style questions). The models were DeepSeek-R1, ChatGPT O4-mini, Gemini 2.5 Pro, Claude 3.7 Sonnet, and Grok 3. To benchmark AI against people, we also included 20 senior medical students and 20 experienced urologists. All five AI models performed strongly, with accuracy ranging from 86.9% to 93.3%. DeepSeek-R1, ChatGPT O4-mini, and Gemini 2.5 Pro achieved the highest accuracy, while Claude 3.7 Sonnet and Grok 3 were slightly lower. In a model-adjusted analysis, all AI models had significantly higher odds of correct answers than experienced urologists, whereas medical students showed similar overall accuracy to urologists. This may reflect differences in strengths: students often have recently consolidated broad foundational knowledge that is frequently tested in multiple-choice formats, while specialists’ advantages are more evident in contextual, real-world decision-making. Response time also differed across groups. ChatGPT O4-mini had the fastest AI responses, while Grok 3 was the slowest among the AI models. Human completion times were intermediate. However, human time reflects reading, thinking, and clicking, whereas AI time reflects API response latency, so the two measures are not directly equivalent. Overall, these results suggest that modern AI reasoning models can provide fast and accurate support for urology-related medical questions. With appropriate human oversight, they may help clinicians check information efficiently and support clinical workflows.

© 2026 S. Karger AG, Basel
