
Introduction: Advanced reasoning large language models are increasingly being explored for clinical decision support, but evidence regarding their performance in urology-specific tasks and their comparative standing against human respondents remains limited. This study compared five advanced reasoning models on urology-related multiple-choice questions from the MedQA dataset and benchmarked their performance against medical students and experienced urologists in terms of accuracy, response efficiency, and agreement patterns.

Methods: We extracted 434 urology-relevant items and evaluated five models (DeepSeek-R1, ChatGPT O4-mini, Gemini 2.5 Pro, Claude 3.7 Sonnet, and Grok 3) using a standardized prompt. Accuracy was computed against reference answers; API response times and connection failures were recorded. In addition, 20 senior medical students and 20 experienced urologists answered subsets of the same item bank using a balanced block design; group-level majority-vote answers were used as human baselines. Statistical analyses included Cochran’s Q and McNemar tests (artificial intelligence [AI]-only accuracy), a logistic generalized linear mixed-effects model with urologists as the reference (model-adjusted accuracy), Fleiss’ kappa and Cohen’s kappa (agreement), and Friedman and Wilcoxon signed-rank tests (response time).

Results: Across the AI-only comparison, all models achieved high accuracy (86.9–93.3%), with DeepSeek-R1, ChatGPT O4-mini, and Gemini 2.5 Pro outperforming Claude 3.7 Sonnet and Grok 3. In the model-adjusted analysis, all five AI models showed significantly higher odds of correct answers than experienced urologists (all p < 0.001, Dunnett-adjusted), while medical students did not differ significantly from urologists. ChatGPT O4-mini had the shortest median API response time (5.03 s), whereas group-level median task completion times were 15.87 s for students and 17.57 s for urologists; Grok 3 was the slowest among AI models (27.62 s). Connection failure rates were 0% for ChatGPT O4-mini, Gemini 2.5 Pro, and Claude 3.7 Sonnet; 1.6% for DeepSeek-R1; and 2.8% for Grok 3. Agreement across the five AI models and the two human majority-vote baselines was moderate-to-substantial (Fleiss’ κ = 0.685, p < 0.001).

Conclusion: Modern reasoning models achieve strong accuracy and efficiency on urology-focused benchmark questions, supporting their potential role as useful clinical assistants when implemented with appropriate human oversight. ChatGPT O4-mini’s low latency further underscores its suitability for time-sensitive workflows, while model-adjusted analyses indicate its consistently superior accuracy relative to experienced urologists within this standardized assessment format.
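
The majority-vote baselines, Fleiss' kappa, and McNemar test named in the Methods can be illustrated with a minimal Python sketch. This is not the authors' analysis code: the function names and toy data are hypothetical, the tie-breaking rule for majority votes is an assumption, and the abstract does not say whether an exact or asymptotic McNemar test was used, so the exact binomial form shown here is also an assumption.

```python
from collections import Counter
from math import comb


def majority_vote(answers):
    """Most common answer in a group; ties go to the answer seen first (assumed rule)."""
    return Counter(answers).most_common(1)[0][0]


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for items that are each rated by the same number of raters.

    ratings: list of per-item lists, one label per rater.
    Undefined (division by zero) if every rating falls in a single category.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    # counts[i][j] = number of raters who assigned item i to category j
    counts = [[row.count(c) for c in categories] for row in ratings]
    # Observed agreement for each item, then averaged over items
    p_items = [
        (sum(n * n for n in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items
    # Chance agreement from the overall category proportions
    p_cats = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(len(categories))
    ]
    p_e = sum(p * p for p in p_cats)
    return (p_bar - p_e) / (1 - p_e)


def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar p-value from paired correct/incorrect flags."""
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)
    n = b + c  # only discordant pairs carry information
    if n == 0:
        return 1.0
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    # Toy data only: three raters answering two items with options A or B.
    toy = [["A", "A", "B"], ["A", "B", "B"]]
    print(majority_vote(toy[0]))
    print(fleiss_kappa(toy, ["A", "B"]))
    print(mcnemar_exact([1, 1, 1, 1, 0], [0, 0, 0, 0, 0]))
```

In a setup like the one described, each of the five models and the two majority-vote baselines would count as one rater per item, and McNemar's test would be applied to each pair of models on the same items.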

This study compared five advanced “reasoning” artificial intelligence (AI) models on 434 urology-related multiple-choice questions from the public MedQA dataset (similar to exam-style questions). The models were DeepSeek-R1, ChatGPT O4-mini, Gemini 2.5 Pro, Claude 3.7 Sonnet, and Grok 3. To benchmark AI against people, we also included 20 senior medical students and 20 experienced urologists. All five AI models performed strongly, with accuracy ranging from 86.9% to 93.3%. DeepSeek-R1, ChatGPT O4-mini, and Gemini 2.5 Pro achieved the highest accuracy, while Claude 3.7 Sonnet and Grok 3 were slightly lower. In a model-adjusted analysis, all AI models had significantly higher odds of correct answers than experienced urologists, whereas medical students showed similar overall accuracy to urologists. This may reflect differences in strengths: students often have recently consolidated broad foundational knowledge that is frequently tested in multiple-choice formats, while specialists’ advantages are more evident in contextual, real-world decision-making. Response time also differed across groups. ChatGPT O4-mini had the fastest AI responses, while Grok 3 was the slowest among the AI models. Human completion times were intermediate. However, human time reflects reading, thinking, and clicking, whereas AI time reflects API response latency, so the two measures are not directly equivalent. Overall, these results suggest that modern AI reasoning models can provide fast and accurate support for urology-related medical questions. With appropriate human oversight, they may help clinicians check information efficiently and support clinical workflows.
