Abstract
Introduction: Advanced reasoning large language models are increasingly being explored for clinical decision support, but evidence regarding their performance in urology-specific tasks and their comparative standing against human respondents remains limited. This study compared five advanced reasoning models on urology-related multiple-choice questions from the MedQA dataset and benchmarked their performance against medical students and experienced urologists in terms of accuracy, response efficiency, and agreement patterns.
Methods: We extracted 434 urology-relevant items and evaluated five models (DeepSeek-R1, ChatGPT O4-mini, Gemini 2.5 Pro, Claude 3.7 Sonnet, and Grok 3) using a standardized prompt. Accuracy was computed against reference answers; API response times and connection failures were recorded. In addition, 20 senior medical students and 20 experienced urologists answered subsets of the same item bank using a balanced block design; group-level majority-vote answers were used as human baselines. Statistical analyses included Cochran’s Q and McNemar tests (artificial intelligence [AI]-only accuracy), a logistic generalized linear mixed-effects model with urologists as the reference (model-adjusted accuracy), Fleiss’ kappa and Cohen’s kappa (agreement), and Friedman and Wilcoxon signed-rank tests (response time).
Results: In the AI-only comparison, all models achieved high accuracy (86.9–93.3%), with DeepSeek-R1, ChatGPT O4-mini, and Gemini 2.5 Pro outperforming Claude 3.7 Sonnet and Grok 3. In the model-adjusted analysis, all five AI models showed significantly higher odds of correct answers than experienced urologists (all p < 0.001, Dunnett-adjusted), while medical students did not differ significantly from urologists. ChatGPT O4-mini had the shortest median API response time (5.03 s) and Grok 3 the longest among AI models (27.62 s); group-level median task completion times were 15.87 s for students and 17.57 s for urologists. Connection failure rates were 0% for ChatGPT O4-mini, Gemini 2.5 Pro, and Claude 3.7 Sonnet; 1.6% for DeepSeek-R1; and 2.8% for Grok 3. Agreement across the five AI models and the two human majority-vote baselines was moderate-to-substantial (Fleiss’ κ = 0.685, p < 0.001).
Conclusion: Modern reasoning models achieve strong accuracy and efficiency on urology-focused benchmark questions, supporting their potential role as useful clinical assistants when implemented with appropriate human oversight. ChatGPT O4-mini’s low latency further underscores its suitability for time-sensitive workflows, and model-adjusted analyses indicate that its accuracy was higher than that of experienced urologists within this standardized assessment format.
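To make the agreement analysis concrete, the following is a minimal sketch of how group-level majority-vote answers and Fleiss’ kappa could be computed. It is not the authors’ analysis code: the function names `majority_vote` and `fleiss_kappa`, the answer options, and the toy ratings are all hypothetical, and the handling of tied votes is an assumption because the abstract does not describe it.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer in a group.

    Assumption: tied votes are resolved by whatever Counter returns first;
    the abstract does not state how ties were handled.
    """
    return Counter(answers).most_common(1)[0][0]


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for items rated by the same number of raters.

    ratings: list of items, each a list with one label per rater.
    categories: all possible labels (e.g. the answer options).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    # counts[i][j] = number of raters who chose category j for item i
    counts = [[row.count(c) for c in categories] for row in ratings]
    # Observed agreement: mean over items of the pairwise agreement proportion
    p_items = [
        (sum(n * n for n in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items
    # Chance agreement from the overall proportion of each category
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(len(categories))
    ]
    p_exp = sum(p * p for p in p_cat)
    return (p_bar - p_exp) / (1 - p_exp)


# Hypothetical toy data: 3 items, 7 "raters" (5 models + 2 human baselines)
student_votes = ["B", "B", "C", "B"]             # made-up individual answers
student_baseline = majority_vote(student_votes)  # group-level answer
toy_ratings = [
    ["A", "A", "A", "A", "A", "A", "A"],
    ["B", "B", "B", "C", "B", student_baseline, "B"],
    ["D", "D", "C", "D", "D", "D", "C"],
]
kappa = fleiss_kappa(toy_ratings, categories=["A", "B", "C", "D", "E"])
print(f"Fleiss' kappa on toy data: {kappa:.3f}")
```

In this sketch each model and each human majority-vote baseline is treated as one rater per item, which mirrors the seven-rater agreement analysis described in the Methods.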
Plain Language Summary
This study compared five advanced “reasoning” artificial intelligence (AI) models on 434 urology-related multiple-choice questions from the public MedQA dataset (similar to exam-style questions). The models were DeepSeek-R1, ChatGPT O4-mini, Gemini 2.5 Pro, Claude 3.7 Sonnet, and Grok 3. To benchmark AI against people, we also included 20 senior medical students and 20 experienced urologists. All five AI models performed strongly, with accuracy ranging from 86.9% to 93.3%. DeepSeek-R1, ChatGPT O4-mini, and Gemini 2.5 Pro achieved the highest accuracy, while Claude 3.7 Sonnet and Grok 3 were slightly lower. In a model-adjusted analysis, all AI models had significantly higher odds of correct answers than experienced urologists, whereas medical students showed similar overall accuracy to urologists. This may reflect differences in strengths: students often have recently consolidated broad foundational knowledge that is frequently tested in multiple-choice formats, while specialists’ advantages are more evident in contextual, real-world decision-making. Response time also differed across groups. ChatGPT O4-mini had the fastest AI responses, while Grok 3 was the slowest among the AI models. Human completion times were intermediate. However, human time reflects reading, thinking, and clicking, whereas AI time reflects API response latency, so the two measures are not directly equivalent. Overall, these results suggest that modern AI reasoning models can provide fast and accurate support for urology-related medical questions. With appropriate human oversight, they may help clinicians check information efficiently and support clinical workflows.
