Fig. 1.
Increase in correct responses per round for each LLM across two question rounds. Responses are evaluated by their FA and by the EA provided by the respective LLM. Panels a–c show the two question rounds for ChatGPT-3.5, ChatGPT-4, and BING AI, respectively. EA, extended accuracy; FA, formal accuracy; Rd., round.