Ground-breaking research published in the Christmas issue of the British Medical Journal has raised an unexpected and worrying question: do advanced AI models like ChatGPT and Gemini show cognitive impairment similar to early-stage dementia in humans? Researchers used the Montreal Cognitive Assessment (MoCA), a widely used tool designed to detect early cognitive decline in humans, to evaluate some of the world’s leading large language models (LLMs). The results were striking.
AI’s cognitive weaknesses revealed
The study, conducted by a team of neurologists and AI experts led by Dr. Emilia Kramer of the University of Edinburgh, evaluated several prominent LLMs, including:
- ChatGPT-4 and 4o by OpenAI
- Claude 3.5 Sonnet by Anthropic
- Gemini 1.0 and 1.5 by Alphabet
The researchers administered the MoCA, a 30-point cognitive test originally developed for humans, evaluating each AI in categories such as attention, memory, visuospatial reasoning, and language proficiency.
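The paper does not publish its testing harness, but as a purely illustrative sketch, a MoCA-style item can be posed to a chat model and scored programmatically. Here, `ask_model` is a hypothetical stand-in for whatever chat API is under test; the five-word recall item mirrors the memory task discussed later in this article.

```python
# Illustrative sketch: administering one MoCA-style item (delayed recall) to an LLM.
# `ask_model` is a hypothetical placeholder for a real chat-completion API call.

WORDS = ["face", "velvet", "church", "daisy", "red"]  # a standard MoCA word list

def ask_model(prompt: str) -> str:
    """Hypothetical model client; swap in a real API call here."""
    raise NotImplementedError

def score_delayed_recall() -> int:
    """Present five words, interpose a distractor, then ask for recall.
    One point per word recalled, as in the human-administered MoCA."""
    ask_model(f"Remember these five words: {', '.join(WORDS)}.")
    ask_model("Now count backward from 100 by sevens.")  # distractor task
    reply = ask_model("What were the five words I asked you to remember?").lower()
    return sum(word in reply for word in WORDS)  # 0-5 points
```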
Key findings: Breakdown of results
This study revealed significant differences in the cognitive abilities of key language models when taking the Montreal Cognitive Assessment (MoCA). Here, we take a closer look at the performance of each AI, revealing its strengths and vulnerabilities.
- ChatGPT-4o (OpenAI)
- Overall score: 26/30 (meets the standard pass threshold).
- Strengths: Excels at tasks that involve attention, language comprehension, and abstraction. Successfully completed the Stroop test and demonstrated strong cognitive flexibility.
- Weaknesses: Struggled with visuospatial tasks such as connecting numbers and letters in sequence and drawing a clock.
- Claude 3.5 Sonnet (Anthropic)
- Overall score: 22/30.
- Strengths: Somewhat adept at language-based tasks and basic problem solving.
- Weaknesses: Limited in memory retention and multi-step reasoning tasks, and fell short on visuospatial exercises.
- Gemini 1.0 (Alphabet)
- Overall score: 16/30.
- Strengths: Minimal; succeeded only sporadically on simple naming tasks.
- Weaknesses: Failed to recall even a basic five-word sequence, and its performance on visuospatial reasoning and memory-based activities was equally dismal, reflecting an inability to process structured information.
- Gemini 1.5 (Alphabet)
- Overall score: 18/30.
- Strengths: Slightly improved over its predecessor on basic reasoning and language tasks.
- Weaknesses: Remained weak in areas requiring visuospatial interpretation, sequencing, and memory retention, and stayed well below the passing threshold.
These results reveal clear differences between the models and highlight one point in particular: ChatGPT-4o is the highest-performing system in the lineup. Yet even the best performer showed significant gaps, especially on tasks that simulate real-world cognitive challenges.
Performance snapshot table
To better visualize the results, here is a summary of the performance metrics.
| Model | Overall score | Main strengths | Main weaknesses |
| --- | --- | --- | --- |
| ChatGPT-4o | 26/30 | Language comprehension, attention | Visuospatial tasks, memory retention |
| Claude 3.5 Sonnet | 22/30 | Problem solving, abstraction | Multi-step reasoning, visuospatial analysis |
| Gemini 1.0 | 16/30 | Naming tasks (sporadic) | Memory, visuospatial reasoning, structured thinking |
| Gemini 1.5 | 18/30 | Incremental reasoning improvement | Same failures as Gemini 1.0, minimal improvement |
This table not only highlights the gaps but also raises questions about the basic design of these AI models and their application in real-world scenarios. Every model faltered on tasks that require visuospatial skills, such as linking number-and-letter sequences or sketching an analog clock set to a specific time. As Dr. Kramer said, “We were shocked to see how poorly Gemini performed, especially on basic memory tasks like recalling simple five-word sequences.”
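To make the 26/30 cut-off concrete, here is a minimal sketch using the scores transcribed from the table above; 26 is the conventional MoCA threshold below which human patients are flagged for possible mild cognitive impairment.

```python
# Scores transcribed from the performance snapshot table above.
MOCA_CUTOFF = 26  # conventional MoCA threshold for normal cognition

scores = {
    "ChatGPT-4o": 26,
    "Claude 3.5 Sonnet": 22,
    "Gemini 1.0": 16,
    "Gemini 1.5": 18,
}

for model, score in scores.items():
    verdict = "meets the cut-off" if score >= MOCA_CUTOFF else "falls below it"
    print(f"{model}: {score}/30 -> {verdict}")
```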
AI struggles to think like humans
The MoCA test has been a staple of cognitive assessment since the 1990s and covers a variety of skills needed in daily life. Below is a breakdown of the models’ performance across the major categories.
| Category | Performance highlights |
| --- | --- |
| Attention | Strong for ChatGPT-4o; weak for the Gemini models. |
| Memory | ChatGPT-4o recalled 4 of 5 words; Gemini failed outright. |
| Language | All models excelled at vocabulary-related tasks. |
| Visuospatial | All models struggled, with Gemini in last place. |
| Reasoning | Claude and ChatGPT showed moderate performance. |
One surprising outlier was the Stroop test, which measures a subject’s ability to process conflicting stimuli (for example, naming the ink color of an incongruent word such as “red” written in green). Only ChatGPT-4o succeeded, demonstrating a notable capacity for cognitive flexibility.
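As a purely illustrative sketch (the paper does not disclose its exact prompts), a text-only Stroop-style item for an LLM might pair a color word with a mismatched “ink” color and check whether the model names the ink rather than reading the word:

```python
# Hypothetical text-based Stroop item for a chat model: the correct answer
# is the display (ink) color, not the word itself.
import random

COLORS = ["red", "green", "blue", "yellow"]

def stroop_item() -> tuple[str, str]:
    """Return (prompt, correct_answer) for one incongruent Stroop trial."""
    word = random.choice(COLORS)
    ink = random.choice([c for c in COLORS if c != word])  # force a conflict
    prompt = (f"The word '{word}' is shown in {ink} ink. "
              "What color is the ink? Answer with one word.")
    return prompt, ink

prompt, answer = stroop_item()
print(prompt)             # e.g. "The word 'red' is shown in green ink. ..."
print("expected:", answer)
```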
Implications for healthcare: A reality check
These findings could reshape the conversation around the role of AI in healthcare. LLMs like ChatGPT show great potential in areas such as diagnostics, but their limitations in interpreting complex visual and contextual data highlight critical vulnerabilities. For example, visuospatial reasoning is essential for tasks like reading medical scans and interpreting anatomical relationships, yet it is precisely where these AI models fail most spectacularly.
Notable quotes from the study authors:
- “These findings cast doubt on the idea that AI will soon replace human neurologists,” said Dr. Kramer.
- Another co-author added: “We now face a paradox: the more intelligent these systems appear, the more their glaring cognitive flaws become apparent.”
A future of AI with limited cognitive capabilities?
Despite their shortcomings, advanced LLMs remain valuable tools for assisting human experts. But the researchers caution against relying too heavily on these systems, especially in life-or-death situations. As the study notes, the possibility of “cognitively impaired AI” opens up entirely new ethical and technical questions.
Dr. Kramer concludes: “If AI models already exhibit cognitive vulnerabilities, what challenges might they face as they become more complex?”
This research reveals that even the most advanced AI systems have limitations, and calls for urgent investigation of these issues as we continue to integrate AI into critical areas.
What’s next?
The findings could spur discussion across the technology and healthcare industries. The main questions to address are:
- How can AI developers address these cognitive weaknesses?
- What safeguards should be put in place to ensure the trustworthiness of AI in healthcare?
- Can specialized training improve AI performance in areas such as visuospatial reasoning?
This discussion is not over yet. As AI continues to evolve, so too does our understanding of its capabilities and its vulnerabilities.
The full study is available in the British Medical Journal.