Even Microsoft says AI chatbots get dumber the longer you talk to them: a joint study with Salesforce shows unreliability climbs by 112% across 200,000+ chats
A new collaborative study from Microsoft Research reveals why even the smartest AI chatbots collapse in multi-turn conversations.
All the latest news, reviews, and guides for Windows and Xbox diehards.
Top AI research labs have released sophisticated AI models and accompanying chatbots to cement their brand names in an ever-evolving landscape that is honestly becoming difficult to keep up with. However, users often complain about these offerings, citing hallucinations or outright wrong responses to queries.
A research paper by Microsoft Research and Salesforce analyzed 200,000+ AI conversations from the most advanced Large Language Models (LLMs), including GPT-4.1, Gemini 2.5 Pro, Claude 3.7 Sonnet, o3, DeepSeek R1, and Llama 4, and revealed that these tools often get "lost in conversation" when tasks are broken into a natural, multi-turn conversation (via NeuroNad).
For context, AI models like GPT-4.1 and Gemini 2.5 Pro achieve 90% success rates with single prompts. However, their performance takes a significant hit, dropping to approximately 65% during lengthier, back-and-forth conversations.
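To put those two figures side by side, here is a minimal Python sketch of the arithmetic. The function name `drop_summary` is made up for illustration; only the 90% and 65% success rates come from the study as reported above.

```python
def drop_summary(before: float, after: float) -> tuple[float, float]:
    """Return (percentage-point drop, relative drop in percent)."""
    point_drop = before - after
    relative_drop = (before - after) / before * 100
    return point_drop, relative_drop


# Success rates reported in the article: ~90% single-prompt, ~65% multi-turn.
points, relative = drop_summary(90.0, 65.0)
print(f"{points:.0f} percentage points, {relative:.1f}% relative drop")
# prints "25 percentage points, 27.8% relative drop"
```

In other words, the headline gap is 25 percentage points, which works out to losing a bit more than a quarter of the single-prompt success rate.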
Generative AI has seemingly turned into a buzzword in the tech industry; it's what everyone in the business is talking about right now. The technology is gaining widespread adoption worldwide despite claims that it's a bubble on the verge of bursting.
In 2024, Microsoft claimed that ChatGPT wasn't better than Copilot AI. The company indicated that users weren't using the product as intended, while pointing the finger at poor prompt engineering skills.
This recent study builds on this premise, as LLMs tend to deliver better results in single-turn conversations than in multi-turn ones. It further disclosed that the clear disparity in performance doesn't mean that the model has miraculously become dumber.
The researchers detail that the models' aptitude only decreased by 15%, but their unreliability skyrocketed by 112%. So what happened here exactly? The researchers indicated that AI models tend to suffer from premature generation, where they'd attempt to provide a solution for your query even before you're done with the explanation.
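How can aptitude barely move while unreliability more than doubles? (A 112% increase means the new value is 2.12 times the old one.) The article doesn't spell out how either quantity is measured, so the sketch below rests on an assumption: aptitude is treated as the 90th-percentile score across repeated runs of the same task, and unreliability as the gap between the 90th and 10th percentiles. The function names and the scores are hypothetical, made up purely for illustration.

```python
def percentile(values, q):
    """Linear-interpolation percentile of values, with q in [0, 100]."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("need at least one value")
    pos = (len(ordered) - 1) * q / 100
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * frac


def aptitude_and_unreliability(scores):
    """Assumed definitions: aptitude = 90th percentile score,
    unreliability = gap between the 90th and 10th percentiles."""
    best = percentile(scores, 90)
    worst = percentile(scores, 10)
    return best, best - worst


# Hypothetical scores (0-100) from repeated runs of one task; not study data.
single_turn = [88, 90, 91, 92, 89, 90, 93, 87, 91, 90]
multi_turn = [30, 85, 45, 80, 60, 82, 25, 78, 55, 84]

for label, scores in (("single-turn", single_turn), ("multi-turn", multi_turn)):
    aptitude, unreliability = aptitude_and_unreliability(scores)
    print(f"{label}: aptitude={aptitude:.1f}, unreliability={unreliability:.1f}")
```

Under these assumed definitions, a model's best runs can stay almost as good while its worst runs get much worse. That would match the pattern described above: a modest dip in aptitude alongside a far larger jump in unreliability.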
Perhaps more interestingly, the model tends to use its initial response as the basis for its answers to subsequent questions, even if that response was wrong. What's more, the researchers also discovered another phenomenon: answer bloat.
Per the study, the models' responses became 20% to 300% longer in multi-turn conversations. The researchers established that longer responses introduced more assumptions and hallucinations, which then, worryingly, persisted as permanent context in the conversation.
While models like OpenAI’s o3 and DeepSeek R1 ship with extra thinking tokens, they weren't able to wiggle themselves out of the bizarre situation.
AI's reliability is on shaky ground when too many variables are in the equation
It's becoming apparent that AI isn't yet ready for prime time, given critical issues like its unreliability in multi-turn conversations. Yet the way users rely on services like Google is changing rapidly, especially with tools like Google AI Overviews entering the fray.
Ditching traditional search engines like Google Search for AI-powered tools carries significant risk, as users may assume the generated information is accurate.
What worries you most about AI chatbots in real conversations? Share your thoughts with me in the comments.

Kevin Okemwa is a seasoned tech journalist based in Nairobi, Kenya with lots of experience covering the latest trends and developments in the industry at Windows Central. With a passion for innovation and a keen eye for detail, he has written for leading publications such as OnMSFT, MakeUseOf, and Windows Report, providing insightful analysis and breaking news on everything revolving around the Microsoft ecosystem. While AFK and not busy following the ever-emerging trends in tech, you can find him exploring the world or listening to music.
