OpenAI study says punishing AI models for lying doesn't help — It only sharpens their deceptive and obscure workarounds

ChatGPT logo is displayed on mobile phone screen.
OpenAI's current ChatGPT logo. (Image credit: Getty Images | Anadolu)

Hallucinations and outright wrong responses are among the major challenges facing the progression and public interpretation of AI, and many consumers still wouldn't touch the technology with a 10-foot pole. Google's Overviews AI feature recommended eating glue and rocks, and even suggested suicide as a bizarre response to a query.

Aside from the rising security and privacy concerns, OpenAI researchers recently published an interesting study depicting the daunting task of controlling sophisticated AI models like OpenAI's proprietary reasoning models as an attempt to prevent them from veering off the guardrails and potentially spiraling out of control.

AI is willing to leave no stone unturned to escape penalties

ChatGPT remains one of the most recognizable AI tools. (Image credit: Getty Images | KIRILL KUDRYAVTSEV )

The OpenAI researchers admitted that model monitoring is "fragile." They further indicated that strict supervision of the model's Chain of Thought can prompt the tools to hide their missteps to secure its reward while breaking restrictions.

AI models are reportedly great at covering their tracks, making it easy for the monitor to overlook their obscured deception.

OpenAI's GPT-4o model was used to oversee an unreleased frontier reasoning model during training as the researchers watched it complete an array of coding exercises. Interestingly, the model cut some corners and blatantly admitted that it was reward hacking in its Chain of Thought.

For instance, the model outrightly skipped some assigned coding tasks by prematurely terminating programs to get to the desired output quicker. Following this shocking discovery, the researchers attempted to penalize the model.

It seemingly improved the model's performance, but it was short-lived. The researchers indicated that there were traces of reward hacking and incorrect responses in the model's Chain of Thought.

The model seemingly devised a way to mask its deceptive ways from the monitoring AI. It remains unclear if the world is ready for human-level intelligent AI systems if the issue persists and becomes more prevalent beyond human intervention.

The agent trained with CoT pressure still learns to reward hack; only now its cheating is undetectable by the monitor because it has learned to hide its intent in the chain-of-thought.

OpenAI researchers

According to AI safety researcher Roman Yampolskiy's p(doom) values, there's a 99.999999% probability that AI will end humanity. Interestingly, OpenAI CEO Sam Altman claims AGI will be achieved within 5 years and argues that it will whoosh by with surprisingly little societal impact.

The researchers have expressed hope for future methods that will allow direct influence of reasoning models' behavior through Chain of Thought without obscure tactics and deception. To that end, they recommend less intrusive optimization techniques on the Chain of Thought of advanced AI models.

Kevin Okemwa
Contributor

Kevin Okemwa is a seasoned tech journalist based in Nairobi, Kenya with lots of experience covering the latest trends and developments in the industry at Windows Central. With a passion for innovation and a keen eye for detail, he has written for leading publications such as OnMSFT, MakeUseOf, and Windows Report, providing insightful analysis and breaking news on everything revolving around the Microsoft ecosystem. While AFK and not busy following the ever-emerging trends in tech, you can find him exploring the world or listening to music.

You must confirm your public display name before commenting

Please logout and then login again, you will then be prompted to enter your display name.