Rogue AIs could become 'sleeper agents' of chaos, researchers warn

Technology

The more the researchers probed the AI, the better it got at hiding its nefarious behaviour

Published On: Wed, 17 Jan 2024 01:07:45 PKT

(Web Desk) - Humans tell all sorts of lies to get what they want. And as we develop more powerful artificial intelligence to rival humans, it's a genuine concern that an AI could be as (if not even more) deceptive.

That's the impetus behind a new preprint research paper that finds the way AIs are trained could help them deceive users in ways standard safety testing wouldn't catch.

The study is a collaborative effort by researchers at 'responsible' AI firm Anthropic, Oxford University and several other AI institutions.

They deliberately trained an AI to exhibit undesirable behavior when prompted with certain triggers. The idea was to see if they could detect the bad behavior using safety testing and then correct the model using techniques like reinforcement learning.

What they found is that the AI acted like a "sleeper agent," seeming to behave responsibly while hiding their nefarious behavior with increasing subtlety the more the researchers tried to test and correct for it.

In some experiments, the AI appeared adept at knowing when to hide its malicious behavior to avoid being detected and eliminated, and it got better at it over time.

The reasons why an AI might choose to behave badly vary, ranging from a built-in issue planted by bad actors during its training to seeking to achieve political or economic goals.

The researchers take pains to point out that the kinds of scenarios they test in the paper don't seem to have made it into the wild — yet.

But as AI's popularity continues to surge, their work could help safeguard against the fears of an evil AI.

Technology