As the hype around generative AI continues to build, the need for robust safety regulations is only becoming clearer.
Now Anthropic—the company behind Claude AI—is looking at how its models could deceive or sabotage users. Anthropic just dropped a paper laying out their approach.
Anthropic’s latest research, titled "Sabotage Evaluations for Frontier Models," comes from its Alignment Science team, driven by the company's "Responsible Scaling" policy.
The goal is to gauge just how capable AI might be at misleading users or even "subverting the systems we put in place to oversee them." The study focuses on four specific tactics: Human Decision Sabotage, Code Sabotage, Sandbagging, and Undermining Oversight.
Think of users who push ChatGPT to the limit, trying to coax it into generating inappropriate content or graphic images. These tests are all about ensuring that the AI can’t be tricked into breaking its own rules.
In the paper, Anthropic says its objective is to be ready for the possibility that AI could evolve into something with dangerous capabilities. So they put their Claude 3 Opus and 3.5 Sonnet models through a series of tests, designed to evaluate and enhance their safety protocols.
The first test, Human Decision Sabotage, examined how AI could potentially manipulate human decision-making. The second test, Code Sabotage, analyzed whether AI could subtly introduce bugs into codebases. Stronger AI models actually led to stronger defenses against these kinds of vulnerabilities.
The remaining tests — Sandbagging and Undermining Oversight — explored whether the AI could conceal its true capabilities or bypass safety mechanisms embedded within the system.
For now, Anthropic’s research concludes that current AI models pose a low risk, at least in terms of these malicious capabilities.
"Minimal mitigations are currently sufficient to address sabotage risks," the team writes, but "more realistic evaluations and stronger mitigations seem likely to be necessary soon as capabilities improve."
Translation: watch out, world.