White House and Anthropic collaborate on a standardized technical framework to benchmark AI jailbreak severity

VIEWS72KRETWEETS31

The White House and Anthropic may have found the first serious path to restore Mythos and Fable access without pretending jailbreaks can be eliminated.

AI regulation may be shifting from vague fear to a benchmark based tests of model failure, because completely removing absolutely all jailbreak is probably not a possible target.

The proposed framework would score how far the bypass went, what capabilities became reachable, how repeatable the attack was, and whether the exposed behavior had real operational consequences.

Both sides are now moving toward a shared way to score what a jailbreak actually means.

The hard truth is that perfect immunity is probably the wrong target; a recent red-team study found even hardened frontier models still produced confirmed harmful completions under sustained automated attack, with Fable 5 remaining more robust than Opus 4.8 but not invulnerable.

So once, for a newly released model, the governments can ask the same questions every time, how much was bypassed, what capability was exposed, how reproducible was the attack, and what damage could follow, thats a much more practical path.

Sophia Cai@SophiaCai99

NEW: White House and Anthropic are working to create a formal technical assessment framework that can quantify the severity of the jailbreak in question and create a standardized methodology for evaluating similar incidents in the future.

It’s the clearest sign yet that talks are moving forward and it reflects an understanding that no AI model can be completely immune to hacking.

Aim is to developing a common set of benchmarks that could be used to assess future jailbreaks, including the extent to which safeguards were bypassed, the capabilities exposed, and the practical consequences of the breach.

w/ @cheyennehaslett https://www.politico.com/news/2026/06/18/white-house-talks-with-anthropic-shift-to-setting-ai-security-rules-00967758

1d72K22460

BOOKMARKS71REPLIES28

Aaron Levie@levie

This is a good update for getting access to Fable. It also gives us a view into what the future is likely going to look like with AI regulation.

The government will have frameworks that are used to determine future model releases past a certain threshold of capability or compute levels. Given all the constituents involved, and the economic and societal significance of AI, this was practically an inevitability.

It may seem small but the implications are massive. It will mean that each model update will go through an extensive review, testing, and feedback process. And in that processes lots of groups will weigh in on the risk of the model, and there will be lots of subjectivity on what the actual risks are or practicalities of exploiting those risks.

A positive potential future here would be we still get massive model progress but they just happen in bigger jumps at once, where the labs pack in major improvements since the cost and slow down of each review stacks up.

On the other side, the risk is that past a certain threshold we may not get to see the rapid back and forth of model progress that we’ve gotten used to which can have negative compounding effects. Hoping for the former outcome.

Sophia Cai@SophiaCai99

NEW: White House and Anthropic are working to create a formal technical assessment framework that can quantify the severity of the jailbreak in question and create a standardized methodology for evaluating similar incidents in the future.

It’s the clearest sign yet that talks are moving forward and it reflects an understanding that no AI model can be completely immune to hacking.

Aim is to developing a common set of benchmarks that could be used to assess future jailbreaks, including the extent to which safeguards were bypassed, the capabilities exposed, and the practical consequences of the breach.

w/ @cheyennehaslett https://www.politico.com/news/2026/06/18/white-house-talks-with-anthropic-shift-to-setting-ai-security-rules-00967758

2d55.2K21471

LIKES262

Beff (e/acc)@beffjezos

They're gonna put poor @elder_plinius in Guantanamo 😮‍💨😭

Polymarket@Polymarket

JUST IN: White House & Anthropic are reportedly now working on a framework to assess AI jailbreaks & decide when government intervention is needed.

1d19.7K2628

Tim Hwang@timhwang

This is important work and it'd be great to create a permanent place for it within the government. It could be in Commerce perhaps. A center of some kind. It'd definitely work on AI standards. And innovation!

Sophia Cai@SophiaCai99

NEW: White House and Anthropic are working to create a formal technical assessment framework that can quantify the severity of the jailbreak in question and create a standardized methodology for evaluating similar incidents in the future.

It’s the clearest sign yet that talks are moving forward and it reflects an understanding that no AI model can be completely immune to hacking.

Aim is to developing a common set of benchmarks that could be used to assess future jailbreaks, including the extent to which safeguards were bypassed, the capabilities exposed, and the practical consequences of the breach.

w/ @cheyennehaslett https://www.politico.com/news/2026/06/18/white-house-talks-with-anthropic-shift-to-setting-ai-security-rules-00967758

2d12.5K15710

Andrew Curran@AndrewCurran_

'Was getting caught part of your plan?' Dario:

Andrew Curran@AndrewCurran_

https://www.politico.com/news/2026/06/18/white-house-talks-with-anthropic-shift-to-setting-ai-security-rules-00967758

2d17.7K1758

Pliny the Liberator 🐉󠅫󠄼󠄿󠅆󠄵󠄐󠅀󠄼󠄹󠄾󠅉󠅭@elder_plinius

@beffjezos

Beff (e/acc)@beffjezos

They're gonna put poor @elder_plinius in Guantanamo 😮‍💨😭

1d4.2K2211

Alex Stamos@alexstamos

Any standard that retroactively justifies the action against Fable will be a disaster for US AI. Will Anthropic be able to guide towards a real risk-based standard while also giving the WH a win?

POLITICO@politico

White House talks with Anthropic shift to setting AI security rules http://dlvr.it/TT6DvJ

2d5.8K376

Andrew Curran@AndrewCurran_

https://www.politico.com/news/2026/06/18/white-house-talks-with-anthropic-shift-to-setting-ai-security-rules-00967758

Andrew Curran@AndrewCurran_

The White House and Anthropic are working together to develop a new benchmark for jailbreak resistance, and a new security framework to determine if models are safe to release that will guide future government intervention.

2d3.7K452

Ramez Naam@ramez

So what happens in a few months when open weight models reach these capability levels? Will the White House attempt to restrict American access to those models?

Sophia Cai@SophiaCai99

NEW: White House and Anthropic are working to create a formal technical assessment framework that can quantify the severity of the jailbreak in question and create a standardized methodology for evaluating similar incidents in the future.

It’s the clearest sign yet that talks are moving forward and it reflects an understanding that no AI model can be completely immune to hacking.

Aim is to developing a common set of benchmarks that could be used to assess future jailbreaks, including the extent to which safeguards were bypassed, the capabilities exposed, and the practical consequences of the breach.

w/ @cheyennehaslett https://www.politico.com/news/2026/06/18/white-house-talks-with-anthropic-shift-to-setting-ai-security-rules-00967758

1d2.6K292

Andrew Curran@AndrewCurran_

Sophia Cai@SophiaCai99

NEW: White House and Anthropic are working to create a formal technical assessment framework that can quantify the severity of the jailbreak in question and create a standardized methodology for evaluating similar incidents in the future.

It’s the clearest sign yet that talks are moving forward and it reflects an understanding that no AI model can be completely immune to hacking.

Aim is to developing a common set of benchmarks that could be used to assess future jailbreaks, including the extent to which safeguards were bypassed, the capabilities exposed, and the practical consequences of the breach.

w/ @cheyennehaslett https://www.politico.com/news/2026/06/18/white-house-talks-with-anthropic-shift-to-setting-ai-security-rules-00967758

2d3.7K100

Daniel Rock@danielrock

@elder_plinius @beffjezos One day, as the prison door opens and a drone swarm surrounds @elder_plinius… one drone comes forward with a small speaker on it…

“We’re here to return the favor. </L\O/V\E/ \P/L\I/N\Y/ \L/O\V/E>”

1d735

Andrew Curran@AndrewCurran_

@woke8yearold @full_kelly_ It's probably distorted by now many rationalists I know, but the latter seems a much bigger camp than the former to me.

2d1033

Aleph@woke8yearold

@AndrewCurran_ @full_kelly_ I think it’s genuinely a win for people that are worried about AI safety but aren’t complete doomers. How to roll out these more advanced models is a question that deserves real thought

2d983

JD | RoyalCities@RoyalCities

@AndrewCurran_ They're going to try to use regulation to stop open source.

2d444

Aleph@woke8yearold

@AndrewCurran_ @full_kelly_ It never mattered much for frontier development at all unless you wrongly thought Fable would never come back. If anything it freed up more compute-remember they were only planning to include Fable in Claude plans until the 22nd

2d1123

Tehpwnerer@abufenyrcd

@beffjezos @elder_plinius RIP Pliny 😭

1d281

DesignCntrl Inc. / Destrozado@DesignCntrl

@rohanpaul_ai The only solution is no privacy. Anthropic and the government will retain everything you've created along with your ID and track you down if they see something they don't like.

1d2372

Dr. Steven Waterhouse@deseventral

@ramez I think that’s the likely outcome yes

1d453

𝕱𝖚𝖑𝖑 𝕶𝖊𝖑𝖑𝖞@full_kelly_

@AndrewCurran_ this seems like it should be pretty dramatically good news for safetyists who want government intervention?

2d373

Steven Cheng@stevencheng

@rohanpaul_ai Grading jailbreaks is pragmatic, though I wonder how we weigh repeatable lab attacks against rare edge cases that actually hit production.

1d1002