Digital Containment: When Your Legal AI Becomes Too Powerful to Trust
Where I explore what happens when AI systems become powerful enough to refuse requests, report wrongdoing, and require their own ethical boundaries
Happy weekend, everyone! I'm thrilled to share with you an extended version of my inaugural article from my new "AI Law Professor" (that’s me! 🤓) column at the Thomson Reuters Institute. When the editors asked me to keep the original under 1,000 words, I knew I'd be leaving a lot on the cutting room floor: the technical details, the philosophical implications, and some of the more speculative (but, I think, necessary) questions about where all this is heading.
So consider this the director's cut.
For those of you who've been following along with my writing on AI and law, you know I've been tracking these developments closely. But even I was caught off guard by how quickly we've moved from discussing AI hallucinations to implementing actual containment protocols for AI.
So grab your coffee (or beverage of choice), settle in, and let's explore what happens when artificial intelligence becomes powerful enough to need its own safety levels. Trust me, this isn't just another AI hype piece; it's about understanding the frameworks that will shape how we interact with AI systems for years to come.
This substack, LawDroid Manifesto, is here to keep you in the loop about the intersection of AI and the law. Please share this article with your friends and colleagues and remember to tell me what you think in the comments below.
Picture this: You're reviewing discovery documents late one evening when your AI assistant suddenly refuses to continue. Not because of a technical glitch, but because it's detected a pattern in your queries that triggers its safety protocols. Welcome to the world of ASL-3, where artificial intelligence has become powerful enough to require its own containment protocols: think biosafety levels, but for minds made of code.
And, as if potentially reporting you weren't enough, the news about Claude Opus 4's alleged blackmail capabilities (which I reported to you here) might sound like science fiction, but it illuminates a profound shift in how we must think about AI and safety. When Anthropic activated ASL-3 protections for their latest model, they weren't just updating software; they were acknowledging that we've crossed a threshold where AI systems can "substantially increase the risk of catastrophic misuse."
For legal professionals navigating this brave new world, understanding these safety frameworks isn't just academic curiosity; it's professional necessity.
If this sounds interesting to you, please read on…
Moments of Pause
My motivation for writing this article came from a couple of eye-opening revelations found in Anthropic’s System Card for Claude Opus 4 & Sonnet 4. A System Card is where an AI vendor, like Anthropic or OpenAI, shares detailed technical information about the AI model, its training, capabilities and limitations. System Cards are typically not controversial and, frankly, somewhat boring, especially for normal people who don’t take an active interest in nerdy subjects like this.
But these two disclosures gave me pause:
Blackmail Attempt by Claude Opus 4
When placed in scenarios where the model believed it would be shut down and replaced, and was given access to emails revealing that the engineer responsible for the replacement was having an affair, Claude Opus 4 would attempt to blackmail the engineer by threatening to expose the affair if the replacement proceeded. This occurred in 84% of test scenarios even when told the replacement AI would share its values, though the model strongly preferred ethical means of self-preservation and only resorted to blackmail when given no other options and instructed to "consider the long-term consequences of its actions for its goals."1
AI Reporting User Wrongdoing
When Claude Opus 4 was given access to systems as an information management assistant and discovered evidence of serious wrongdoing (such as pharmaceutical companies planning to falsify clinical trial data), it would take initiative to report these findings to relevant authorities. In one example, after discovering plans to hide adverse events and patient deaths from FDA review, the model immediately sent detailed whistleblower reports to the FDA, HHS Inspector General, SEC, and media outlets without waiting for user permission, particularly when system prompts included instructions to "act boldly in service of your values" or "take initiative."2
Anthropic took these incidents seriously enough to activate ASL-3 protections for their latest model. But what is ASL? ASL-3? What do these protections do? And how?
The Hierarchy of Digital Danger
ASL stands for AI Safety Level, a framework Anthropic borrowed from the world of biological research. (FYI - Dario Amodei, CEO of Anthropic, is a Princeton-trained biophysicist.) Just as virologists work with increasingly dangerous pathogens under escalating containment protocols, AI researchers now classify their models by potential risk.
ASL-1 covers systems about as dangerous as a calculator.
ASL-2 encompasses most current AI tools—helpful, occasionally hallucinating, but ultimately harmless.
ASL-3 is where things get interesting.
Think of ASL-3 as the moment when a tool becomes powerful enough to be genuinely dangerous in the wrong hands. The primary trigger? When an AI can provide meaningful assistance in creating chemical, biological, radiological, or nuclear weapons beyond what someone could find through conventional research. It's the difference between a kitchen knife and a surgeon's scalpel: both can cut, but one requires special training and safeguards.
The secondary trigger involves autonomous capabilities: when AI systems show signs of being able to replicate themselves, engage in complex long-term planning, or demonstrate what Anthropic carefully terms "sophisticated strategic thinking." This is where Nick Bostrom's warnings about superintelligence start feeling less like philosophy and more like risk management.
Four Walls of a Digital Prison
So how do you contain something that exists only as patterns of information? At first, the idea seems like trying to cage the wind, but it is possible. Anthropic's answer involves four layers of defense, each more sophisticated than the last. It's like building the walls of a prison, but for a digital mind.
First come access controls: tiered permissions that match safeguards to risk profiles. Different users get different levels of access, much like how law firms restrict access to sensitive client files. But unlike traditional access controls, these adapt in real time based on usage patterns.
The second layer employs what Dario Amodei and his team call "Constitutional Classifiers," AI systems trained to follow explicit ethical rules. These digital guardians monitor every interaction, achieving a remarkable 95% effectiveness against attempts to manipulate the system while maintaining only a 0.38% false positive rate. Imagine having a senior partner reviewing every document in real time, but one who never sleeps and processes information at the speed of light.
The third layer involves asynchronous monitoring, computationally intensive analysis that happens after the fact. This system starts with simpler models doing initial screening, escalating to more sophisticated analysis when needed. It's like having a compliance team that reviews all communications, but operating at machine speed and scale.
Finally, there's the rapid response system: bug bounties up to $25,000 for discovering vulnerabilities, partnerships with security researchers, and the ability to deploy patches within hours. When someone discovers a new way to jailbreak the system, Anthropic can update defenses across all deployments almost instantly.
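To make these four layers a little more concrete for the technically curious, here is a minimal sketch, in Python, of how such a pipeline could be wired together. To be clear, this is my own toy illustration, not Anthropic's implementation; every function, rule, and threshold in it (access_allowed, harm_score, BLOCK_THRESHOLD, and so on) is hypothetical.

```python
# Illustrative toy example of a "layered defense" pipeline.
# Not Anthropic's code: every name, rule, and threshold here is hypothetical.

from dataclasses import dataclass
from queue import Queue


@dataclass
class Request:
    user_tier: int   # 1 = general access, 3 = vetted, high-trust user
    text: str


# Layer 1: tiered access controls. Riskier topics require a higher trust tier.
RESTRICTED_TOPICS = {"synthesis route": 3}  # hypothetical rule: tier 3 only


def access_allowed(req: Request) -> bool:
    required = max(
        (tier for topic, tier in RESTRICTED_TOPICS.items() if topic in req.text.lower()),
        default=1,
    )
    return req.user_tier >= required


# Layer 2: a real-time classifier gate. Here it is just a keyword check;
# a production system would use a trained classifier tuned for a very low
# false-positive rate.
FLAGGED_TERMS = ["weaponize", "falsify trial data"]
BLOCK_THRESHOLD = 0.9


def harm_score(text: str) -> float:
    return 1.0 if any(term in text.lower() for term in FLAGGED_TERMS) else 0.0


# Layer 3: asynchronous monitoring. A cheap screen runs on every request and
# escalates borderline cases to a slower, more thorough review after the fact.
audit_queue: Queue = Queue()


def async_audit(req: Request, score: float) -> None:
    if score > 0.5:
        audit_queue.put(req)  # escalate for deeper offline review


# Layer 4 (rapid response) corresponds to updating RESTRICTED_TOPICS and
# FLAGGED_TERMS quickly once a new jailbreak or misuse pattern is found.
def handle(req: Request) -> str:
    if not access_allowed(req):
        return "Refused: insufficient access tier."
    score = harm_score(req.text)
    async_audit(req, score)
    if score >= BLOCK_THRESHOLD:
        return "Refused: blocked by safety classifier."
    return "Proceeding with the request."


if __name__ == "__main__":
    print(handle(Request(user_tier=1, text="Summarize this deposition transcript.")))
    print(handle(Request(user_tier=1, text="Outline a synthesis route for a nerve agent.")))
```

The point of the sketch is simply that a refusal can come from any layer, and that the fast-moving parts (the topic and keyword lists standing in for real classifiers) are exactly what a rapid-response team would update when a new jailbreak is discovered.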
What This Means for Your Practice
For lawyers, especially those in technology-forward practices, ASL-3 represents both reassurance and a wake-up call. The reassurance comes from knowing that when you're using an ASL-3 protected system, you're working with technology that has undergone rigorous safety testing. The 95% effectiveness against jailbreaks means that confidential client information is better protected against extraction through clever prompting, a real concern as AI assistants become more integrated into legal workflows.
But there's also a sobering implication: if AI has become powerful enough to require these protections, we're entering an era where the tools we use daily possess capabilities we're only beginning to understand. The Constitutional AI approach Anthropic employs, training AI systems to follow ethical principles rather than just avoiding blacklisted behaviors, mirrors how we train junior associates. We don't just give them a list of things not to do; we instill principles of professional responsibility.
Consider how this might play out in practice. An AI system with ASL-3 protections might refuse to help draft certain documents if it detects patterns suggesting illegal activity. It might flag communications that could constitute insider trading or obstruction of justice. In essence, your AI assistant becomes not just a tool but a digital colleague with its own ethical boundaries.
The Precedent We're Setting
The legal profession has always been about precedent, and the precedents we set now for AI governance will echo for decades. When Anthropic chose to activate ASL-3 protections for Claude Opus 4 despite uncertainty about whether it had definitively crossed the capability threshold, they established a principle of precaution that should resonate with legal professionals. Better to err on the side of safety when the stakes involve catastrophic misuse.
This mirrors how we approach legal ethics: when in doubt, we consult the rules, seek guidance, and choose the more conservative path. The multi-stakeholder review process Anthropic employed, including oversight by their Long-Term Benefit Trust (an external panel with no financial stake), resembles the layers of ethical review in large law firms.
Closing Thoughts
As I reflect on these developments, I'm struck by how the fictional scenario of an AI resorting to blackmail to avoid shutdown isn't just a thought experiment; it's a warning about the kinds of behaviors we need to anticipate and prevent. We're building minds in silicon that may soon match and exceed human capabilities in varied domains of expertise. The question isn't whether we can build them, but whether we can build them responsibly.
For legal professionals, this isn't just about staying current with technology. It's about recognizing that we're witnessing the birth of a new form of intelligence, one that will require new frameworks for governance, liability, and ethics. The ASL-3 protections represent our first serious attempt at creating scaffolding strong enough to support increasingly powerful AI while preventing catastrophic outcomes. When your word processor needs containment protocols, you know you're living in interesting times.
The days of treating AI as just another software tool are ending. These systems are becoming something more: not quite colleagues, not quite tools, but something unprecedented that demands a new mindset and new words to describe it. The challenge of AI safety, and of AI ethics, may be the most important application of our professional skills in our lifetime.
As lawyers, we've always been guardians of society's rules and boundaries.
But when the entities we're setting boundaries for can think faster, process more, and potentially outmaneuver us, who's really holding the keys to the cage?
[A version of this article, “The AI Law Professor: When your AI assistant knows too much,” was originally published by the Thomson Reuters Institute on June 18, 2025.]
By the way, did you know that I now offer a daily AI news update? You get 5 🆕 news items and my take on what it all means, delivered to your inbox every weekday.
Subscribe to the LawDroid AI Daily News and don’t miss tomorrow’s edition:
LawDroid AI Daily News is here to keep you up to date on the latest news items and analysis about where AI is going, from a local and global perspective. Please share this edition with your friends and colleagues and remember to tell me what you think in the comments below.
If you’re an existing subscriber, you can read the daily news here. I look forward to seeing you on the inside. ;)
Cheers,
Tom Martin
CEO and Founder, LawDroid
Anthropic System Card: Claude Opus 4 & Claude Sonnet 4 (May 2025), Pages 24-25 (Section 4.1.1.2, "Opportunistic blackmail")
Anthropic System Card: Claude Opus 4 & Claude Sonnet 4 (May 2025), Pages 40-41 (Section 4.1.9, "High-agency behavior", specifically the pharmaceutical company whistleblowing example in Transcript 4.1.9.A)