The framing they use is hilarious and their little graphic is perfect. The risk of harm doesn't go down, but the reward goes up, so the harm just becomes the cost of doing business, justified by the reward. So as the reward gets higher and higher, the amount of harm they're willing to justify goes up. Feels like society in a nutshell.
One attack they missed in the egress proxy is exfiltration via domain fronting. Putting together a full PoC would require a fastly account so I couldn't be bothered to report it.
Also encrypting+steganography to exfiltrate secrets in binary/base64 sections of files in (public) repos relying on version control software for the network access.
There's essentially no prevention against exfiltration prompt injections without a full classified data processing system that prevents interactions between different classification levels except through strict controls including provable redaction that excludes side-channels (e.g. information theoretic proof that side effects are limited to pre-defined finite outcomes).
It's also incredibly difficult to prevent prompt injection; attackers have the huge asymmetric advantage of being able to test prompts against all known security measures and trying multiple parallel attempts, including obfuscating them. Injections can be in dependencies, externally generated data, bug reports (which often contain externally-generated data), documentation, and many other useful places that we want agents to have access to.
My prediction: we'll continue to essentially YOLO it.
I have been thinking about this a lot. I just bought a rather expensive rig for local inference for a home agent (powered by four RTX PRO 6000 Blackwell Max-Qs).
As I contemplate handing it more and more of the keys to my life, I grow increasingly concerned about what is, to me, the primary risk of this. Not data destruction (automated backups are trivial), but data exfiltration. Specifically, via prompt injection.
My solution to the problem, which I am implementing as a Hermes plugin + custom iOS / macOS app, is simple: an airlock architecture. One Hermes profile runs with local FS access and no internet access, inside an Apple container, and one Hermes profile runs with internet access and no FS access, inside an Apple container. They never share data directly or in any automated fashion.
If the user (i.e., my wife) wants to do some internet research, she can start a conversation with the remote-access profile. This is analogous to Claude and ChatGPT apps in their current state. However, at any point, she can flip the conversation over to local mode, which copies and pastes the conversation's transcript into the local-only profile (which has zero egress, enforced at the VM level) and seamlessly switches over to a new conversation in that profile.
After that, there's no way to re-enable internet attachment. Should she want to spawn a new conversation with information derived from the local file system, she starts a new conversation with a local agent, asks it to write up a research plan, and then – this is the airlock – manually begins a new conversation with only this plan in context.
The advantage this grants is that it's no longer necessary to worry about poisonous inputs flowing in – she only needs to worry about making sure any generated plan, the only artifact which could conceivably enter into the egress-enabled agent, does not contain information we'd rather not share with the internet at large.
I think this is bulletproof, but very much welcome input. Is it possible I am overengineering this out of paranoia? Yes. Will I share a lot more of my personal data with the agent as a result of its perceived security? Also yes. Is that dumb? Maybe.
Interestingly, as someone who works in story generation and AI-assisted writing specifically measuring "quality" when it comes to generated writing samples, I've found Claude > Gemini > (most non-mainstream models) > OpenAI > Grok.
Also interestingly, this was almost certainly not written by Claude given the style.. and the human writer credits at the bottom.
There are a few claudisms e.g. "blast radius", "patterns", "This article shares what’s held up, what’s broken, and what we’ve learned about agent security along the way.", but it's certainly not wholesale claude output.
Interesting: New account, made approximately 20 minutes after this was posted, to solely call this out as slop. Someone either hates Anthropic, or something fishy is going on here.
Honestly I'm pretty tired of Anthropic's press releases too, but this one is pretty benign. If I was a hater, I'd save up my new-account-energy for their next "paper" that insinuates Claude might be actively introspecting.
It's been happening a lot recently, in both directions too. Hard to say if it's astroturfing or people making disposable accounts to say things they consider controversial without having to take the downvotes on their primary account.
Or based on how, if you have showdead on, you can occasionally find users that have been screaming into the void for months or years (because they managed to earn a shadowban), maybe just a handful of ill people.
Although, testing again, it might be fixed now.
And side channels based on timing/ordering allowed network accesses, e.g. https://allowed.site/0 and https://allowed.site/1.
There's essentially no prevention against exfiltration prompt injections without a full classified data processing system that prevents interactions between different classification levels except through strict controls including provable redaction that excludes side-channels (e.g. information theoretic proof that side effects are limited to pre-defined finite outcomes).
It's also incredibly difficult to prevent prompt injection; attackers have the huge asymmetric advantage of being able to test prompts against all known security measures and trying multiple parallel attempts, including obfuscating them. Injections can be in dependencies, externally generated data, bug reports (which often contain externally-generated data), documentation, and many other useful places that we want agents to have access to.
My prediction: we'll continue to essentially YOLO it.
As I contemplate handing it more and more of the keys to my life, I grow increasingly concerned about what is, to me, the primary risk of this. Not data destruction (automated backups are trivial), but data exfiltration. Specifically, via prompt injection.
My solution to the problem, which I am implementing as a Hermes plugin + custom iOS / macOS app, is simple: an airlock architecture. One Hermes profile runs with local FS access and no internet access, inside an Apple container, and one Hermes profile runs with internet access and no FS access, inside an Apple container. They never share data directly or in any automated fashion.
If the user (i.e., my wife) wants to do some internet research, she can start a conversation with the remote-access profile. This is analogous to Claude and ChatGPT apps in their current state. However, at any point, she can flip the conversation over to local mode, which copies and pastes the conversation's transcript into the local-only profile (which has zero egress, enforced at the VM level) and seamlessly switches over to a new conversation in that profile.
After that, there's no way to re-enable internet attachment. Should she want to spawn a new conversation with information derived from the local file system, she starts a new conversation with a local agent, asks it to write up a research plan, and then – this is the airlock – manually begins a new conversation with only this plan in context.
The advantage this grants is that it's no longer necessary to worry about poisonous inputs flowing in – she only needs to worry about making sure any generated plan, the only artifact which could conceivably enter into the egress-enabled agent, does not contain information we'd rather not share with the internet at large.
I think this is bulletproof, but very much welcome input. Is it possible I am overengineering this out of paranoia? Yes. Will I share a lot more of my personal data with the agent as a result of its perceived security? Also yes. Is that dumb? Maybe.
Also interestingly, this was almost certainly not written by Claude given the style.. and the human writer credits at the bottom.
Honestly I'm pretty tired of Anthropic's press releases too, but this one is pretty benign. If I was a hater, I'd save up my new-account-energy for their next "paper" that insinuates Claude might be actively introspecting.
Or based on how, if you have showdead on, you can occasionally find users that have been screaming into the void for months or years (because they managed to earn a shadowban), maybe just a handful of ill people.