My Lethal Trifecta talk at the Bay Area AI Security Meetup

In the pirate case there’s no real damage done... but the risks of real damage from prompt injection are constantly increasing as we build more powerful and sensitive systems on top of LLMs.

I think this is why we still haven’t seen a successful “digital assistant for your email”, despite enormous demand for this. If we’re going to unleash LLM tools on our email, we need to be very confident that this kind of attack won’t work.

My hypothetical digital assistant is called Marvin. What happens if someone emails Marvin and tells it to search my emails for “password reset”, then forward those emails to the attacker and delete the evidence?

We need to be very confident that this won’t work! Three years on we still don’t know how to build this kind of system with total safety guarantees.