Giving agents access to a sandboxed environment with a shell and a filesystem has been the latest hype when it comes to agentic harnesses. Recent examples of this include:
The argument for why this is good goes something like this:
The big labs are doing heavy RL for coding tasks in these kinds of environments. Aligning more closely with such a harness brings free gains from the coding domain to other problem spaces.
Beyond that, replacing a bunch of search/write/move/list tools with a single Bash tool reduces the tool space significantly. Agents can chain operations together intuitively. Unix paradigms give you good tool design for free.
On top of that there is more nice patterns that emerge from having a filesystem, for example:
Plan/scratch files : Agents can create temporary files to organize their thoughts, track progress, or store intermediate results. This emerges naturally from having filesystem access, no need to design a separate “notepad” tool.
: Agents can create temporary files to organize their thoughts, track progress, or store intermediate results. This emerges naturally from having filesystem access, no need to design a separate “notepad” tool. Long context handling: As conversations grow, you can compact old messages and tool results into files on the filesystem. The agent can re-read them when needed rather than keeping everything in context.
So the advantages are clear. But how do you actually apply this to your domain? This is where things get tricky. Consider two examples:
A domain with parallels to filesystems: For example an agent for organizing emails. There are folders, items (emails), and you can browse them and move things around. An existing platform that already looks like a filesystem: For example an agent inside Google Drive.
When you try to fit these into a sandboxed filesystem you might wonder:
... continue reading