A quick look at unprivileged sandboxing

Disclaimer: This is to the best of my knowledge. It's a complicated topic, there are tons of options, and this only covers a tiny fraction of it anyway. If you spot mistakes, please tell me.

Suppose you have a server daemon that you want to confine to a single directory. During the startup phase of the program, it also needs to read some files outside of that directory -- you can apply the confinement only when that phase is done.

Suppose you want to run this as an ordinary unprivileged user. No root, no SUID. The program, and the program alone, shall be able to set up its own sandbox while running as an unprivileged user.

How can you do this nowadays?

This is not an exhaustive list and all of the following focuses only on filesystem access.

Common code in util.h and setup

All of the following programs include this header library:

    #ifndef UTIL_H
    #define UTIL_H

    #include <dirent.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>

    /* Print the errno-based message and exit. */
    void
    die(char *perror_msg)
    {
        perror(perror_msg);
        exit(EXIT_FAILURE);
    }

    /* List the entries of a directory, one per line. */
    void
    ls(char *path)
    {
        DIR *dir;
        struct dirent *de;

        printf("==> ls [%s]\n", path);

        dir = opendir(path);
        if (dir == NULL) {
            perror("opendir");
            return;
        }

        while ((de = readdir(dir)) != NULL)
            printf("%s\n", de->d_name);

        closedir(dir);
    }

    /* Create (or truncate) a file. */
    void
    touch(char *path)
    {
        FILE *fp;

        printf("==> touch [%s]\n", path);

        if ((fp = fopen(path, "w")) == NULL) {
            perror("fopen");
            return;
        }

        printf("OK\n");
        fclose(fp);
    }

    #endif /* UTIL_H */

You also want to run this once:

    $ mkdir /tmp/subdir
    $ date >/tmp/existing-file
    $ date >/tmp/subdir/existing-file-2

OpenBSD: unveil()

(... and pledge().)

These are two extraordinary syscalls, because they are super easy to use.

By default, a process starts up with standard filesystem access. The first call to unveil() limits what the process can see to the directory that you pass; everything else is hidden. Additional calls can enable more directories.
A final call unveil(NULL, NULL) locks it in: no further unveil() calls are allowed.

It looks like this:

    #include <unistd.h>

    #include "util.h"

    int
    main()
    {
        if (unveil("/tmp/subdir", "rwxc") < 0)
            die("unveil subdir");

        if (unveil(NULL, NULL) < 0)
            die("unveil final");

        ls("/tmp");
        ls("/tmp/subdir");
        touch("/tmp/hello");
        touch("/tmp/subdir/hello");

        return 0;
    }

Example run:

    $ cc -Wall -Wextra -std=c99 -o unveil unveil.c
    $ ./unveil
    ==> ls [/tmp]
    opendir: No such file or directory
    ==> ls [/tmp/subdir]
    .
    ..
    existing-file-2
    ==> touch [/tmp/hello]
    fopen: No such file or directory
    ==> touch [/tmp/subdir/hello]
    OK

This really is as easy as it gets, while still being useful.

One thing to note is that these rules only apply to the current process. When you exec(), it resets -- assuming there is something that you can exec() into.

pledge() is another syscall in this area. It can restrict which (groups of) syscalls a process is allowed to call. It can deny exec(), for example.

Both pledge() and unveil() are used in OpenBSD's base system in a lot of places, all the way down to tools like ps or tee.

Recommending this talk by Bob Beck.

Linux: Landlock

Landlock is relatively young and still evolving. In the filesystem area, it is similar to unveil() -- but arguably harder to use. I think I once read somewhere that the goal is to eventually have an unveil()-like library and not use the low-level API directly.

Here's a program that does the same thing as the OpenBSD example above:

    #define _GNU_SOURCE /* for syscall() and O_PATH */
    #include <fcntl.h>
    #include <linux/landlock.h>
    #include <linux/prctl.h>
    #include <stddef.h>
    #include <stdio.h>
    #include <sys/prctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #include "util.h"

    int
    main()
    {
        /* -------------------------------------------------------------- */
        /* Setup                                                          */
        /* -------------------------------------------------------------- */

        /* Which actions do we want to handle? All of this is denied by
         * default, unless we add a rule to allow it. */
        struct landlock_ruleset_attr attr = {0};
        attr.handled_access_fs =
            LANDLOCK_ACCESS_FS_EXECUTE |
            LANDLOCK_ACCESS_FS_WRITE_FILE |
            LANDLOCK_ACCESS_FS_READ_FILE |
            LANDLOCK_ACCESS_FS_READ_DIR |
            LANDLOCK_ACCESS_FS_REMOVE_DIR |
            LANDLOCK_ACCESS_FS_REMOVE_FILE |
            LANDLOCK_ACCESS_FS_MAKE_CHAR |
            LANDLOCK_ACCESS_FS_MAKE_DIR |
            LANDLOCK_ACCESS_FS_MAKE_REG |
            LANDLOCK_ACCESS_FS_MAKE_SOCK |
            LANDLOCK_ACCESS_FS_MAKE_FIFO |
            LANDLOCK_ACCESS_FS_MAKE_BLOCK |
            LANDLOCK_ACCESS_FS_MAKE_SYM |
            LANDLOCK_ACCESS_FS_REFER |
            LANDLOCK_ACCESS_FS_TRUNCATE;

        int ruleset_fd;
        if ((ruleset_fd = syscall(SYS_landlock_create_ruleset,
                                  &attr, sizeof attr, 0)) < 0)
            die("Could not create landlock ruleset");

        /* Now set up rules for our directory, where we allow everything. */
        struct landlock_path_beneath_attr path_beneath = {0};
        path_beneath.allowed_access =
            LANDLOCK_ACCESS_FS_EXECUTE |
            LANDLOCK_ACCESS_FS_WRITE_FILE |
            LANDLOCK_ACCESS_FS_READ_FILE |
            LANDLOCK_ACCESS_FS_READ_DIR |
            LANDLOCK_ACCESS_FS_REMOVE_DIR |
            LANDLOCK_ACCESS_FS_REMOVE_FILE |
            LANDLOCK_ACCESS_FS_MAKE_CHAR |
            LANDLOCK_ACCESS_FS_MAKE_DIR |
            LANDLOCK_ACCESS_FS_MAKE_REG |
            LANDLOCK_ACCESS_FS_MAKE_SOCK |
            LANDLOCK_ACCESS_FS_MAKE_FIFO |
            LANDLOCK_ACCESS_FS_MAKE_BLOCK |
            LANDLOCK_ACCESS_FS_MAKE_SYM |
            LANDLOCK_ACCESS_FS_REFER |
            LANDLOCK_ACCESS_FS_TRUNCATE;

        path_beneath.parent_fd = open("/tmp/subdir", O_PATH | O_CLOEXEC);
        if (path_beneath.parent_fd < 0)
            die("Could not open /tmp/subdir");

        if (syscall(SYS_landlock_add_rule, ruleset_fd,
                    LANDLOCK_RULE_PATH_BENEATH, &path_beneath, 0) < 0)
            die("Could not add rule for /tmp/subdir");

        close(path_beneath.parent_fd);

        /* Deny future SUID programs. Also required for the
         * landlock_restrict_self call below. */
        if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
            die("Could not set no_new_privs");

        /* Activate landlock rules. */
        if (syscall(SYS_landlock_restrict_self, ruleset_fd, 0) < 0)
            die("Failed to enforce ruleset");

        close(ruleset_fd);

        /* -------------------------------------------------------------- */
        /* Actual program, now in sandbox                                 */
        /* -------------------------------------------------------------- */

        ls("/tmp");
        ls("/tmp/subdir");
        touch("/tmp/hello");
        touch("/tmp/subdir/hello");

        return 0;
    }

Example run:

    $ cc -Wall -Wextra -std=c99 -o linux-ll linux-ll.c
    $ ./linux-ll
    ==> ls [/tmp]
    opendir: Permission denied
    ==> ls [/tmp/subdir]
    .
    ..
    existing-file-2
    ==> touch [/tmp/hello]
    fopen: Permission denied
    ==> touch [/tmp/subdir/hello]
    OK

Unlike unveil(), these rules apply to the current process and its children, and they stay in effect across exec().

setpriv for the command line

The setpriv program from util-linux has basic support for Landlock:

    setpriv \
        --landlock-access fs \
        --landlock-rule path-beneath:read-file,read-dir,...:/foo \
        --landlock-rule ... \
        my-program

It's nice, but obviously you lose some of the advantages of Landlock, because setpriv is an external process. You can only confine your entire program this way, not just the part after the startup phase.

Linux: User Namespaces

This is what Bubblewrap uses under the hood if you don't install it SUID. See also user_namespaces(7).

If unprivileged sandboxing were always as complicated as this, I, personally, would only use it in very specific scenarios where it's absolutely needed. I'm not even sure if I did everything correctly or if there are severe bugs.

To write the following program, I studied the output of strace bwrap ... and also read some of Bubblewrap's source code.

The following program (some error checking in the beginning omitted, and a UID of 1000 and a GID of 100 hardcoded) achieves a similar result as above:

    #define _GNU_SOURCE /* for sched.h and syscall() */
    #include <fcntl.h>
    #include <linux/capability.h>
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mount.h>
    #include <sys/prctl.h>
    #include <sys/stat.h>
    #include <sys/syscall.h>
    #include <sys/types.h>
    #include <unistd.h>

    #include "util.h"

    int
    main()
    {
        /* -------------------------------------------------------------- */
        /* Setup                                                          */
        /* -------------------------------------------------------------- */

        /* Make new user namespace, which then allows us to also make a new
         * mount namespace. We gain additional privileges in those namespaces,
         * allowing us to do things like mount(). */
        if (unshare(CLONE_NEWUSER | CLONE_NEWNS) != 0)
            die("unshare");

        /* Map UID 1000 to 1000. Operations like mkdir() want this. */
        FILE *fp;
        fp = fopen("/proc/self/uid_map", "w");
        fprintf(fp, "1000 1000 1\n");
        fclose(fp);

        /* Deny further setgroups() calls. Writing to gid_map wants this. */
        fp = fopen("/proc/self/setgroups", "w");
        fprintf(fp, "deny\n");
        fclose(fp);

        /* Map GID 100 to 100. */
        fp = fopen("/proc/self/gid_map", "w");
        fprintf(fp, "100 100 1\n");
        fclose(fp);

        /* A tmpfs to use as the new root, then pivot into that and keep the
         * old root at /oldroot. */
        if (mount("tmpfs", "/tmp", "tmpfs", 0, NULL) < 0)
            die("Could not mount tmpfs");
        if (mkdir("/tmp/oldroot", 0755) < 0)
            die("Could not mkdir oldroot");
        if (syscall(SYS_pivot_root, "/tmp", "/tmp/oldroot") < 0)
            die("Could not pivot root to tmpfs");
        if (chdir("/"))
            die("Could not chdir '/'");

        /* Assemble new mount points in /newroot. */
        if (mkdir("/newroot", 0755) < 0)
            die("Could not mkdir /newroot");

        /* Later on, pivot_root will require that the new root is a mount
         * point, so bind mount newroot to itself. */
        if (mount("/newroot", "/newroot", NULL, MS_BIND, NULL) < 0)
            die("Could not bind mount /newroot");

        /* /tmp gets pulled from nowhere, so we must create empty dummy
         * directories. Note the mode 0100 to only allow traversing through
         * it but not allowing reading it nor creating new files. */
        if (mkdir("/newroot/tmp", 0100) < 0)
            die("Could not mkdir /newroot/tmp");
        if (mkdir("/newroot/tmp/subdir", 0755) < 0)
            die("Could not mkdir /newroot/tmp/subdir");

        /* Bind mount the original /tmp/subdir. */
        if (mount("/oldroot/tmp/subdir", "/newroot/tmp/subdir", NULL,
                  MS_BIND | MS_REC, NULL) < 0)
            die("Could not bind mount /oldroot/tmp/subdir");

        /* Pivot again to /newroot. This uses Bubblewrap's trick of
         * pivot_root(".", "."), which slightly goes against the
         * documentation. We also use Bubblewrap's oldrootfd trick. */
        int oldrootfd;
        if ((oldrootfd = open("/", O_DIRECTORY | O_RDONLY)) < 0)
            die("Could not open '/'");
        if (chdir("/newroot"))
            die("Could not chdir /newroot");
        if (syscall(SYS_pivot_root, ".", ".") < 0)
            die("Could not pivot root to /newroot");
        if (fchdir(oldrootfd) < 0)
            die("Could not fchdir to oldrootfd");
        if (umount2(".", MNT_DETACH) < 0)
            die("Could not umount oldroot");
        if (chdir("/"))
            die("Could not chdir '/'");

        /* Drop privileges, set everything in stone. */
        struct __user_cap_header_struct hdr = {_LINUX_CAPABILITY_VERSION_3, 0};
        struct __user_cap_data_struct data[2] = {{0}};
        if (syscall(SYS_capset, &hdr, data) < 0)
            die("Could not do capset");

        /* Deny future SUID programs. */
        if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
            die("Could not set no_new_privs");

        /* -------------------------------------------------------------- */
        /* Actual program, now in sandbox                                 */
        /* -------------------------------------------------------------- */

        ls("/tmp");
        ls("/tmp/subdir");
        touch("/tmp/hello");
        touch("/tmp/subdir/hello");

        return 0;
    }

Example run:

    $ cc -Wall -Wextra -std=c99 -o linux-ns linux-ns.c
    $ ./linux-ns
    ==> ls [/tmp]
    opendir: Permission denied
    ==> ls [/tmp/subdir]
    .
    ..
    existing-file-2
    ==> touch [/tmp/hello]
    fopen: Permission denied
    ==> touch [/tmp/subdir/hello]
    OK

But it's not exactly the same. The process sees some /tmp directory, it's just not the original one.
The program also didn't make sure that the process cannot write to this replacement /tmp (it could run chmod()), although that's just a minor problem, because this data will be discarded once the program exits. There are also probably many side effects from the new namespaces that I didn't really explore.

To further understand what's going on, the lsns command line tool is useful. To inspect the mounts of a specific process, have a look at /proc/$pid/mounts.

Conclusion

pledge() and unveil() are great. I hope that Linux gets something that's just as easy to use. Landlock -- or a small library on top of it -- might be just that, eventually.

Being easy to use is critical to widespread adoption, I think, and also to correctness. The sandbox will only be as good as your understanding of it.
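
A small library on top of Landlock would also have to cope with older kernels: LANDLOCK_ACCESS_FS_REFER needs Landlock ABI version 2 and LANDLOCK_ACCESS_FS_TRUNCATE needs version 3, and landlock_create_ruleset() rejects a ruleset that handles rights the running kernel does not know. A minimal best-effort sketch, loosely following the approach described in the kernel's Landlock documentation. The function names landlock_abi() and fs_rights_for_abi() are made up for this illustration:

```c
#define _GNU_SOURCE /* for syscall() */
#include <linux/landlock.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Fallbacks for older kernel headers; the values are the ones from the
 * kernel's UAPI header. */
#ifndef LANDLOCK_ACCESS_FS_REFER
#define LANDLOCK_ACCESS_FS_REFER (1ULL << 13)
#endif
#ifndef LANDLOCK_ACCESS_FS_TRUNCATE
#define LANDLOCK_ACCESS_FS_TRUNCATE (1ULL << 14)
#endif

/* Hypothetical helper: highest Landlock ABI version supported by the
 * running kernel, or -1 if Landlock is not available. */
int
landlock_abi(void)
{
    return (int)syscall(SYS_landlock_create_ruleset, NULL, 0,
                        LANDLOCK_CREATE_RULESET_VERSION);
}

/* Hypothetical helper: the access rights from the Landlock example,
 * reduced to what the given ABI version supports. */
uint64_t
fs_rights_for_abi(int abi)
{
    uint64_t rights =
        LANDLOCK_ACCESS_FS_EXECUTE |
        LANDLOCK_ACCESS_FS_WRITE_FILE |
        LANDLOCK_ACCESS_FS_READ_FILE |
        LANDLOCK_ACCESS_FS_READ_DIR |
        LANDLOCK_ACCESS_FS_REMOVE_DIR |
        LANDLOCK_ACCESS_FS_REMOVE_FILE |
        LANDLOCK_ACCESS_FS_MAKE_CHAR |
        LANDLOCK_ACCESS_FS_MAKE_DIR |
        LANDLOCK_ACCESS_FS_MAKE_REG |
        LANDLOCK_ACCESS_FS_MAKE_SOCK |
        LANDLOCK_ACCESS_FS_MAKE_FIFO |
        LANDLOCK_ACCESS_FS_MAKE_BLOCK |
        LANDLOCK_ACCESS_FS_MAKE_SYM |
        LANDLOCK_ACCESS_FS_REFER |
        LANDLOCK_ACCESS_FS_TRUNCATE;

    if (abi < 2)
        rights &= ~LANDLOCK_ACCESS_FS_REFER;    /* added in ABI v2 */
    if (abi < 3)
        rights &= ~LANDLOCK_ACCESS_FS_TRUNCATE; /* added in ABI v3 */

    return rights;
}
```

The result of fs_rights_for_abi(landlock_abi()) could then be assigned to both handled_access_fs and allowed_access in the Landlock example. If landlock_abi() returns a value below 1, there is no Landlock at all, and the program has to decide whether to abort or to run without a sandbox.
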