
I taught a bucket to speak Git

Why This Matters

This article explores backing a Git server with an object storage bucket, showing that Git's core components (objects, trees, commits, and refs) can be mapped onto object storage. If the approach holds up, it could enable more scalable and cost-effective version control for cloud-native applications and large-scale data management, and smoother cloud-based development workflows.

Key Takeaways

What happens if I just point a git server at an object storage bucket?

Back when I was porting agent sandboxes to Go, I built everything on top of billy, a filesystem abstraction for Go. The whole trick of the project was teaching a Tigris bucket to act enough like a filesystem that a shell interpreter and its tools couldn’t tell the difference. Billy was the key layer that made the entire façade fall into place.

After I had gotten things working, I learned that I was using billy way outside its normal use case. It was originally made for go-git, a pure-Go implementation of git’s protocols and data formats. It doesn’t rely on the /usr/bin/git binary existing at all. Every method on billy’s filesystem interface exists purely because go-git needs it. This gave me a terrible idea: I already have a bucket that can quack like a filesystem, and go-git’s native language is “filesystem”.
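To make "a bucket that quacks like a filesystem" concrete, here is a minimal sketch using only the standard library. The `fileStore` interface and the `bucketFS` type are hypothetical names invented for illustration; billy's real interface has many more methods, and a real bucket would sit behind network calls rather than a map. The point is only that a flat key/value store can answer path-shaped questions.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
	"strings"
)

// fileStore is a hypothetical, much smaller stand-in for a filesystem
// abstraction. It is NOT billy's interface, which has many more methods.
type fileStore interface {
	WriteFile(path string, data []byte) error
	ReadFile(path string) ([]byte, error)
	List(prefix string) []string
}

// errNotFound is returned when no object exists under the requested key.
var errNotFound = errors.New("object not found")

// bucketFS pretends to be an object storage bucket: a flat key/value map
// in which "directories" are nothing more than shared key prefixes.
type bucketFS struct {
	objects map[string][]byte
}

func newBucketFS() *bucketFS {
	return &bucketFS{objects: map[string][]byte{}}
}

// key turns a filesystem-style path into a bucket key (no leading slash).
func key(path string) string {
	return strings.TrimPrefix(path, "/")
}

func (b *bucketFS) WriteFile(path string, data []byte) error {
	// Copy the bytes so later changes by the caller don't alter the object.
	b.objects[key(path)] = append([]byte(nil), data...)
	return nil
}

func (b *bucketFS) ReadFile(path string) ([]byte, error) {
	data, ok := b.objects[key(path)]
	if !ok {
		return nil, errNotFound
	}
	return append([]byte(nil), data...), nil
}

// List returns every key under a prefix, sorted, which is roughly what a
// directory listing turns into on a bucket.
func (b *bucketFS) List(prefix string) []string {
	var keys []string
	for k := range b.objects {
		if strings.HasPrefix(k, key(prefix)) {
			keys = append(keys, k)
		}
	}
	sort.Strings(keys)
	return keys
}

func main() {
	var fs fileStore = newBucketFS()
	if err := fs.WriteFile("/repo.git/HEAD", []byte("ref: refs/heads/main\n")); err != nil {
		panic(err)
	}
	head, err := fs.ReadFile("/repo.git/HEAD")
	if err != nil {
		panic(err)
	}
	fmt.Printf("HEAD contains %q\n", head)
	fmt.Println("keys:", fs.List("/repo.git/"))
}
```

Anything that talks only to an interface like this cannot tell whether the bytes live on a disk or in a bucket, which is the property the idea above leans on.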

Can this Just Work™? Let's find out.

If you strip away the porcelain, a git repository is 4 basic things:

Objects, or compressed blobs of data. Most of the objects in any individual repository are files.

Trees, or objects that map to other objects. TL;DR: trees are folders.

Commits, or objects that point at one tree and at their parent commit (or commits, in the case of a merge). This lets you pin down which files belong to one logical change set.

Refs, meaning branches and tags. They are tiny mutable pointers into the pile of objects.
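The four pieces above can be sketched in a few lines of Go using only the standard library. This assumes the default SHA-1 object format, where an object's ID is the hash of a `<type> <size>` header, a NUL byte, and the content. The helper names (`hashObject`, `treeEntry`) and the author details are invented for illustration, and nothing here is how go-git implements it.

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
)

// hashObject computes an object ID the way git does for the SHA-1 object
// format: sha1("<type> <size>\x00" + content).
func hashObject(kind string, content []byte) string {
	h := sha1.New()
	fmt.Fprintf(h, "%s %d\x00", kind, len(content))
	h.Write(content)
	return hex.EncodeToString(h.Sum(nil))
}

// treeEntry encodes one row of a tree object:
// "<mode> <name>\x00" followed by the raw 20-byte object ID.
func treeEntry(mode, name, id string) ([]byte, error) {
	raw, err := hex.DecodeString(id)
	if err != nil {
		return nil, err
	}
	return append([]byte(mode+" "+name+"\x00"), raw...), nil
}

func main() {
	// 1. A blob: file contents only. The file name is not part of it.
	blobID := hashObject("blob", []byte("hello world\n"))

	// 2. A tree: gives the blob a name and a mode, like a folder listing.
	entry, err := treeEntry("100644", "hello.txt", blobID)
	if err != nil {
		panic(err)
	}
	treeID := hashObject("tree", entry)

	// 3. A commit: points at one tree. A root commit has no "parent" line.
	// The author and committer values are made-up placeholders.
	commit := "tree " + treeID + "\n" +
		"author Example <example@example.com> 0 +0000\n" +
		"committer Example <example@example.com> 0 +0000\n" +
		"\n" +
		"first commit\n"
	commitID := hashObject("commit", []byte(commit))

	// 4. A ref: a mutable name whose whole value is a commit ID.
	refs := map[string]string{"refs/heads/main": commitID}

	fmt.Println("blob:  ", blobID)
	fmt.Println("tree:  ", treeID)
	fmt.Println("commit:", commitID)
	fmt.Println("ref:    refs/heads/main ->", refs["refs/heads/main"])
}
```

The blob ID printed here should match what `git hash-object` reports for a file containing the same bytes. Only the ref is mutable: moving a branch means overwriting one small value, while the objects themselves never change once written.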

Note: Until I started working on this, I was under the impression that git stored only the patches applied on top of an empty folder and that this was how it reconstructed the history of your repository. It does not. At the object level it stores a complete copy of every version of every file, which explains why big binary blobs fudge the tooling so much. The diff mental model works fine for using git day to day; it’s just wrong at the storage layer, which is the layer this post lives in.
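A small sketch of that snapshot behaviour, assuming the loose-object layout (a zlib-compressed `blob <size>` header, a NUL byte, then the content) and the SHA-1 object format. The `store` type is a hypothetical in-memory stand-in for the objects directory. Packfiles can later delta-compress objects against each other, but that is an optimization underneath this model and is not shown.

```go
package main

import (
	"bytes"
	"compress/zlib"
	"crypto/sha1"
	"encoding/hex"
	"fmt"
	"io"
)

// store maps object IDs to zlib-compressed loose objects. It is an
// in-memory stand-in for a repository's objects directory.
type store map[string][]byte

// putBlob stores the full content as a blob and returns its object ID.
func (s store) putBlob(content []byte) (string, error) {
	raw := append([]byte(fmt.Sprintf("blob %d\x00", len(content))), content...)
	sum := sha1.Sum(raw)
	id := hex.EncodeToString(sum[:])

	var buf bytes.Buffer
	zw := zlib.NewWriter(&buf)
	if _, err := zw.Write(raw); err != nil {
		return "", err
	}
	if err := zw.Close(); err != nil {
		return "", err
	}
	s[id] = buf.Bytes()
	return id, nil
}

// getBlob decompresses an object and strips the "blob <size>\x00" header.
func (s store) getBlob(id string) ([]byte, error) {
	compressed, ok := s[id]
	if !ok {
		return nil, fmt.Errorf("object %s not found", id)
	}
	zr, err := zlib.NewReader(bytes.NewReader(compressed))
	if err != nil {
		return nil, err
	}
	defer zr.Close()
	raw, err := io.ReadAll(zr)
	if err != nil {
		return nil, err
	}
	i := bytes.IndexByte(raw, 0)
	if i < 0 {
		return nil, fmt.Errorf("object %s has no header", id)
	}
	return raw[i+1:], nil
}

func main() {
	s := store{}
	// Two versions of the same file become two complete, independent blobs.
	versions := [][]byte{
		[]byte("line one\n"),
		[]byte("line one\nline two\n"),
	}
	for _, v := range versions {
		id, err := s.putBlob(v)
		if err != nil {
			panic(err)
		}
		content, err := s.getBlob(id)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s holds all %d bytes of that version\n", id[:7], len(content))
	}
	fmt.Println("objects stored:", len(s))
}
```

Appending one line produced a second full object rather than a patch against the first. Scale that up to a large binary that changes on every commit and the storage cost becomes obvious.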
