
Why German Strings Are Everywhere


German Strings

Strings are conceptually very simple: a string is essentially just a sequence of characters, right? Why, then, does every programming language have its own slightly different string implementation? It turns out that there is a lot more to a string than “just a sequence of characters”.

We’re no different and built our own custom string type that is highly optimized for data processing. We didn’t expect this when we first described the format in our inaugural Umbra research paper, but many newer systems have since adopted it. It is now implemented in DuckDB, Apache Arrow, Polars, and Facebook Velox.

In this blog post, we’d like to tell you more about the advantages of our so-called “German Strings” and the tradeoffs we made.

But first, let’s take a step back and take a look at how strings are commonly implemented.

How C does it

In C, strings are just a sequence of bytes with the vague promise that a \0 byte will terminate the string at some point.

This is a very simple model conceptually, but very cumbersome in practice:

What if your string is not terminated? If you’re not careful, you can read beyond the intended end of the string, a huge security problem!

Just calculating the length of the string forces you to iterate over the whole thing.
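
Both problems can be made concrete with a minimal sketch in C. The function names c_string_length and bounded_length are hypothetical and chosen only for illustration. The first mirrors what strlen has to do, namely walk every byte until it finds the terminator. The second shows one common way to guard against a missing terminator, assuming the caller knows the capacity of the buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Illustrative equivalent of strlen: the only way to learn the length is to
   scan byte by byte until '\0' appears, so the cost grows linearly with the
   length of the string. If the terminator is missing, this loop reads past
   the end of the buffer, which is undefined behavior. */
size_t c_string_length(const char *s) {
    assert(s != NULL);
    size_t n = 0;
    while (s[n] != '\0') {
        n++;
    }
    return n;
}

/* Bounded variant (an assumption for this sketch, not part of the original
   text): it inspects at most `capacity` bytes. If no terminator is found
   within that range, it returns `capacity` instead of reading further. */
size_t bounded_length(const char *buf, size_t capacity) {
    assert(buf != NULL);
    const char *end = memchr(buf, '\0', capacity);
    return end != NULL ? (size_t)(end - buf) : capacity;
}
```

Calling c_string_length on a four-byte array without a terminator would read out of bounds, whereas bounded_length with the array size as capacity stops at the fourth byte. The catch is that the caller now has to carry the capacity around separately, which is exactly the information a bare C string does not store.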
