Scaling our observability platform by embracing wide events and replacing OTel

TLDR # Observability at scale: Our internal system grew from 19 PiB to 100 PB of uncompressed logs and from ~40 trillion to 500 trillion rows. Efficiency breakthrough: We absorbed a 20× surge in event volume using under 10% of the CPU previously needed. OTel pitfalls: The required parsing and marshalling of events in OpenTelemetry proved a bottleneck and didn’t scale - our custom pipeline addressed this. Introducing HyperDX: ClickHouse-native observability UI for seamless exploration, correlation, and root-cause analysis with Lucene-like syntax.

About a year ago, we shared the story of LogHouse - our internal logging platform built to monitor ClickHouse Cloud. At the time, it managed what felt like a massive 19 PiB of data. More than just solving our observability challenges, LogHouse also saved us millions by replacing an increasingly unsustainable Datadog bill. The response to that post was overwhelming. It was clear our experience resonated with others facing similar struggles with traditional observability vendors and underscored just how critical effective data management is at scale.

A year later, LogHouse has grown beyond anything we anticipated and is now storing over 100 petabytes of uncompressed data across nearly 500 trillion rows. That kind of scale forced a series of architectural changes, new tools, and hard-earned lessons that we felt were worth sharing - not least that OpenTelemetry (OTel) isn’t always the panacea of Observability (though we still love it), and that sometimes custom pipelines are essential.

In our case, this shift enabled us to handle a 20x increase in event volume using less than 10% of the CPU for our most critical data source - a transformation with massive implications for cost and efficiency.

Other parts of our stack have also changed, not least due to the ClickHouse acquisition of HyperDX. Not only did this give us a first-party ClickHouse-native UI, but it also led to the creation of ClickStack - an opinionated, end-to-end observability stack built around ClickHouse. With HyperDX, we’ve started transitioning away from our Grafana-based custom UI, moving toward a more integrated experience for exploration, correlation, and root cause analysis.

As more teams adopt ClickHouse for observability and realize just how much they can store and query affordably, we hope these insights prove as useful as our first post. If you’re curious about this journey, when and where OTel is appropriate, and how we scaled a log pipeline to 100PB…read on.

Get started today Interested in seeing how ClickHouse works on your data? Get started with ClickHouse Cloud in minutes and receive $300 in free credits. Sign up

Beyond general purpose: evolving observability at scale #

Over the past year, our approach to observability has undergone a significant transformation. We've continued to leverage OpenTelemetry to gather general-purpose logs, but as our systems have scaled, we began to reach its limits. While OTel remains a valuable part of our toolkit, it couldn't fully deliver the performance and precision we needed for our most demanding workloads. This prompted us to develop purpose-built tools tailored to our critical systems and rethink where generic solutions truly fit. Along the way, we've broadened the range of data we collect and revamped how we present insights to engineers.

A new frontier of scale #

... continue reading