Tech News
← Back to articles

Chronosphere takes on Datadog with AI that explains itself, not just outages

read original related products more articles

Chronosphere, a New York-based observability startup valued at $1.6 billion, announced Monday it will launch AI-Guided Troubleshooting capabilities designed to help engineers diagnose and fix production software failures — a problem that has intensified as artificial intelligence tools accelerate code creation while making systems harder to debug.The new features combine AI-driven analysis with what Chronosphere calls a Temporal Knowledge Graph, a continuously updated map of an organization's services, infrastructure dependencies, and system changes over time. The technology aims to address a mounting challenge in enterprise software: developers are writing code faster than ever with AI assistance, but troubleshooting remains largely manual, creating bottlenecks when applications fail."For AI to be effective in observability, it needs more than pattern recognition and summarization," said Martin Mao, Chronosphere's CEO and co-founder, in an exclusive interview with VentureBeat. "Chronosphere has spent years building the data foundation and analytical depth needed for AI to actually help engineers. With our Temporal Knowledge Graph and advanced analytics capabilities, we're giving AI the understanding it needs to make observability truly intelligent — and giving engineers the confidence to trust its guidance."The announcement comes as the observability market — software that monitors complex cloud applications— faces mounting pressure to justify escalating costs. Enterprise log data volumes have grown 250% year-over-year, according to Chronosphere's own research, while a study from MIT and the University of Pennsylvania found that generative AI has spurred a 13.5% increase in weekly code commits, signifying faster development velocity but also greater system complexity.AI writes code 13% faster, but debugging stays stubbornly manualDespite advances in automated code generation, debugging production failures remains stubbornly manual. When a major e-commerce site slows during checkout or a banking app fails to process transactions, engineers must sift through millions of data points — server logs, application traces, infrastructure metrics, recent code deployments — to identify root causes.Chronosphere's answer is what it calls AI-Guided Troubleshooting, built on four core capabilities: automated "Suggestions" that propose investigation paths backed by data; the Temporal Knowledge Graph that maps system relationships and changes; Investigation Notebooks that document each troubleshooting step for future reference; and natural language query building.Mao explained the Temporal Knowledge Graph in practical terms: "It's a living, time-aware model of your system. It stitches together telemetry—metrics, traces, logs—infrastructure context, change events like deploys and feature flags, and even human input like notes and runbooks into a single, queryable map that updates as your system evolves."This differs fundamentally from the service dependency maps offered by competitors like Datadog, Dynatrace, and Splunk, Mao argued. "It adds time, not just topology," he said. "It tracks how services and dependencies change over time and connects those changes to incidents—what changed and why. Many tools rely on standardized integrations; our graph goes a step further to normalize custom, non-standard telemetry so application-specific signals aren't a blind spot."Why Chronosphere shows its work instead of making automatic decisionsUnlike purely automated systems, Chronosphere designed its AI features to keep engineers in the driver's seat—a deliberate choice meant to address what Mao calls the "confident-but-wrong guidance" problem plaguing early AI observability tools."'Keeping engineers in control' means the AI shows its work, proposes next steps, and lets engineers verify or override — never auto-deciding behind the scenes," Mao explained. "Every Suggestion includes the evidence—timing, dependencies, error patterns — and a 'Why was this suggested?' view, so they can inspect what was checked and ruled out before acting."He walked through a concrete example: "An SLO [service level objective] alert fires on Checkout. Chronosphere immediately surfaces a ranked Suggestion: errors appear to have started in the dependent Payment service. An engineer can click Investigate to see the charts and reasoning and, if it holds up, choose to dig deeper. As they steer into Payment, the system adapts with new Suggestions scoped to that service—all from one view, no tab-hopping."In this scenario, the engineer asks "what changed?" and the system pulls in change events. "Our Notebook capability makes the causal chain plain: a feature-flag update preceded pod memory exhaustion in Payment; Checkout's spike is a downstream symptom," Mao said. "They can decide to roll back the flag. That whole path — suggestions followed, evidence viewed, conclusions—is captured automatically in an Investigation Notebook, and the outcome feeds the Temporal Knowledge Graph so similar future incidents are faster to resolve."How a $1.6 billion startup takes on Datadog, Dynatrace, and SplunkChronosphere enters an increasingly crowded field. Datadog, the publicly traded observability leader valued at over $40 billion, has introduced its own AI-powered troubleshooting features. So have Dynatrace and Splunk. All three offer comprehensive "all-in-one" platforms that promise single-pane-of-glass visibility.Mao distinguished Chronosphere's approach on technical grounds. "Early 'AI for observability' leaned heavily on pattern-spotting and summarization, which tends to break down during real incidents," he said. "These approaches often stop at correlating anomalies or producing fluent explanations without the deeper analysis and causal reasoning observability leaders need. They can feel impressive in demos but disappoint in production—they summarize signals rather than explain cause and effect."A specific technical gap, he argued, involves custom application telemetry. "Most platforms reason over standardized integrations—Kubernetes, common cloud services, popular databases—ignoring the most telling clues that live in custom app telemetry," Mao said. "With an incomplete picture, large language models will 'fill in the gaps,' producing confident-but-wrong guidance that sends teams down dead ends."Chronosphere's competitive positioning received validation in July when Gartner named it a Leader in the 2025 Magic Quadrant for Observability Platforms for the second consecutive year. The firm was recognized based on both "Completeness of Vision" and "Ability to Execute." In December 2024, Chronosphere also tied for the highest overall rating among recognized vendors in Gartner Peer Insights' "Voice of the Customer" report, scoring 4.7 out of 5 based on 70 reviews.Yet the company faces intensifying competition for high-profile customers. UBS analysts noted in July that OpenAI now runs both Datadog and Chronosphere side-by-side to monitor GPU workloads, suggesting the AI leader is evaluating alternatives. While UBS maintained its buy rating on Datadog, the analysts warned that growing Chronosphere usage could pressure Datadog's pricing power.Inside the 84% cost reduction claims—and what CIOs should actually measureBeyond technical capabilities, Chronosphere has built its market position on cost control — a critical factor as observability spending spirals. The company claims its platform reduces data volumes and associated costs by 84% on average while cutting critical incidents by up to 75%.When pressed for specific customer examples with real numbers, Mao pointed to several case studies. "Robinhood has seen a 5x improvement in reliability and a 4x improvement in Mean Time to Detection," he said. "DoorDash used Chronosphere to improve governance and standardize monitoring practices. Astronomer achieved over 85% cost reduction by shaping data on ingest, and Affirm scaled their load 10x during a Black Friday event with no issues, highlighting the platform's reliability under extreme conditions."The cost argument matters because, as Paul Nashawaty, principal analyst at CUBE Research, noted when Chronosphere launched its Logs 2.0 product in June: "Organizations are drowning in telemetry data, with over 70% of observability spend going toward storing logs that are never queried."For CIOs fatigued by "AI-powered" announcements, Mao acknowledged skepticism is warranted. "The way to cut through it is to test whether the AI shortens incidents, reduces toil, and builds reusable knowledge in your own environment, not in a demo," he advised. He recommended CIOs evaluate three factors: transparency and control (does the system show its reasoning?), coverage of custom telemetry (can it handle non-standardized data?), and manual toil avoided (how many ad-hoc queries and tool-switches are eliminated?).Why Chronosphere partners with five vendors instead of building everything itselfAlongside the AI troubleshooting announcement, Chronosphere revealed a new Partner Program integrating five specialized vendors to fill gaps in its platform: Arize for large language model monitoring, Embrace for real user monitoring, Polar Signals for continuous profiling, Checkly for synthetic monitoring, and Rootly for incident management.The strategy represents a deliberate bet against the all-in-one platforms dominating the market. "While an all-in-one platform may be sufficient for smaller organizations, global enterprises demand best-in-class depth across each domain," Mao said. "This is what drove us to build our Partner Program and invest in seamless integrations with leading providers—so our customers can operate with confidence and clarity at every layer of observability."Noah Smolen, head of partnerships at Arize, said the collaboration addresses a specific enterprise need. "With a wide array of Fortune 500 customers, we understand the high bar needed to ensure AI agent systems are ready to deploy and stay incident-free, especially given the pace of AI adoption in the enterprise," Smolen said. "Our partnership with Chronosphere comes at a time when an integrated purpose-built cloud-native and AI-observability suite solves a huge pain point for forward-thinking C-suite leaders who demand the very best across their entire observability stack."Similarly, JJ Tang, CEO and founder of Rootly, emphasized the incident resolution benefits. "Incidents hinder innovation and revenue, and the challenge lies in sifting through vast amounts of observability data, mobilizing teams, and resolving issues quickly," Tang said. "Integrating Chronosphere with Rootly allows engineers to collaborate with context and resolve issues faster within their existing communication channels, drastically reducing time to resolution and ultimately improving reliability—78% plus decreases in repeat Sev0 and Sev1 incidents."When asked how total costs compare when customers use multiple partner contracts versus a single platform, Mao acknowledged the current complexity. "At present, mutual customers typically maintain separate contracts unless they engage through a services partner or system integrator," he said. However, he argued the economics still favor the composable approach: "Our combined technologies deliver exceptional value—in most circumstances at just a fraction of the price of a single-platform solution. Beyond the savings, customers gain a richer, more unified observability experience that unlocks deeper insights and greater efficiency, especially for large-scale environments."The company plans to streamline this over time. "As the ISV program matures, we're focused on delivering a more streamlined experience by transitioning to a single, unified contract that simplifies procurement and accelerates time to value," Mao said.How two Uber engineers turned Halloween outages into a billion-dollar startupChronosphere's origins trace to 2019, when Mao and co-founder Rob Skillington left Uber after building the ride-hailing giant's internal observability platform. At Uber, Mao's team had faced a crisis: the company's in-house tools would fail on its two busiest nights — Halloween and New Year's Eve — cutting off visibility into whether customers could request rides or drivers could locate passengers.The solution they built at Uber used open-source software and ultimately allowed the company to operate without outages, even during high-volume events. But the broader market insight came at an industry conference in December 2018, when major cloud providers threw their weight behind Kubernetes, Google's container orchestration technology."This meant that most technology architectures were eventually going to look like Uber's," Mao recalled in an August 2024 profile by Greylock Partners, Chronosphere's lead investor. "And that meant every company, not just a few big tech companies and the Walmarts of the world, would have the exact same problem we had solved at Uber."Chronosphere has since raised more than $343 million in funding across multiple rounds led by Greylock, Lux Capital, General Atlantic, Addition, and Founders Fund. The company operates as a remote-first organization with offices in New York, Austin, Boston, San Francisco, and Seattle, employing approximately 299 people according to LinkedIn data.The company's customer base includes DoorDash, Zillow, Snap, Robinhood, and Affirm — predominantly high-growth technology companies operating cloud-native, Kubernetes-based infrastructures at massive scale.What's available now—and what enterprises can expect in 2026Chronosphere's AI-Guided Troubleshooting capabilities, including Suggestions and Investigation Notebooks, entered limited availability Monday with select customers. The company plans full general availability in 2026. The Model Context Protocol (MCP) Server, which enables engineers to integrate Chronosphere directly into internal AI workflows and query observability data through AI-enabled development environments, is available immediately for all Chronosphere customers.The phased rollout reflects the company's cautious approach to deploying AI in production environments where mistakes carry real costs. By gathering feedback from early adopters before broad release, Chronosphere aims to refine its guidance algorithms and validate that its suggestions genuinely accelerate troubleshooting rather than simply generating impressive demonstrations.The longer game, however, extends beyond individual product features. Chronosphere's dual bet — on transparent AI that shows its reasoning and on a partner ecosystem rather than all-in-one integration — amounts to a fundamental thesis about how enterprise observability will evolve as systems grow more complex.If that thesis proves correct, the company that solves observability for the AI age won't be the one with the most automated black box. It will be the one that earns engineers' trust by explaining what it knows, admitting what it doesn't, and letting humans make the final call. In an industry drowning in data and promised silver bullets, Chronosphere is wagering that showing your work still matters — even when AI is doing the math.