Introduction
In the last few years, the field of time-series forecasting has seen a fundamental shift. Where we once depended solely on classic statistical methods (think ARIMA, SARIMA, and Prophet), new “foundation” models have emerged, promising to bring the power and flexibility of large language models (LLMs) into the world of time-series data. The allure is obvious: can we build a single, reusable forecasting model that works across a variety of datasets and domains, instead of painstakingly training a new model for every scenario?
Parseable is built to handle our users’ observability data at any scale: a nonstop stream of raw ingest counts, infrastructure vitals, and fine-grained application signals. Running a separate, hand-tuned forecasting model for every slice quickly turns into a treadmill: each new stream or workload change demands fresh hyperparameters, retraining runs, and ever-growing configuration sprawl. All that manual churn slows forecasting down and lets models drift, so the results never feel fully trustworthy.
Then came the rise of foundation models, which revolutionized natural language processing by offering strong zero-shot and transfer-learning capabilities. Researchers began asking a natural question: if LLMs can generalize to new tasks with minimal retraining, could similar techniques be applied to time-series data? What if you could just hand any telemetry stream to a pre-trained foundation model and immediately get a high-quality forecast, regardless of whether the model had seen data from that source before?
Motivated by this possibility, we set out to benchmark a new generation of time-series foundation models: Amazon Chronos, Google TimesFM, IBM Tiny Time-Mixers, and Datadog Toto. Our goal was to assess how well these models perform on two representative tasks: a relatively straightforward forecasting problem (predicting ingestion volumes) and a more complex multivariate problem (forecasting multiple pod-level metrics). Along the way, we compared them against classical baselines and noted both practical and technical trade-offs.
This post details our methodology, the challenges we encountered, how we evaluated the models, and what we learned from putting foundation models to the test on real-world observability data.
Why Foundation Models?
The idea of “foundation models” has fundamentally changed how we approach complex machine learning problems. In natural language processing, models like GPT have shown that a single, large model trained on vast and diverse datasets can generalize well to entirely new tasks, sometimes even without fine-tuning. This zero-shot capability means a single model can perform sentiment analysis, summarization, translation, or question answering, just by changing the prompt.
In the world of time-series forecasting, the appeal of such flexibility is obvious, especially for modern data engineering and observability platforms. Traditionally, every new data stream, whether it’s CPU utilization, request rates, or disk I/O, required its own model, hyperparameter tuning, and regular retraining. For an SRE or platform engineer, this quickly becomes unmanageable as the number of streams explodes. If a pipeline ingests data from a hundred microservices, does every service metric really need its own hand-tuned ARIMA or Prophet model? Until recently, the answer was “yes.”
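To make that maintenance burden concrete, here is a minimal sketch of the per-stream workflow using statsmodels’ ARIMA. The stream names, the (p, d, q) orders, and the synthetic data loader are all hypothetical stand-ins; the point is that every entry is one more model to tune, fit, and keep retrained.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Hypothetical metric streams, each with its own hand-picked (p, d, q) order.
streams = {
    "checkout.cpu_utilization": (2, 1, 2),
    "checkout.request_rate":    (1, 1, 1),
    "payments.disk_io":         (3, 1, 0),
}

def load_series(name: str) -> pd.Series:
    # Stand-in for pulling a metric stream from the platform; synthetic data here.
    rng = np.random.default_rng(42)
    return pd.Series(rng.normal(100, 10, size=500).cumsum())

def forecast_stream(series: pd.Series, order: tuple, horizon: int = 24) -> pd.Series:
    # Fit one ARIMA model for one stream and forecast the next `horizon` points.
    return ARIMA(series, order=order).fit().forecast(steps=horizon)

# One model per stream: every new microservice metric means another entry,
# another tuning pass, and another retraining job to keep on schedule.
forecasts = {name: forecast_stream(load_series(name), order)
             for name, order in streams.items()}
```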
Foundation models for time series are built to change that. The core motivation is scalability and adaptability: train a single, large model (often with hundreds of millions of parameters or more) on a wide range of time-series datasets and let it learn the underlying “language” of temporal data. Once trained, this model should ideally handle a completely new telemetry stream, even if it has never seen data of that exact shape or domain before. In theory, you could feed in any new time series, whether it’s network packet counts, database query durations, or energy consumption readings, and get a high-quality forecast without retraining.
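For contrast, here is roughly what that zero-shot path looks like with one of the models we benchmark below, Amazon Chronos, via the chronos-forecasting package. The checkpoint name, horizon, and synthetic history are illustrative; the series stands in for a telemetry stream the model has never seen, and no fitting step happens at all.

```python
import numpy as np
import torch
from chronos import ChronosPipeline  # pip install chronos-forecasting

# Load a pre-trained checkpoint once; there is no per-stream training or tuning.
pipeline = ChronosPipeline.from_pretrained(
    "amazon/chronos-t5-small",
    device_map="cpu",
    torch_dtype=torch.float32,
)

# Any 1-D history works as context; a synthetic ingest-volume series stands in here.
history = np.sin(np.linspace(0, 20, 512)) * 50 + 100
context = torch.tensor(history, dtype=torch.float32)

# Zero-shot forecast: sample paths of shape [num_series, num_samples, prediction_length].
forecast = pipeline.predict(context, prediction_length=24)
low, median, high = np.quantile(forecast[0].numpy(), [0.1, 0.5, 0.9], axis=0)
```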