How we made WINDOW JOIN parallel and vectorized

QuestDB is the open-source time-series database for demanding workloads—from trading floors to mission control. It delivers ultra-low latency, high ingestion throughput, and a multi-tier storage engine. Native support for Parquet and SQL keeps your data portable, AI-ready—no vendor lock-in.

Consider a workload that comes up constantly on a trading desk: for every executed trade, attach the average bid and ask within a 1-second window around the trade. Without a dedicated operator it takes two joins, an ASOF JOIN for the carry-forward quote at the window start plus a range join for the rows inside the window, stitched with UNION ALL and folded with a GROUP BY:

-- QuestDB timestamps are microseconds, so 1_000_000 is 1 second. WITH prevailing AS ( -- ASOF-match against the window start (trade timestamp - 1 s), -- not the trade timestamp itself. SELECT t.orig_ts ts, t.symbol, p.bid, p.ask FROM ( (SELECT (timestamp - 1000000) AS ts, symbol, timestamp AS orig_ts FROM trades) TIMESTAMP(ts) ) t ASOF JOIN prices p ON p.sym = t.symbol ), in_window AS ( SELECT t.timestamp ts, t.symbol, p.bid, p.ask FROM trades t JOIN prices p ON p.sym = t.symbol WHERE p.ts > t.timestamp - 1000000 AND p.ts <= t.timestamp + 1000000 ) SELECT ts, symbol, avg(bid) avg_bid, avg(ask) avg_ask FROM (SELECT * FROM prevailing UNION ALL SELECT * FROM in_window) GROUP BY ts, symbol;

This works, but it's a lot of SQL for a simple operation. The ASOF JOIN and the range JOIN walk the prices table independently even though they are answering two halves of the same question, and the range JOIN forces the planner to hash on sym and then re-filter every matched pair against the BETWEEN predicate. The outer GROUP BY over ts is a hash aggregation that has to materialize a row per (ts, symbol) pair, which works out to 50 million groups in our test data. There is nothing here for the optimizer to fuse, parallelize cleanly, or vectorize.

WINDOW JOIN is QuestDB's dedicated syntax for aggregating one table over a time window around each row of another. The same query, dedicated operator:

SELECT t.*, avg(p.bid) avg_bid, avg(p.ask) avg_ask FROM trades t WINDOW JOIN prices p ON p.sym = t.symbol RANGE BETWEEN 1 second PRECEDING AND 1 second FOLLOWING;

Now the operator knows what it is doing: for every row on the left-hand side of the join (LHS - trades here), find rows on the right-hand side (RHS - prices ) whose timestamp falls inside a [lo, hi] window around the LHS timestamp, restrict to matching symbol keys, and reduce them with a batch of aggregate functions.

Making that fast comes down to two pieces: data-level parallelism over the LHS, plus a low-cardinality fast path that copies values into contiguous buffers so the SIMD aggregation kernels we already ship for SAMPLE BY run on window slices unchanged. Benchmarked against Timescale, DuckDB, and ClickHouse on a 50M-row trades table joined to a 150M-row prices table, the parallel + SIMD path runs 5.0x faster than QuestDB's own single-threaded fallback and 25x faster than ClickHouse's best rewrite.

QuestDB stores data in append-only column files, partitioned by time. The query engine reads them as a sequence of page frames: contiguous, columnar slabs of memory that map directly onto file pages. Filtering and aggregation both work at this granularity: a page frame is the unit of dispatch to a worker thread.

WINDOW JOIN follows the same model. The LHS table is sliced into page frames; each worker takes a frame and is responsible for producing the aggregate result for every LHS row in that frame. To do that it needs the RHS rows that fall inside the union of all windows the frame covers.

... continue reading