Tech News
← Back to articles

Why Semantic Layers Matter (and how to build one with DuckDB)

read original related products more articles

Many ask themselves, "Why would I use a semantic layer? What is it anyway?" In this hands-on guide, we’ll build the simplest possible semantic layer using just a YAML file and a Python script—not as the goal itself, but as a way to understand the value of semantic layers. We’ll then query 20 million NYC taxi records with consistent business metrics executed using DuckDB and Ibis. By the end, you’ll know exactly when a semantic layer solves real problems and when it’s overkill.

It's a topic that I'm passionate about as I've been using semantic layers within a Business Intelligence (BI) tool for over twenty years, and only recently have we gotten full-blown semantic layers that can sit outside of a BI tool, combining the advantages of a logical layer with sharing them across your web apps, notebooks, and BI tools. With a semantic layer, your revenue KPI or other complex company measures are defined once in a single source of truth—no need to re-implement them over and over again.

We'll have a look at the simplest possible semantic layer, which uses a simple YAML file (for the semantics) and a Python script for executing it with Ibis and DuckDB. We'll do a quick recap of the semantic layer before diving into a practical code example.

NOTE : Find the Code on GitHub For all the examples shown in this article, find the code on the GitHub repository For all the examples shown in this article, find the code on the GitHub repository semantic-layer-duckdb

When You Don't Need a Semantic Layer Let's start by exploring when you don't need a semantic layer and when it's the wrong choice. The simplest and most straightforward reasons are: You're just getting started with analytics and only have one consumer, meaning you only have one way of showcasing analytics data, for example, a BI tool, notebooks, or a web app, but not multiple ways of presenting data. This means you don't apply calculated logic in different places.

You don't have extensive business logic that you query ad hoc; you have simple counts, SUMs, or averages.

You preprocess all your metrics as SQL transformations into physical tables, meaning your downstream analytics tools get all metrics preprocessed and aggregated, and filtering is fast enough.

Why Use a Semantic Layer? So when do we actually need one, and what is it? There's a lot of information out there, including from myself about the history and rise [2022], comparing it to an MVC-like approach, or explaining its capabilities. That's why in this article I focus on the why and showcase how to use it in a practical example in the next chapter. The main reasons for using a semantic layer may be one or more of the following needs: Unified place to define ad hoc queries once, version-controlled and collaboratively, with the possibility of pulling them into different BI tools, web apps, notebooks, or AI/MCP integration. Avoid duplication of metrics in every tool, making maintainability and data governance much easier; resulting in a consistent business layer with encapsulated business logic. Example: Most organizations quickly run multiple BI tools simultaneously with additional Excel or Google Sheets. Instead of maintaining separate calculated fields and business logic in each tool in a proprietary format, semantic layers provide one definition that works across all platforms. Caching is needed for ad hoc queries that are based on various source databases. Defining the metrics that enable pre-calculations for sub-second query responses can benefit any downstream analytics tools compared to implementing custom database connections and different databases. Eliminating potential data movement costs by querying data where it lives, using dialect-optimized SQL pushdown across heterogeneous sources. This reduces infrastructure overhead and cloud computing costs. Example: For a non-production or high-load OLTP source, the semantic layer can directly query the various data sources (e.g., IoT data, logs, and other data) instead of moving them into a data lake or data warehouse, and through the cache of the semantic layer, it's fast enough without data movement. Unified access-level security through various APIs (REST, GraphQL, SQL, ODBC/JDBC, MDX/Excel) as well. Unified Analytics API enables self-serve BI by allowing users to connect Excel to a cleaned, fast, and unified API. Example: Centralized row-level and column-level security that works consistently across all downstream analytics tools, rather than trying to manage access controls separately in each BI tool or analytics tool that has access to the data. Users can connect directly with Excel and have the correct permissions and calculated business metrics out of the box. Dynamic query rewriting automatically translates simple, business-friendly queries into complex, optimized SQL across multiple databases. This enables users to write intuitive queries using business concepts (like "average_order_value") without needing to know the underlying data model complexity, table relationships, or database-specific syntax. The semantic layer abstracts complex analytics, such as ratios at different grains, time ranges (YoY, trailing periods), and custom calendars, into simple semantic queries. Example: Complex analytics simplified by handling sophisticated calculations that are painful in raw SQL: ratios at different grains (like per-member-per-month in insurance), time intelligence (year-over-date, trailing 12 months, period-over-period), and custom calendar logic. These become simple semantic queries rather than complex subqueries with distinct counts. Context for LLMs to improve accuracy and natural language querying can be significantly enhanced with a semantic layer, which provides business context and prevents AI from hallucinating frequently, as most of the business logic is configured and defined in a semantic layer, sometimes even data models, to help LLMs further understand the business. Example: Internal Large Language Models (LLMs) or Retrieval-Augmented Generation (RAG) systems need business context to understand the business. A semantic layer's connection of dimensions and facts, along with metric definitions, can help the model understand and suggest better SQL queries or responses through natural language. More broadly, semantic layers bridge the gap between business needs and data source integration in a very organized and governed way. They are best optimized for larger enterprises with numerous scattered KPIs that can afford to add another layer to their data stack. However, the example below uses the simplest and smallest semantic layer, even with little data. EXAMPLE : If you want to know more Brian Bickell gave a great talk at the Practical Data Community about semantic layers and the problem they solve. I highly recommend checking that out too at Brian Bickell gave a great talk at the Practical Data Community about semantic layers and the problem they solve. I highly recommend checking that out too at Semantic Layer Deep Dive . If you're already on the Practical Data show from Joe Reis, also check out Hamilton Ulmer's presentation about Instant SQL with DuckDB/MotherDuck , not entirely about semantic layers, but related to the history of SQL and CTEs and how instant SQL can help. Datasets vs. Aggregations An important distinction is whether we need persistent datasets or we want ad hoc queries. These are typically very different. Ad hoc queries must be flexible and change granularity based on added dimensions. This means someone running a query might switch from a daily view to a weekly or monthly one, add a region, and then decide to roll it up to a country level; all of this can happen in a couple of seconds. Therefore, there is no time to refresh or process the data. Calculated measures need to be added on the fly, without requiring an ETL job to be reprocessed. A common workaround is to create multiple persistent physical datasets with dbt, each containing the same data but with varying granularity, allowing for the display of different charts in the BI tool with different focuses. A semantic layer, or ad hoc queries, does that on the fly. We can differentiate and say: dataset ≠ aggregations

table columns ≠ metrics

physical table ≠ logical definition If you find yourself needing the concepts on the right side, that's when you need a semantic layer—whether built into a BI tool or implemented separately for the reasons mentioned above.

... continue reading