Representing Python notebooks as dataflow graphs

This post is adapted from our talk at PyCon 2025. marimo is free and open source, available on GitHub. For a free online experience with link sharing, try molab.

marimo is a new kind of open-source Python notebook. While traditional notebooks are just REPLs, marimo notebooks are Python programs represented as dataflow graphs. This intermediate representation lets marimo blend the best parts of interactive computing with the reproducibility and reusability of Python software. Every marimo notebook works as three things: a reactive notebook for Python (and SQL) that keeps code and outputs in sync (run a cell and marimo knows which other cells to run); an executable script and a Python module; and an interactive web app.
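To make the idea of "a notebook as a dataflow graph" concrete, here is a minimal sketch, not marimo's actual implementation. It assumes each cell is a string of Python source, treats every assigned name as a definition and every other name as a reference, and ignores scoping details such as function bodies and imports. The helper names (`defs_and_refs`, `build_graph`, `stale_cells`) and the example cells are made up for illustration.

```python
import ast
from collections import deque


def defs_and_refs(code: str) -> tuple[set[str], set[str]]:
    """Return (names the cell assigns, names it reads but does not assign)."""
    defs, refs = set(), set()
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                defs.add(node.id)
            else:
                refs.add(node.id)
    return defs, refs - defs


def build_graph(cells: dict[str, str]) -> dict[str, set[str]]:
    """Map each cell to the cells that read a variable it defines."""
    info = {name: defs_and_refs(code) for name, code in cells.items()}
    graph = {name: set() for name in cells}
    for producer, (defs, _) in info.items():
        for consumer, (_, refs) in info.items():
            if producer != consumer and defs & refs:
                graph[producer].add(consumer)
    return graph


def stale_cells(graph: dict[str, set[str]], changed: str) -> set[str]:
    """Return every cell downstream of `changed` (its descendants)."""
    stale = set()
    queue = deque(graph[changed])
    while queue:
        cell = queue.popleft()
        if cell not in stale:
            stale.add(cell)
            queue.extend(graph[cell])
    return stale


# Hypothetical notebook: three dependent cells and one independent cell.
cells = {
    "load": "data = [1, 2, 3]",
    "scale": "scaled = [2 * x for x in data]",
    "total": "result = sum(scaled)",
    "note": "message = 'hello'",
}
graph = build_graph(cells)
rerun_after_load = stale_cells(graph, "load")
rerun_after_note = stale_cells(graph, "note")
```

In this sketch, changing `load` marks `scale` and `total` as stale, while changing `note` affects nothing else. That is the sense in which a dataflow graph lets a notebook know which other cells to run.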

In this blog post, we motivate the need for a new kind of notebook, explain how and why marimo represents notebooks as dataflow graphs, present design decisions we made to help users adapt to dataflow programming, discuss our implementation, and show several examples of how marimo uses its dataflow graph to make data work reactive, interactive, reproducible, and reusable.

Motivating a new kind of notebook

AI and data work are different from software engineering. When you work with data, it’s helpful to hold objects in memory, iteratively transforming and visualizing them as you evaluate datasets, models, and algorithms. Notebooks are the only programming environment that enables this workflow. For this reason, the number of GitHub repos with Jupyter notebooks more than doubled in 2024, alongside the rise of AI.

Interactive computing is crucial for evaluating data and algorithms. These GIFs visualize projected LBFGS applied to different embedding formulations on the MNIST dataset. I made them in Jupyter notebooks, which I used extensively during my PhD in machine learning. I also ran into issues with reproducibility, interactivity, maintainability, and reusability, which motivated me to start work on marimo.

Nonetheless, traditional notebooks such as Google Colab or Jupyter notebooks suffer from problems with reproducibility, interactivity, maintainability, and reusability. These problems make them ill-suited to modern AI and data work and create the need for a new kind of notebook.

Reproducibility

Traditional notebooks have a reproducibility crisis. In 2019, a study from New York University and Federal Fluminense University found that of the nearly 1 million Jupyter notebooks on GitHub with valid execution orders, only 24% could be re-run, and just 4% reproduced the same results. A similar study from 2020 by JetBrains found that over a third of the notebooks on GitHub had invalid execution histories.

Traditional notebooks accumulate hidden state: run or delete a cell, and program memory is imperatively mutated without regard to the rest of the code on the page. This allows code and outputs to fall out of sync. Hidden state is a feature in a REPL, but it gets in the way when you’re trying to do work you’d like to reproduce.
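The hidden-state problem can be reproduced without a notebook at all. The sketch below is a simplified stand-in for a kernel, not how Jupyter is implemented: a plain dictionary plays the role of program memory, and `exec` plays the role of running a cell. The cell contents and variable names are hypothetical.

```python
# A dict standing in for the kernel's memory (an assumption for illustration).
namespace = {}

cell_1 = "x = 1"
cell_2 = "y = x + 1"

# Run both cells top to bottom: x == 1, y == 2.
exec(cell_1, namespace)
exec(cell_2, namespace)

# Edit cell 1 and rerun only that cell. Cell 2 is not rerun,
# so y still holds the value computed from the old x.
cell_1 = "x = 10"
exec(cell_1, namespace)
stale_y = namespace["y"]  # still 2, though the code on the page implies 11

# Now imagine cell 1 is deleted from the page. Its variable survives
# in memory, so rerunning cell 2 keeps working in this session.
exec(cell_2, namespace)
session_y = namespace["y"]

# In a fresh session containing only cell 2, the same code fails.
fresh_namespace = {}
try:
    exec(cell_2, fresh_namespace)
    reproducible = True
except NameError:
    reproducible = False
```

After the edit, the page shows `x = 10` next to a `y` computed from `x = 1`. After the deletion, the notebook runs in the live session but raises `NameError` on restart. Both outcomes come from memory being mutated independently of the code on the page.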
