A few months ago, I wanted to make my copy of Gelman’s Bayesian Data Analysis searchable for use in a statistics-focused agent.
There are some pretty sophisticated OCR tools out there, but they tend to have usage limits or get expensive when you’re processing thousands of pages. DeepSeek recently released an open OCR model that handles mathematical notation well, and I figured I could run it myself if I had access to a GPU. Sadly, my daily driver is a decade-old Titan Xp, which no longer supports the latest PyTorch versions and thus can’t run DeepSeek OCR.
I ended up using Modal for this.
What is Modal?
Modal is a serverless compute platform that lets you run Python code on cloud infrastructure without managing servers. The killer feature for machine learning work is that you can define a container image, attach a GPU, and pay only for the seconds your code is actually running.
```python
import modal

image = modal.Image.from_registry(
    "nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04",
    add_python="3.11",
).pip_install("torch", "transformers", ...)

app = modal.App("my-gpu-app")

@app.function(image=image, gpu="A100")
def process_something():
    # This runs on an A100 with all your deps installed
    pass
```
The decorator pattern is what makes Modal pleasant to use. You write normal Python, sprinkle decorators on the functions that need special hardware, and Modal handles the rest: building the container, provisioning the GPU, routing your requests. For OCR, this is perfect.
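To make the call flow concrete, here is a minimal sketch of how a decorated function gets invoked. The function body is a stand-in, but `@app.local_entrypoint()` and `.remote()` are Modal's standard mechanism: the entrypoint runs on your machine, and `.remote()` ships the call to the cloud container.

```python
import modal

app = modal.App("my-gpu-app")

@app.function(gpu="A100")
def process_something():
    # Placeholder body; in the real script this would load the
    # model and run inference on the GPU.
    return "done"

@app.local_entrypoint()
def main():
    # .remote() executes the function on Modal's infrastructure;
    # .local() would run it in the current process instead.
    result = process_something.remote()
    print(result)
```

Running `modal run my_script.py` builds the image (if needed), provisions the GPU, executes `main`, and tears everything down when it returns.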
The OCR script
The core idea is simple: deploy a FastAPI server on Modal that accepts images and returns markdown text. Let’s walk through the important pieces.
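Before looking at the server, it helps to see the shape of a request. This is only a sketch of one plausible wire format; the field name `image_b64` is a made-up example, not the article's actual schema. Images are binary, so a common approach is to base64-encode them into the JSON body and decode on the server before running OCR.

```python
import base64
import json

def build_ocr_payload(image_bytes: bytes) -> str:
    # Encode the page image so it can travel inside a JSON body.
    # The "image_b64" field name is a hypothetical example.
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return json.dumps({"image_b64": encoded})

def decode_ocr_payload(payload: str) -> bytes:
    # Server-side inverse: recover the raw bytes before OCR.
    body = json.loads(payload)
    return base64.b64decode(body["image_b64"])
```

A round trip through these two helpers returns the original bytes unchanged, which is exactly the property the endpoint relies on.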
Defining the container image