Moving beyond Software 3.0's generate-and-verify loop, AI Functions execute LLM-generated code at runtime, return native Python objects, and use automated post-conditions for continuous verification. This is Software 3.1: where AI doesn't just write code—it runs it.
Software 3.1? - AI Functions
Andrej Karpathy has a version numbering scheme for how software gets written. Software 1.0 is code written by humans. Software 2.0 is neural network weights learned through optimization. Software 3.0 is prompting LLMs in plain language, which sounds nicer than calling it vibe coding (also, fun fact, a term Karpathy coined).
Of course, Software 3.0 is real. Millions of people are using it daily. Tools like Kiro, Cursor, Claude Code, and ChatGPT let you describe what you want and get code back. Karpathy emphasizes a ‘generation–verification loop’ in partial-autonomy tools: the model generates changes, a human verifies them, and the work iterates.
But there’s something more fundamental going on than who reviews what. Look at what the LLM actually produces in Software 3.0: text. Code as strings. JSON payloads. Markdown documents. The model generates, you receive text, and then you do everything else – integrate it into your codebase, write tests, run CI, deploy. If you’re disciplined about verification, you write test cases, but those run before deployment. Once the code ships, the tests don’t execute again. The LLM’s involvement ends when it hands you the output. Your running software has no relationship with the model that helped write it.
Now consider a different arrangement. The LLM generates code that actually runs inside your application – at call time, every time the function is invoked. It returns native Python objects – DataFrames, Pydantic models, database connections – not JSON strings you have to parse. And verification isn’t a gate you pass before deployment; it’s post-conditions that execute on every call, feeding failures back to the model for automatic retries. This changes three things at once: where AI fits in your software (runtime, not just development time), what it produces (live objects you can call methods on, not serialized text), and how you trust it (continuous automated verification, not one-time human review).
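The runtime loop described above can be sketched in a few lines of plain Python. This is an illustrative sketch, not code from the AI Functions project: `run_with_verification`, the stubbed generator, and the post-condition list are all hypothetical names invented here to show the generate, execute, verify, retry-with-feedback cycle.

```python
def run_with_verification(generate, postconditions, max_retries=3):
    """Generate code, execute it in-process, check post-conditions; retry on failure."""
    feedback = None
    for _ in range(max_retries):
        source = generate(feedback)     # in a real system, an LLM call
        namespace = {}
        exec(source, namespace)         # generated code runs inside this process
        result = namespace["result"]    # a native Python object, not serialized text
        failures = [msg for check, msg in postconditions if not check(result)]
        if not failures:
            return result
        feedback = "; ".join(failures)  # errors flow back to the model as context
    raise RuntimeError(f"verification failed after {max_retries} tries: {feedback}")

# Stand-in "model": the first attempt is wrong, the second corrects it.
attempts = iter(["result = -42", "result = 42"])
def fake_generate(feedback):
    return next(attempts)

checks = [(lambda r: r > 0, "result must be positive")]
print(run_with_verification(fake_generate, checks))  # → 42
```

The key point the sketch makes concrete: verification is not a one-time gate but part of the call path, and a failed check becomes input to the next generation attempt.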
That’s the experiment at the heart of AI Functions, a new project from Strands Labs built on the Strands Agents SDK. You write a Python function with a natural language specification instead of implementation code. You attach post-conditions – plain Python assertions that define what correct output looks like. When the function is called, the LLM generates code, executes it in your Python process, returns the result as a native object, and the post-conditions verify it. If verification fails, the system retries with the error as feedback. The human never inspects the generated code. The post-conditions do the inspecting – every time.
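To make the shape of this interface concrete, here is a hypothetical sketch of what a spec-plus-post-conditions function could look like. The `ai_function` decorator, `generate_impl`, and every other name below are illustrative assumptions, not the actual AI Functions API from Strands Labs; the model is replaced by a stub that returns source text.

```python
import functools

def ai_function(spec, postconditions):
    """Replace the decorated stub's body with model-generated code at call time."""
    def decorator(stub):
        @functools.wraps(stub)
        def wrapper(*args, **kwargs):
            source = generate_impl(spec)          # an LLM call in a real system
            ns = {}
            exec(source, ns)                      # generated code runs in-process
            result = ns["impl"](*args, **kwargs)  # returns a native Python object
            for check, message in postconditions: # verified on every invocation
                assert check(result), message
            return result
        return wrapper
    return decorator

# Stand-in for the model: hands back a plausible implementation as source text.
def generate_impl(spec):
    return "def impl(xs):\n    return sorted(xs)"

@ai_function(
    spec="Return the numbers sorted in ascending order.",
    postconditions=[(lambda r: list(r) == sorted(r), "output must be sorted")],
)
def sort_numbers(xs):
    ...  # no implementation: the spec above is the program

print(sort_numbers([3, 1, 2]))  # → [1, 2, 3]
```

Note what the caller never sees: the generated source. The post-conditions are the only contract, and they run on every call, not once at review time.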
If Software 3.0 is “human prompts, LLM generates, human verifies,” then I propose that AI Functions are Software 3.1: human specifies, LLM generates and executes, machine verifies – at runtime. Same paradigm – natural language as the programming interface. But the execution model is different. The LLM isn’t producing text for a human to integrate. It’s producing code that runs, returning objects your application uses directly, verified by post-conditions on every call. Software 3.1 is a “point release,” not a major version bump. The upgrade is in what happens after generation.
This post is a deep dive into what AI Functions are, how they work, and what automated verification makes possible.
What AI Functions Are