Note 1: On Tower of Hanoi Solutions and their Complexity. I chose the Tower of Hanoi puzzle (Lucas, 1883) because of its almost mythical status in the computer science and discrete mathematics communities. It’s a staple of AI education and typically a CS undergraduate’s first encounter with elegant doubly recursive algorithms (a minimal sketch follows this note). And I chose the search algorithms mentioned in Section 1 because they constitute the core of the “state space search” paradigm in most AI textbooks (e.g., Chapters 3 and 4 in AIMA). Given our focus on AI-assisted code development, these two choices go hand in hand, and are a great way to erect the echo chamber I alluded to in the introduction.

However, well-informed readers may note that optimal closed-form solutions exist for the Tower of Hanoi and some of its variants (Bode & Hinz, 1999; Hinz et al., 2016). One need not always brute-force the state space of the problem (which has size 3^n for n disks and 3 pegs). This includes our variant (a), with arbitrary start and end configurations, which in the terminology of Hinz (1992) is a “p2”-type problem. For such problems, one geodesic (sometimes two) in the n-th order Sierpiński triangle graph associated with the problem (Hinz & Schief, 1990) can be constructed by a deterministic optimal algorithm (Romik, 2006). In search parlance, this algorithm has branching factor 1 and takes time linear in the number of moves, though the total runtime may still be exponential, since the graph diameter is 2^n - 1. Research also exists on the asymptotic average shortest-path length between random configurations (Chan, 1988; Hinz & Schief, 1990), which is exactly the measure our solver approximates with the “-r N -p K” switches, for large K. Still, our variant (b), where multiple disks can be lifted at once, does not enlarge the search space but makes it more densely connected: shortest paths get shorter on average, and the solutions built by these traditional deterministic algorithms are no longer guaranteed to be optimal. I know of no closed-form optimal solution for this variant, nor for the even more complex a+b combination our solver is built to address. So even in the ancient town of Hanoi there seems to be actual work to be done by general-purpose search algorithms, after all!
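For concreteness, here is the classic doubly recursive solution sketched in Python, the language targeted by this vibe coding experiment. It covers only the textbook case (all n disks start on one peg and move to another, one disk at a time) and emits exactly 2^n - 1 moves; the function name and the move representation are mine, for illustration only, and have nothing to do with our solver:

    def hanoi(n, source, target, spare, moves):
        """Move n disks from source to target via the classic recursion.

        Emits exactly 2**n - 1 moves, which is optimal for the standard
        puzzle (all disks on one peg, one disk lifted at a time).
        """
        if n == 0:
            return
        hanoi(n - 1, source, spare, target, moves)  # clear the n-1 smaller disks
        moves.append((source, target))              # move the largest disk
        hanoi(n - 1, spare, target, source, moves)  # restack the smaller disks

    moves = []
    hanoi(3, "A", "C", "B", moves)
    print(len(moves), moves)  # 7 moves: 2**3 - 1

The elegance lies in the double recursive call: the problem of size n reduces to two problems of size n-1 plus one move, which is also where the 2^n - 1 total comes from.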
Note 2: On LLM Limitations for Combinatorial Problem Solving. Given how powerful these AI assistants are, one may wonder: Couldn’t we use them directly to solve these Hanoi puzzles? Why write a solver as an intermediate step? The easy answer is that the emphasis here is on studying AI/human interaction in coding tasks, independently of whether the AI assistant can or can’t address the same problems as the software being developed. A more insightful answer is that these LLM-based AI assistants, even in their most powerful “reasoning” incarnations (LRMs, for Large Reasoning Models), can’t solve exactly, on their own, all instances of most classes of problems, for theoretical reasons rooted in computational complexity. In particular, vanilla LLMs primarily learn reasoning patterns implicitly, through next-token prediction; the result is a limited formal reasoning ability that fails quickly on puzzles of even modest size. LRMs are designed and trained to engage in more systematic reasoning processes: they extend their reasoning capabilities through a variety of methods (Chain-of-Thought, Tree-of-Thought, step-by-step decomposition, beam search, backtracking, iterative refinement, and many more). If one were willing to allocate arbitrary time and resources to them (which real LRMs do not get), they might in principle mimic a sound and complete algorithm and solve larger and larger instances. But even with unlimited resources, LRMs would incur massive computational overhead compared to an optimal, dedicated algorithmic solution, and this overhead would leave all but the smallest instances unsolved, for some measure of “small” (or, more frequently, answered with a plausible yet wrong solution). Indeed, the most efficient way for an AI assistant to correctly solve large puzzles (especially a variant like our a+b) is to invoke an external, dedicated reasoner, or to code a piece of software capable of solving them (a sketch follows this note). This is true not only of our Hanoi concoction, but of all puzzles with a strong combinatorial core (e.g., NP-hard problems). These limits of LLMs and LRMs have been studied theoretically and showcased experimentally for a variety of problems and puzzles; see, e.g., (Shojaee et al., 2025), (Kambhampati et al., 2024), (Fan et al., 2024), (Hazra et al., 2025), (Lin et al., 2025).
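To make that last point concrete, here is the kind of small, dedicated solver an assistant could emit instead of reasoning through moves token by token: a breadth-first search over the 3^n-state space that returns a provably optimal move count between two arbitrary configurations (our variant (a), single-disk moves only). This is a minimal sketch under a state encoding of my own choosing, not the solver developed in this article:

    from collections import deque

    def neighbors(state):
        """One-move successors. state[i] is the peg (0-2) holding disk i,
        disk 0 being the smallest; the smallest disk on a peg is its top."""
        tops = {}
        for disk, peg in enumerate(state):
            if peg not in tops:           # first disk seen per peg is its top
                tops[peg] = disk
        for peg, disk in tops.items():
            for dest in range(3):
                # Legal iff dest is another peg that is empty or whose top is larger.
                if dest != peg and tops.get(dest, len(state)) > disk:
                    s = list(state)
                    s[disk] = dest
                    yield tuple(s)

    def optimal_moves(start, goal):
        """Breadth-first search: exact shortest move count between two configs."""
        frontier, seen = deque([(start, 0)]), {start}
        while frontier:
            state, dist = frontier.popleft()
            if state == goal:
                return dist
            for nxt in neighbors(state):
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, dist + 1))

    # Four disks, all on peg 0, to all on peg 2: 2**4 - 1 = 15 moves.
    print(optimal_moves((0, 0, 0, 0), (2, 2, 2, 2)))  # 15

The exhaustive search is exact but may visit all 3^n states; the point of Note 1 is precisely that, for variant (a), a deterministic geodesic construction avoids this blow-up.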
Note 3: On the Presence of Ambiguity in Programming Languages. Section 11 is not meant to suggest that ambiguity could not, and did not, percolate into traditional programming frameworks: it certainly did! And the culprit typically was… the use of natural language to specify (some aspects of) artificial languages! To name just one renowned example among hundreds, the reference syntax and semantics of the original version of the C language were specified mostly in English by Kernighan and Ritchie of Bell Labs in their book (K&R, 1978). I read the first edition in the mid-80s, and it was as gripping as any good novel. Those who implemented the early C compilers, whose code and behavior embodied the actual syntax and semantics of the language, were left by the prose with room for interpretation and a few degrees of freedom. As a result, the chances that different C compilers on different platforms (PDP-11, VAX, UNIX clones, MS-DOS), or even two different compilers on the same platform (e.g., Borland Turbo C vs. Microsoft C on MS-DOS), would accept the same syntax for any non-trivial piece of code and produce equivalent machine instructions were quite low. Several undefined behaviors persisted that could alter computational results; edge cases were almost always treated differently by different compilers; code portability was poor; and programmers were never really sure what their code ultimately did or meant before compiling and testing it on a specific platform with a specific compiler. It took ANSI a decade to pin down the ambiguities and specify/standardize everything in the C89 standard.

Over time, the room for ambiguity in new programming language specifications has been largely, but not entirely, eliminated. For example, try dividing 5 by 2 in Python, a modern language that is the target of this vibe coding experiment, and you’ll realize that the result (2? 2.5? A warning?) depends on the Python version: 2.x silently performs floor division on integer operands and returns 2, while 3.x, after PEP 238 redefined the / operator, performs true division and returns 2.5 (the snippet after this note shows both behaviors). In practice the result can also vary with the implementation at hand (CPython, Jython, PyPy), insofar as each implements a particular language version: Jython’s stable releases, for instance, still implement 2.x semantics. The root cause is that the original semantics of dividing two values that may be integers, or may be implicitly promoted to floats, proved surprising enough to be redefined across major versions.

Of course, we’re comparing apples to oranges here: these kinds of ambiguities “probabilistically generate tools” that speak different dialects, but each one is fully deterministic and reliably consistent; whereas AI coding agents based on LLMs “generate probabilistic tools”. Languages, you know: flipping two words changes the meaning entirely!
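The division example from Note 3, runnable as-is under Python 3; the Python 2 behavior is recorded in the comments:

    # Python 3: "/" is true division, "//" is floor division.
    print(5 / 2)    # 2.5
    print(5 // 2)   # 2

    # Under Python 2 the very same expression 5 / 2 evaluated to 2, because
    # "/" performed floor division on integer operands. PEP 238 redefined
    # "/" as true division; "from __future__ import division" let 2.x code
    # opt in to the new semantics ahead of the switch.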