ZDNET's key takeaways
Codex struggles with big-picture debugging in complex codebases.
Deep Research excels at diagnosis when code context is large.
Human testing and oversight remain critical with AI coding.
"Huh?!?" Sometimes, when I'm coding and something doesn't behave quite right, and I'm not entirely sure what's up, my brain fires off an internal "Huh?!?" I think it's my way of recognizing "there be dragons" but without escalating into a full-tilt panic loop.
A few days into my AI-coding productizing process, after my four-day uber-performance AI-assisted programming sprint, something wasn't quite right. At first, it didn't seem terribly wrong (which was a misjudgment, because it actually was).
Also: I got 4 years of product development done in 4 days for $200, and I'm still stunned
I eventually solved the problem using both OpenAI's Codex and ChatGPT Deep Research. That proved to be a necessary team-up. I'll explain why in short order.
But first, let's deconstruct the "Huh?!?"
Is it even a bug?
This all took place after my big coding sprint. I built four add-on products for my security product. Once the main coding was done, there was still a lot of work, both on the marketing and documentation side and on the distribution and operations side.
One major task was testing. After that, I had to zip it all up so my online store could distribute installable plugin packages to my users.
Also: 10 ChatGPT Codex secrets I only learned after 60 hours of pair programming with it
It was here that I noticed something odd. The WordPress dashboard would go unresponsive for 15 to 20 seconds after a click. But this only occurred after I had been away from my development environment's WordPress dashboard for some number of hours.
My first access in the morning locked up for about a quarter of a minute. But after that, it behaved just fine. The time away had to be fairly long before the behavior would manifest again. It would lock up again if I came back to it after going off to do something else, like write an article.
I wasn't even sure this was a bug in my code. It could have been something about my system, or the build, or WordPress, or even just my imagination.
Trying to diagnose the issue (part 1)
I tried describing the problem to Codex, but because I wasn't even sure it was a problem, I wasn't giving it the best guidance. Codex wasn't able to shed any light on the matter.
Also: The best AI for coding in 2025 (including a new winner - and what not to use)
So I had Codex build a diagnostic platform that instrumented WordPress startup. I had it catch every hook, every call, and every time delay, and record them in a diagnostics console.
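To give a rough sense of what that kind of instrumentation looks like, here's a minimal sketch of the concept. This is not the diagnostics console Codex actually built; the hook list, file placement, and log format are my own illustrative choices.

```php
<?php
/**
 * Minimal sketch: log how long after the request starts each major
 * WordPress load milestone fires. Could be dropped into wp-content/mu-plugins/.
 */
$diag_start = microtime( true );

foreach ( array( 'muplugins_loaded', 'plugins_loaded', 'init', 'wp_loaded', 'shutdown' ) as $diag_hook ) {
	add_action( $diag_hook, function () use ( $diag_hook, $diag_start ) {
		// Record the elapsed time when this milestone fires.
		error_log( sprintf( '[diag] %s fired at %.3fs', $diag_hook, microtime( true ) - $diag_start ) );
	}, PHP_INT_MAX );
}
```

Even a crude log like this makes it obvious whether a slow startup is burning its time before or after the plugins load.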
Unfortunately, nothing particularly notable was recorded in the diagnostics telemetry. Making matters worse, I really only had a once-a-day chance to catch anything because the only reliable manifestation of the problem was the first time I tried to use my test site at the beginning of the day.
Nothing. Nada. Zip. Zilch.
Oh, yeah, it's definitely a bug
After working on this for a few days, I decided it wasn't really a bug. It was just some quirk of my development environment. I moved on to recording the tutorial videos for each of the plugins.
Those of you who have been following along on my AI coding saga will recall that the first product I generated using Codex was a site analysis tool. It captures events happening to your site (failed logins, AI bot visits, search spiders, etc.) and presents both the raw data and a clear analysis.
Also: I did 24 days of coding in 12 hours with a $20 AI tool - but there's one big pitfall
Demoing this on my development environment wasn't particularly interesting because the data would, of necessity, only be test data. So I decided to put the tool up on my main user-facing server, the one used to support and sell these products. That server gets quite a bit of traffic, making it a good test case for a visitor analysis tool like mine.
I also needed to put the main security product up there because the visitor tool is an add-on to it. So I installed the latest build. Bad move? Good move? Bad result.
It was a bad move because it made my site unusable. Clicking on anything required a wait of a minute or more before something happened. It wasn't only in the admin dashboard. Users visiting the front end of the site experienced the same behavior.
It was a good move, because it became immediately obvious that I had a bug. This was not some minor thing that manifested when my test site slept overnight. No, on an active site, it rendered the site completely unusable.
But the result was bad, because it also became immediately obvious that there was no way I could ship the AI-created update to my users. Freezing my own site is one thing. Freezing 20,000 other sites on the internet? That would be very bad.
Also: The best free AI courses and certificates in 2025 - and I've tried many
The site was so slow that I couldn't access the plugin dashboard to disable it. I had to log in using my hosting provider's file manager and delete it from outside of WordPress.
Doing so immediately restored the site to proper operation, making it abundantly clear that the problem was with my updated security plugin.
Trying to diagnose the issue (part 2)
I reported this observation back to Codex. Since I started my big coding sprint using Codex on the ChatGPT Plus plan and then the Pro plan, Codex has proven to be surprisingly adept at debugging.
But not this time. As I discussed in "10 ChatGPT Codex secrets I only learned after 60 hours of pair programming with it," I have concluded that Codex doesn't work well with large assignments.
Have you ever gone to the drive-through at a fast food place that's normally reliable for your usual one-meal order, but this time you placed a really big order? A place that seems operationally solid on single-meal orders almost always screws up the big ones, especially if there are any special requests.
Also: Your colleagues are sick of your AI workslop
Codex also doesn't handle big orders well. They invariably come back as a useless mess. I've been enormously successful in the past when breaking down large projects by doing one piece at a time, but that didn't work for this problem.
This was a systems problem. Something about my entire codebase, as it stood after Codex began working on it, was causing the freezes. I couldn't point Codex at one small area, tell it what to do, and wait for it to come up with an answer. It had to look at everything.
I tried to tell it that the problem manifested since it started working on my code, but it didn't recall when that was. Codex knows only the current session and anything passed along to it on purpose between sessions. But it has no real memory of what it did, so it doesn't have much of a framework to look at what it might have broken.
I gave it about 20 different prompts. Each time, it went away to think. It started to feel like buying a car, where the salesman has to run back to "discuss it with his manager" at every step of the way. Codex needed to go away and think for 5 or 10 minutes, and then come back with what was invariably a useless or nonfunctional "fix."
Also: AI magnifies your team's strengths - and weaknesses, Google report finds
I was very frustrated. I knew I could go back into the code myself and try to diagnose the issue. I've acquired a bunch of products over the years, absorbing other people's code and developing an understanding of it, so it's a skill I do have. But I also knew that doing so would mean embarking on what would probably be weeks of frustrating work, which would essentially eliminate all the productivity gains I had achieved from pair programming with the AI.
There had to be a better way.
Enter ChatGPT Deep Research
If Codex is fairly terrible at big-picture work, ChatGPT's Deep Research specializes in it. I decided to give the problem to Big D.
Deep Research has access to the GitHub repo that contains my project, so the logistics of examining the code were no problem. I explained the problem and set it loose.
Also: AI is every developer's new reality - 5 ways to make the most of it
About a half hour later, it came back. It blamed all my original code. It had a laundry list of places in my original code where there could be very minor slowdowns, a millisecond here, a millisecond there.
But my code worked. I've been shipping the code Deep Research complained about for years, and it's running on more than 20,000 sites. If that code were causing a major slowdown, I would have heard about it.
But unlike Codex, which only really works on code found in GitHub repos or VS Code workspaces, Deep Research can accept any file, including zip files.
So I gave it the distribution zip for version 3.2 of my security software. I've been shipping 3.2 for four months and it's installed on 45.6% of my 20,000+ users' sites. We know that version isn't causing the problem. Right now, Codex and I are working on what will be released as 4.0, and it was 4.0 that had the freeze problem.
Also: AI is more likely to transform your job than replace it, Indeed finds
I told Deep Research exactly that: 3.2 worked fine. Then I told it to look at 4.0, examining only what was added since the 3.2 release. That focused its analysis run considerably.
And guess what? It figured it out. It found a number of concerns. The biggest concern was that my main plugin was checking the status of a robots.txt file every single time a user accessed the site.
This was something that only needed to be checked once, to determine if some features could load. But it was running constantly. On an active site (rather than my development machine), those checks tied up the PHP interpreter until they completed. It effectively killed the server.
Deep Research identified the culprit.
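For illustration, here is a hypothetical reconstruction of that kind of anti-pattern. This is not the actual plugin code; the hook choice and variable names are my own, but it shows why a per-request remote check hurts so much.

```php
<?php
// Hypothetical reconstruction of the anti-pattern Deep Research flagged
// (not the actual plugin code): a remote check that runs on every request.
add_action( 'init', function () {
	// This blocks page generation on an HTTP round trip, front end and
	// admin alike. On a busy site, it ties up PHP workers on every hit.
	$response  = wp_remote_get( home_url( '/robots.txt' ), array( 'timeout' => 10 ) );
	$robots_ok = ( 200 === wp_remote_retrieve_response_code( $response ) );
	// ...gate optional features on $robots_ok...
} );
```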
Living in the future is cool
This is where things went from "my project is so screwed" to "I'm living in the future." I took the results from Deep Research and explained them to Codex.
Since it was a fairly narrow explanation of a situation, Codex was immediately able to zero in on the problem. Its first solution was workable but still a bit problematic. We discussed it back and forth, and I was able to give it clear instructions.
Also: The best AI for coding in 2025 (and what not to use)
I told it to check the status of that file exactly once and remember the status. Then, I asked it to give me a button that a user could click to request a recheck of the status if something had changed in the server's configuration.
Instead of running and freezing the server for every single web access, it ran once on startup, and once more if a website owner requested a recheck.
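One way to express that "check once, remember, recheck on demand" pattern is to persist the result in a WordPress option and clear it when the owner asks for a recheck. This is a sketch under that assumption, not the code Codex actually produced, and every myplugin_* name is an illustrative placeholder.

```php
<?php
// Sketch of "check once, remember, recheck on demand" using a persisted option.
function myplugin_robots_status() {
	$status = get_option( 'myplugin_robots_status', '' );
	if ( '' === $status ) {
		// Only hit robots.txt when there is no remembered answer.
		$response = wp_remote_get( home_url( '/robots.txt' ), array( 'timeout' => 5 ) );
		$status   = ( 200 === wp_remote_retrieve_response_code( $response ) ) ? 'ok' : 'missing';
		update_option( 'myplugin_robots_status', $status );
	}
	return $status;
}

// The dashboard "recheck" button posts to admin-post.php with this action.
add_action( 'admin_post_myplugin_recheck_robots', function () {
	check_admin_referer( 'myplugin_recheck_robots' );
	delete_option( 'myplugin_robots_status' );  // Forget the old answer...
	myplugin_robots_status();                   // ...and check again right now.
	wp_safe_redirect( wp_get_referer() ? wp_get_referer() : admin_url() );
	exit;
} );
```

The point of the design is that the expensive call stays off the hot path: it runs once on first use, and again only when the site owner explicitly asks.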
Codex gave me a build that was promising. I uploaded it to my active server to see how it performed. It's been running actively now for about three days and my server is running just fine.
Problem solved.
I really did feel like I brought in different team members to look at my code. Codex is my hired programmer. Deep Research is the specialist called in to diagnose an issue. And then Codex, as my staff programmer, went off and implemented the fix.
AIs don't solve everything
As amazing as this was to experience and work with, keep in mind that AIs aren't solving everything. First, the AI missed this problem. I only found out that this bug could have crippled my entire user base because of my own human testing.
Second, while the initial code was written very quickly, diagnosing this one problem took days. It took a lot of creative problem-solving on my part. Codex didn't suggest I call in Deep Research, and Deep Research didn't suggest comparing old code with new code. That was all human input.
Also: The fastest-growing AI chatbot lately? It's not ChatGPT or Gemini
Third, while I do have some cool AI overviews, productizing the software is taking time. I'm producing tutorials. I still have the product pages to put up. I haven't yet mastered the final distribution software. All that is my work. It's taking the time that doing that type of work takes.
I will have four new products on the market within a month of starting. Before AI, that would have taken years. But the fact that the AI coded it all in four days is only one piece of the puzzle. For every day of AI coding, there's roughly a week of testing and product management on my part. Then comes the marketing, which is an entirely different effort.
Still, we're getting close. I hope to ship this stuff sometime in the next week or so.
What about you? Have you ever run into a bug that only revealed itself after everything looked like it was working fine? Do you think teaming up different AI tools, like one for diagnosis, another for fixing, could become a standard workflow? Let us know in the comments below.