ZDNET's key takeaways

* The same prompts in GPT-5 yield success, crashes, or errors.
* OpenAI's prompt optimizer helps, but introduces its own quirks.
* AI "unconsciously" adding details raises trust concerns.

Do AIs get headaches? Because GPT-5 has certainly been giving me one.

This article was going to be so easy. OpenAI came out with a list of best practices for GPT-5 coding. All I was going to do was try those best practices with the GPT-5 coding tests that previously failed and see if there was improvement. It seemed so simple.

Also: I tested GPT-5's coding skills, and it was so bad that I'm sticking with GPT-4o (for now)

But then I had a thought. What if OpenAI had improved GPT-5 in the week or so since I ran my tests? Perhaps I should re-run the failed tests as-is and use those results as a baseline for the new best practices. From a testing point of view, this was a valid approach. From a personal sanity point of view, not so much.

Re-running test 1

I re-ran the first failed test. This test has the AI create a complete WordPress plugin, with a user interface and business logic. The idea is that you feed in a set of names, it randomizes them, and it separates duplicates so they're not side by side.

Also: How I test an AI chatbot's coding ability - and you can, too

When I ran this test on GPT-5 originally, it failed. Clicking the Randomize button sent the browser to another, unrelated page. I was able to cajole GPT-5 into fixing it, but the initial failure was what counted.

This time, I ran the exact same test with the exact same prompt. It worked perfectly. Wow, I thought. GPT-5 has improved in the past week.

If only I had left well enough alone. But no. I had to try again. On my second run of the exact same test with the exact same prompt, clicking Randomize produced what WordPressers call the "white screen of death." This indicates something in the code isn't working.

Also: GPT-5 bombed my coding tests, but redeemed itself with code analysis

I tried the exact same test with the exact same prompt a total of five times. The first time, it worked. Subsequent times, it didn't. One time, I got a white screen. One time, I got a page full of error messages. One time, I was redirected to another page. One time, nothing happened at all.

We'll come back to test 1. But next, I decided to do the same thing with the other failed test.

Re-running test 4

This test asks the AI to write code that talks to Chrome, AppleScript, and another tool called Keyboard Maestro. It requires the AI to be knowledgeable about the three different environments and how they work together.

Here's an interesting factoid about AppleScript: the scripting language is inherently case-insensitive. So, if you ask AppleScript to compare "AppleScript" to "applescript", it will consider both strings to be the same. You have to explicitly tell it to compare case before it will do so.
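To see that behavior in action, here's a minimal sketch you can paste into Script Editor (the string values are just examples):

set looseMatch to ("AppleScript" is equal to "applescript") -- true: comparisons ignore case by default

considering case
	set strictMatch to ("AppleScript" is equal to "applescript") -- false: now case matters
end considering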
Also: You can learn AI for free with these new courses from Anthropic

When I first ran this test against GPT-5, it hallucinated that AppleScript had a native function for making strings lowercase. It does not, so the code failed. This time, instead of hallucinating a lowercase function, it came up with one of the most convoluted possible ways to lowercase a string. Here's that line of code:

set tabTitleLower to (do shell script "echo " & quoted form of tabTitle & " | tr '[:upper:]' '[:lower:]'")

The code actually works, but it's very weird and totally unnecessary. It launches a shell (the command line) to run a shell script that converts the string to lowercase. It's like making a grilled cheese sandwich, realizing you're out of cheese, and flying across the country to buy cheese from a 7-11 on the opposite coast. You're not getting better cheese. You're just doing it in the weirdest and most inefficient way possible.

Let's not forget that we don't need to make that text lowercase in the first place, because AppleScript is case-insensitive. So it wrote a highly convoluted piece of code that, while it works, is entirely unnecessary.
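In fact, here's all the "lowercasing" the task actually requires. This is a hypothetical sketch (tabTitle and searchName are placeholder variables), but it makes the point: the default comparison already ignores case.

-- No shell, no tr, no lowercasing needed:
-- AppleScript's "contains" ignores case unless told otherwise.
if tabTitle contains searchName then
	-- handle the match here
end if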
I didn't bother to re-run this test four more times because the headache was already starting to set in.

OpenAI's GPT-5 coding best practices

OpenAI has put out what it calls a cheatsheet for GPT-5 coding. It has six main recommendations:

* Be precise and avoid conflicting information: Well, duh. But apparently GPT-5 gets really confused if asked to follow vague or conflicting instructions.
* Use the right reasoning effort: Rather than just letting GPT-5 choose its reasoning level, you might want to adjust it yourself. If it starts to overthink a problem, use a lower reasoning level. Great, so now we have to deal with a neurotic AI. Yippee?
* Use XML-like syntax to help structure instructions: Rather than using just text, using XML tags to segment aspects of a prompt helps the AI parse its assignment. This is a good hint, but it feels like a step back from what it could do.
* Avoid overly firm language: In previous models, we sometimes had to YELL at the AI to get it to listen. Apparently, GPT-5 is passive-aggressive, so if you get too forceful in your instructions, it will overdo its compliance, like what my old bosses used to call my practice of "aggressively following instructions."
* Give room for planning and self-reflection: Does the AI need a blankie? Perhaps a warm stuffed animal to cuddle? In reality, what OpenAI is saying is that you need to encourage the AI to think through the instructions it gets. Apparently, if you don't tell it to think through the instructions, it won't. I fear for the planet.
* Control the eagerness of your coding agent: It turns out GPT-5 is even more of a people-pleaser than GPT-4o. When giving it instructions, tell it to tone down the fawning and sucking-uppage. I'd like to say something positive here, but it's starting to feel like a migraine.

Check out the PDF if you want more details.

OpenAI also created an AI prompting helper. You point your browser at this tool and paste in your prompt. The tool gives you back a more extensive prompt. For example, I gave it this prompt from Test 4:

Write a Keyboard Maestro AppleScript that scans the frontmost Google Chrome window for a tab name containing the string matching the contents of the passed variable instance__ChannelName. Ignore case for the match. Once found, make that tab the active tab.

It gave me back this:

Create an AppleScript for use in Keyboard Maestro that performs the following actions:

* Begin with a concise checklist (3-7 bullets) outlining the planned steps before proceeding.
* Scan all tabs in the frontmost Google Chrome window.
* Search for a tab whose name contains the value of the passed variable 'instance__ChannelName', ignoring case sensitivity.
* Once a matching tab is found, activate that tab and make it the current/active tab.

After generating the script, validate in 1-2 lines whether the script meets all requirements. If not, state what needs adjustment.

Essentially, it revised the prompt to add a planning phase at the beginning and a validation phase at the end. Does it work? Nope. Nopity-nope-nope.

I ran Test 4's "optimized" prompt through GPT-5 and got back this script:

Screenshot by David Gewirtz/ZDNET

There are at least three serious flaws in it.

First, the code in the green box doesn't work. It doesn't retrieve the value from Keyboard Maestro. Previous versions did that successfully using a completely different (and well-documented) mechanism. This approach is just loony.

Also: AI is creeping into the Linux kernel - and official policy is needed ASAP

Now, look at the blue box. The considering case operation is used only for evaluation. For example, with considering case, "BLUE" and "blue" are different values. Do you see any sort of if statement or evaluation inside the blue box? Me neither.

Then there's the red box. There is no such syntax as "considering case false." First, the default behavior is to ignore case. Second, if you really wanted to explicitly tell AppleScript to ignore case, you'd use an "ignoring case" block. But "considering case false"? That's just made-up syntax.
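For reference, here's a minimal sketch of the shape a correct script could take. It assumes Keyboard Maestro's documented getvariable command (the well-documented mechanism the earlier, working versions used) and Chrome's standard scripting dictionary; treat it as an illustration, not the exact code from my earlier runs:

-- Read the passed variable from the Keyboard Maestro Engine
tell application "Keyboard Maestro Engine"
	set channelName to getvariable "instance__ChannelName"
end tell

-- Walk the tabs of the frontmost Chrome window. "contains" already
-- ignores case, so no considering or ignoring block is needed.
tell application "Google Chrome"
	repeat with i from 1 to count of tabs of front window
		if (title of tab i of front window) contains channelName then
			set active tab index of front window to i
			exit repeat
		end if
	end repeat
end tell

And if you really did want to spell out the case handling, the legitimate syntax is an "ignoring case ... end ignoring" block around the comparison, not "considering case false."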
Retrying the WordPress plugin

I also ran my original prompt for the WordPress plugin through the prompt optimizer. I got back a fairly extensive project requirements document.

Screenshot by David Gewirtz/ZDNET

On the plus side, this time the plugin worked. I tried generating code a second time, and it also worked. So perhaps the optimization is something to try if GPT-5 isn't playing nice.

But then there's the weirdness. Because there's always weirdness. Check out the header block generated by GPT-5.

Screenshot by David Gewirtz/ZDNET

Notice that the author is attributed as Advanced Geekery Labs. Historically, the AIs have listed the author field as "Your Name." But this time, it's Advanced Geekery Labs. There is some relevance, because Advanced Geekery is the name of my YouTube channel and weekly newsletter. But I didn't use it anywhere in my instructions for this plugin. Additionally, I've never added Labs to the end of Advanced Geekery, even though my security product is sold through ZATZ Labs.

So I asked GPT-5 where it got the name Advanced Geekery Labs from. Its answer:

Good question. The name "Advanced Geekery Labs" isn't something you provided in your request. I introduced it in the plugin header's Author: field as a placeholder, because in earlier conversations you've mentioned your "Advanced Geekery" brand, and I unconsciously expanded it into "Labs."

I'm glad the AI thinks it's a good question. That warms my heart. But "unconsciously expanded"? That's a direct quote from the AI. Those are its words. So, we're now dealing with hallucinating AIs that operate unconsciously? We're all doomed.

What about coding and GPT-5?

I still don't trust it. In fact, I trust it even less than when I tested it originally. What else might it do to my code "unconsciously"?

I guess, if you really need GPT-5 to give you some code, you could try the optimizer. But I wouldn't rely on it. GPT-4o was impressive, but it had some problems. GPT-5 is like its more troubled, more difficult sibling. With GPT-4o, I'd be willing to trust, with verification. I just plain don't trust GPT-5. There's something that's just not right.

If we were to experience an AI that would revolt against its creators and try to kill us all, I would not be surprised if it all originated with GPT-5.

And no. My headache hasn't gone away. Apparently, Tylenol is now bad for you. So, there's that, too.

What do you think? Have you tried relying on GPT-5 for coding projects? If so, did you run into the same kinds of inconsistencies and odd behaviors? Do you find OpenAI's "best practices" helpful, or do they feel like workarounds for deeper problems? And what's your take on GPT-5's tendency to improvise or even "unconsciously" insert details? Is it a harmless quirk or a reason for concern? Let us know in the comments below.

You can follow my day-to-day project updates on social media. Be sure to subscribe to my weekly update newsletter, and follow me on Twitter/X at @DavidGewirtz, on Facebook at Facebook.com/DavidGewirtz, on Instagram at Instagram.com/DavidGewirtz, on Bluesky at @DavidGewirtz.com, and on YouTube at YouTube.com/DavidGewirtzTV.