Last week, OpenAI unveiled Agent, its new tool that combines the capabilities of Deep Research and Operator. Operator was OpenAI's first attempt at a computer-using model, a model that can actually open windows and click on user interface elements. ChatGPT Agent can do that and more.

Right now, ChatGPT Agent is available only to $200/mo Pro tier subscribers and provides 400 agent interactions per month. When the $20/mo Plus tier gains access to Agent, which should be today, those users will get 40 interactions per month.

Also: Microsoft is saving millions with AI and laying off thousands - where do we go from here?

(Disclosure: Ziff Davis, ZDNET's parent company, filed an April 2025 lawsuit against OpenAI, alleging it infringed Ziff Davis copyrights in training and operating its AI systems.)

I upgraded my plan from Plus to Pro just so I could test out the new Agent mode and report back to you. In this article, I'll show you detailed results from eight comprehensive tests.

TL;DR test results

Before we go into the detailed tests, I'll start with some overall TL;DR observations.

Test count: In the past two days, I used 25 of the available 400 queries, for a total of almost 12 hours of hyper-uber-supercomputer use. No wonder this thing costs $200/month.

Also: I found 5 AI content detectors that can correctly identify AI text 100% of the time

Nearly every query required a follow-on, so when Plus users get access, don't assume you can give Agent 40 projects. More likely, you'll be giving it 20-25, and using the rest of your queries to convince the Agent to follow directions.

Screenshot by David Gewirtz/ZDNET

Result quality: In all my tests, Agent appeared to understand the problem. But it failed to produce useful results for most of the tests. That said, the final test produced results that can only be characterized as amazingly useful.

Project scale: Agent can't handle big projects, the sort of data analysis projects you really want an AI to be able to handle. It has trouble scrolling through web pages. It can't visit sites that have AI or robots.txt restrictions in place (see the robots.txt sketch just after this list). And long processing exceeds session time allocations, even with the super top-of-the-line gold-pressed latinum Pro edition.

Presentation quality: One of the major pitch points for Agent is its ability to create spreadsheets and presentations. It did okay with spreadsheets, but the graphic quality of the presentations was pretty rough. I expect this to change over time, but don't expect Agent to make presentations you can use without considerable cleanup.

Accuracy: AIs hallucinate. The OpenAI team cautioned about using Agent because of the new risks involved. While I did get back some results that were accurate, Agent also came back with unforced errors, results it could easily have tested and flagged as inaccurate. But no such verification or validation occurred. That said, the final test was accurate and shows what this tech can do when it works.

Connectors: Agent comes with the ability to use connectors (via API calls) to link to Gmail, Google Calendar, Google Drive, Outlook, Dropbox, and more. I did not test the connectors because of how often Agent hallucinates or does something fairly boneheaded. I just didn't feel comfortable enough to give Skynet access to my accounts. At least, not yet.

Screenshot by David Gewirtz/ZDNET

Limits: I was unable to use Agent in the macOS app. I also found that Agent stalled hard when I tried to run it in multiple Chrome tabs at once. For now, you launch an Agent process and wait. It's not like Codex, where you can launch a bunch of projects and come back later to harvest all the results. But since that capability exists in Codex, I'm sure it will show up soon in Agent.

Screenshot by David Gewirtz/ZDNET
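About those robots.txt restrictions: blocking usually takes only a few lines in a site's robots.txt file. Here's a minimal, hypothetical example using GPTBot and ChatGPT-User, two of the crawler names OpenAI publishes. A site serving rules like these effectively closes its doors to Agent's browsing.

```
# Hypothetical robots.txt. GPTBot and ChatGPT-User are crawler names
# OpenAI publishes; "Disallow: /" shuts them out of the whole site.
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /
```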
That should give you a pretty good overview. Let's get started looking at the eight test results. For each result, I've included a link to the session recording, so you can see the prompts I used, review the detailed results, and watch Agent reason its way through the problem.

Also, definitely read to the end. Some of the early results are fairly bad, but the last one knocks it out of the park. And with that, here we go.

1. Selecting products on Amazon

Understanding of the problem: Solid
Execution: Both good and bad
Hallucination: Weird church reference, fake Amazon links
Processing time: 20 + 12 minutes

When OpenAI introduced ChatGPT Agent, the team demoed how they used the tool to shop for wedding clothes and a wedding gift. That seemed like a fairly uncommon and impractical application for a super-intelligence, especially since gift registries exist and are widely used.

Instead, I gave Agent a purchasing project I had actually researched extensively and completed a few months earlier. I'm running Power-over-Ethernet cables all across my yard to upgrade my security system. As such, I'm creating a lot of custom cables. I already know that doing so requires some key tools: a cutter to slice the cable, a cable end stripper, a crimper to attach the RJ-45 ends, and a tester to confirm that long cable runs work.

Also: How a circuit breaker finder helped me map my home's wiring (and why that matters)

I gave Agent a prompt asking for three configurations: a budget toolset, a "money-is-no-object" solution, and a sweet spot solution. I asked for links, product descriptions, and product images.

Once you give Agent your prompt, it creates a virtual desktop. You can watch it conducting its activities, jumping between a desktop view, a text view, and code.

Screenshot by David Gewirtz/ZDNET

The budget solution turned out to be a win. Agent found a single $34 kit with everything I asked for. It presented a link, and even its reasoning for choosing that solution. Unfortunately, the image it provided was nothing like the actual kit.

Screenshot by David Gewirtz/ZDNET

The mid-tier and top-tier solutions were less than perfect. None of the links worked. The mid-tier sweet spot solution did have a product-accurate image, but without a link, it wasn't really helpful.

Screenshot by David Gewirtz/ZDNET

Unfortunately, the recommended model doesn't actually exist on Amazon. In fact, none of the mid- or upper-tier products exist on Amazon. It looks like Agent did a pile of web surfing to find the products, disregarding my instructions to search only on Amazon.

Screenshot by David Gewirtz/ZDNET

It also clearly visited other sites, probably gathering model names and descriptions.

Screenshot by David Gewirtz/ZDNET

Then, when it packaged up its final recommendations, it just assigned random Amazon links to the descriptions, even though those products and those links don't seem to exist on Amazon.

Screenshot by David Gewirtz/ZDNET

I asked it to go back and try again. When it did, after 12 minutes, it presented most of the same products, although one of the links that had failed earlier did, in fact, point to a product on Amazon in the second run.
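Checking those links is not hard, which is what makes the failure so galling. Here's a minimal sketch of the kind of validation pass I mean, not anything Agent actually runs. The product URL is a made-up placeholder, and Amazon may throttle automated requests, so treat this as illustrative:

```python
# Minimal link sanity check: fetch each recommended URL and flag failures.
# The URL below is a hypothetical placeholder, not a real product link.
import requests

links = [
    "https://www.amazon.com/dp/B000EXAMPLE",
]

for url in links:
    try:
        resp = requests.get(url, timeout=10, allow_redirects=True,
                            headers={"User-Agent": "Mozilla/5.0"})
        status = "OK" if resp.status_code == 200 else f"BROKEN ({resp.status_code})"
    except requests.RequestException as err:
        status = f"BROKEN ({type(err).__name__})"
    print(f"{status}  {url}")
```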
Also: Coding with AI? My top 5 tips for vetting its output - and staying out of trouble

I can't leave this section without pointing out something just plain weird. As I was watching Agent work, it presented this in its desktop view. I don't even want to know.

Screenshot by David Gewirtz/ZDNET

You can watch a replay of the entire session here.

2. Comparing egg prices

Understanding of the problem: Solid
Execution: Did what I asked
Hallucination: My fault for imprecise prompting
Processing time: 14 minutes

In discussing ChatGPT Agent, OpenAI showed a slide that mentioned Instacart as one of the examples that the chatbot is comfortable working with. Since my family regularly uses Instacart, I decided to set Agent loose and see what it could tell me about egg prices at our local stores. I didn't let Agent have access to my account, but I shared my ZIP code here in Salem, Oregon. I told it to "Please visit all the grocery stores on Instacart and compare egg prices."

Also: How to use ChatGPT to write code - and my top trick for debugging what it generates

It did exactly that. You've heard the phrase Garbage In, Garbage Out. Well, that's what happens when you ask an AI to look at "all the grocery stores." I should have asked it to look only within a 5- or 10-mile radius. But I didn't.

Screenshot by David Gewirtz/ZDNET

Agent came back with 21 stores, ranging from nearby to almost 47 miles away. It did accomplish what I asked, comparing egg prices. Without prompting, it decided to rank the eggs by price. This was good. But when it chose the eggs to rank, it didn't always choose the least expensive product from each store. For example, it recommended the Good & Gather eggs from Target at $2.99 a dozen, rather than the $1.99/dozen Market Pantry eggs, also from Target.

Screenshot by David Gewirtz/ZDNET

You can watch a replay of the entire session here.
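The logic Agent fumbled is the classic "cheapest item per group" step: take the minimum price within each store before ranking the stores. A minimal sketch of the right approach, with invented sample prices rather than real Instacart listings (the two Target rows mirror the Good & Gather versus Market Pantry miss):

```python
# Pick the cheapest dozen at each store, then rank stores by that price.
# All prices here are invented sample data, not real Instacart listings.
from collections import defaultdict

rows = [
    ("Target", "Good & Gather Grade A Large, dozen", 2.99),
    ("Target", "Market Pantry Grade A Large, dozen", 1.99),
    ("Safeway", "Lucerne Grade AA Large, dozen", 3.49),
]

cheapest = defaultdict(lambda: (None, float("inf")))
for store, product, price in rows:
    if price < cheapest[store][1]:
        cheapest[store] = (product, price)

for store, (product, price) in sorted(cheapest.items(), key=lambda kv: kv[1][1]):
    print(f"${price:.2f}  {store}: {product}")
```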
3. Creating a PowerPoint slide

Understanding of the problem: Solid
Execution: Added the correct data point
Hallucination: Was unable to reproduce graphic quality
Processing time: 10 minutes

Next up is a project I did early last week. With Congress focusing on Bitcoin, my editor asked me to update my Bitcoin investment article, where I've been tracking the value of a $50 Bitcoin investment since 2022. The value of my holdings went up, which means I needed to add a new slide.

Each new slide adds a date value on the X axis and a value point on the Y axis. From a PowerPoint fiddling standpoint, that meant moving the graphics over to make room for the new value and, in this case, adjusting the vertical scale to accommodate a substantial rise in value.

Also: The best free AI courses

When I did it, it took me about 45 minutes. Since OpenAI said that PowerPoint was one of ChatGPT Agent's strengths, I wanted to see if Agent could save me that time in the future. I uploaded my existing slide deck minus the last slide I made for the article. Then I asked Agent to create that slide for me.

As it worked, the desktop view showed the terminal interface. You can see how Agent is putting together the code to generate a graphic image.

Screenshot by David Gewirtz/ZDNET

Here's what that slide should have looked like (note: foreshadowing).

Screenshot by David Gewirtz/ZDNET

Here's what Agent gave me.

Screenshot by David Gewirtz/ZDNET

To be fair, Agent clearly understood the problem. It moved the existing data points over to the left to make room for the new node. It also placed the new Bitcoin item properly in relation to the existing ones, and added both price and percentage change text blocks. That means Agent read and understood the context of my PowerPoint deck's layout. That, in and of itself, is very impressive.

Also: The best AI for coding in 2025 (and what not to use)

But it failed on adding more scale lines and new Y-axis values. It failed on reproducing the fonts. It failed on properly placing the text blocks. And it pushed the entire graphic up and to the left of the slide. I'm guessing the graphics library that Agent uses isn't really up to the task of making fine graphic changes. That will undoubtedly improve over time.

You can watch a replay of the entire session here.
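For a sense of what this kind of scripted slide surgery looks like: OpenAI doesn't say which library Agent uses, but python-pptx is the standard tool for programmatic PowerPoint edits, so here's a sketch under that assumption. The file names, the shape-naming scheme, and the placeholder value are all hypothetical:

```python
# Sketch of a scripted chart-point addition, in the general style Agent
# appears to use. python-pptx is an assumption; file and shape names are
# hypothetical, not from my actual deck.
from pptx import Presentation
from pptx.util import Inches, Pt

prs = Presentation("bitcoin-tracker.pptx")   # hypothetical deck name
slide = list(prs.slides)[-1]                 # work on the latest slide

# Shift existing data-point shapes left to make room for the new node.
for shape in slide.shapes:
    if shape.name.startswith("DataPoint") and shape.left is not None:
        shape.left = shape.left - Inches(0.5)

# Add the new value as a text block at the right edge of the chart area.
box = slide.shapes.add_textbox(Inches(8.5), Inches(2.0), Inches(1.2), Inches(0.5))
box.text_frame.text = "$XX.XX"               # placeholder value
box.text_frame.paragraphs[0].font.size = Pt(12)

prs.save("bitcoin-tracker-updated.pptx")
```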
4. Article categorization (method II)

Understanding of the problem: Solid
Execution: Failed due to exceeding allowable session time
Hallucination: Gave me back partial results
Processing time: 8 minutes + 3 minutes + 21 minutes

Each week for the past two years, I've published a newsletter that shares with followers the articles I published here on ZDNET for the week. Each newsletter contains a title, link, and article description. By pointing Agent to my back issue archive, it would have close to 300 article summaries to categorize.

Unfortunately, Agent ran into a number of problems of its own making. It was unable to successfully scroll through the article list using JavaScript. When I told it to use the web interface, it started to, but it reported, "Unfortunately, I've reached the end of the allotted browsing sessions for this task, which means I'm unable to explore further pages and collect the additional data at this time."

Also: Is ChatGPT Plus really worth $20 when the free version offers so many premium features?

Remember, I'm paying $200 a month for OpenAI's best plan, and it still won't give me enough time to look up 300 articles. That's a gotcha, right there. It's also disappointing because a task like scrolling back through an article archive and doing some tabulating is exactly the sort of task you might give to an assistant. If the AI gives up because it takes too long, then we can't really rely on AI for all those assistant-type tasks. No one wants a fussy, picky assistant.

In any case, Agent did give me back a spreadsheet and a slide based on the limited data it was able to find before my little request exceeded the hourly power budget for the City of Las Vegas (or so I imagine).

Screenshot by David Gewirtz/ZDNET

You can watch a replay of the entire session here.

5. Extract remembered text from video

Understanding of the problem: Partial
Execution: Didn't return full transcript on first run, correct on second run
Hallucination: Decided to do what it wanted on first run
Processing time: 2 minutes

I watch a lot of YouTube videos to augment my learning and research. Plus, nothing beats a good relaxing video about how pavers are made. While it's fairly easy to get a transcript of a full video, whether directly from YouTube or using Apple Voice Memos, locating where in a video a segment you want to explore can take time.

Here's an example. When OpenAI introduced Agent in a video, CEO Sam Altman discussed some of the cautions and warnings about using ChatGPT Agent mode. I remembered they were near the end of the video, but I didn't want to spend time sifting through to get the exact quotes. Instead, I delegated that assignment to Agent.

On its first run, it found the segment easily enough, but instead of returning a word-for-word transcript, it returned some quotes, interspersed with its own analysis.

Also: I mapped my iPhone's Control Button to ChatGPT - here are 5 ways I use it every day

I clarified what I wanted and, on its second run, it gave me exactly what I needed. In this case, though, it wasn't that my prompt was unclear. I just had to insist a second time that I wanted a transcript for the AI to do what I asked.

Unfortunately, this extra review cycle diminished the time-saving value to me. I still think using Agent was faster than sifting through the video myself. But I had to construct a second prompt and wait for a second result, all of which took my time. Still, this is a helpful tool.

You can watch a replay of the entire session here.
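If you'd rather skip the agent entirely for this kind of job, it's scriptable. Here's a minimal sketch using the third-party youtube-transcript-api package (pip install youtube-transcript-api; the call shown is from its long-standing releases and may differ in newer versions). The video ID and search phrase are placeholders, not the actual OpenAI launch video:

```python
# Sketch: pull a YouTube transcript and print timestamps for every line
# containing a remembered phrase, so you can jump straight to the segment.
# VIDEO_ID and PHRASE are placeholders for illustration only.
from youtube_transcript_api import YouTubeTranscriptApi

VIDEO_ID = "dQw4w9WgXcQ"   # placeholder video ID
PHRASE = "caution"         # placeholder phrase to locate

for entry in YouTubeTranscriptApi.get_transcript(VIDEO_ID):
    if PHRASE in entry["text"].lower():
        minutes, seconds = divmod(int(entry["start"]), 60)
        print(f"[{minutes:02d}:{seconds:02d}] {entry['text']}")
```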
6. Creating a trend analysis presentation

Understanding of the problem: Solid
Execution: Good, except for slide visual quality
Hallucination: Too much data to confirm or deny assertions
Processing time: 32 minutes

Part of my job is keeping up with ongoing tech and business trends. As such, I often spend days in deep dives, coming up to speed on new topics. I wanted to see if ChatGPT Agent could save me some time by preparing a report and a full presentation on remote work trends. I told it that the PowerPoint was destined for my management team, so it should be comprehensive and professional-looking.

It returned an analysis document very similar to the results we've been getting from ChatGPT's Deep Research. The report contains a large number of assertions and statistical claims, most of which I don't have time to research for confirmation.

Also: ChatGPT can record, transcribe, and analyze your meetings now

Most of the top-level conclusions are congruent with my understanding of current work-from-home trends. That said, we're familiar with the model's propensity for hallucination, so I'd be very concerned about using any of this data professionally without additional vetting.

Agent did produce a 17-slide PowerPoint deck that was organized quite well. As with previous experiments, the graphic generation quality was a bit off. The first slide actually looks quite good.

Screenshot by David Gewirtz/ZDNET

But later in the deck, it doesn't look right. Notice how the following slide has graphics on top of text, and bullets in front of bullets on top of empty bullets.

Screenshot by David Gewirtz/ZDNET

In the following slide, not only is the text running off the edge of the slide, but there's no legend. As such, it's not clear what's represented by red and by blue.

Screenshot by David Gewirtz/ZDNET

Once again, you can see how Python is used to construct the deck.

Screenshot by David Gewirtz/ZDNET

Agent does a fair job, so I'm fairly confident that the AI will get better over time. Programmatic construction of slides based on templates is not a new technology. I just don't think OpenAI prioritized slide presentation aesthetics as part of this release.

You can watch a replay of the entire session here.

7. Vetting a presentation for accuracy

Understanding of the problem: Solid
Execution: Good
Hallucination: Seems complete, but it's still from an AI
Processing time: 11 minutes + 7 minutes

Well, this was just plain fun. I decided to give the presentation created in the previous test to a fresh ChatGPT Agent session and asked it to validate the claims. Agent concluded, "Several quantitative claims—especially those concerning productivity/innovation impacts, the size and growth of the gig economy, rates of side‑gig participation, and the influence of politics and culture—could not be verified with accessible evidence during this review."

Agent provided a detailed analysis of each assertion. I've summarized the results below.

Adoption timeline: Mostly confirmed
Global comparison: Confirmed
Workforce composition: Confirmed
Migration: Confirmed
Mobility of remote workers: Confirmed
Housing & local economies: Confirmed
Office vacancy & environmental impacts: Mostly confirmed
Social connections & wellbeing: Partly confirmed
Employer attitudes & return‑to‑office mandates: Mostly confirmed
Employee preferences & pay cuts: Mostly confirmed
Productivity & innovation: Partly confirmed
Gig economy & freelancing: Unverified
Freelancing motivations & challenges: Not strictly factual claims
Side gigs & multiple jobs: Unverified
Demographics & equity: Partly confirmed / mixed
Political & cultural influences: Partly confirmed / mostly unverified
Other factors & policy landscape: Generally accurate but qualitative

As you can see, of the 17 data points, Agent considered only five to be fully confirmed. Contrast this with how GPT-4o analyzed the results. When GPT-4o was given the same PowerPoint deck, it considered all assertions to be confirmed. You can see GPT-4o's detailed results here.

Even though I used the AI to validate the AI, I probably wouldn't be comfortable using any of the presumed facts in my work without personal, Mark I Eyeball confirmation. Still, it was a fun exercise, and fascinating to see how different the results were between ChatGPT Agent and GPT-4o.

You can watch a replay of the entire session here.

8. Analyze building code for fence installation

Understanding of the problem: Solid
Execution: Pretty close to perfect
Hallucination: None. It got all but one graphic just right
Processing time: 4 minutes

Back when we lived in Palm Bay, Florida, we lived on a corner property. The house came with what could only charitably be called a fence. We needed to replace it, and since we wanted privacy, we wanted to see just how much fence we could legally install. Over the course of a couple of years, I spent a ton of time going back and forth with the planning office in an effort to understand both what I could do with a fence and what other alternatives might be available to me.

Since I have a lot of history with this project and am very familiar with Palm Bay codes (even years after moving away), I decided to point ChatGPT Agent at the problem. It took all of four minutes to provide a detailed, accurate analysis. It even created working diagrams that illustrated the options. Based on my experience, I know the results to be accurate.

Screenshot by David Gewirtz/ZDNET

ChatGPT Agent produced output that could be used to take this project to the next step.
Back when I lived in Palm Bay, the equivalent probably took me 20 calls, a ton of emails, and a few visits to City Hall to come up with options. The level of presentation and organization I came up with wasn't even close. If Agent can up its game elsewhere to be on a par with this test, then it will have some legs.

You can watch a replay of the entire session here.

What's it all mean?

Well, it sure as heck isn't sentient yet. At best, it's like that administrative assistant you hired because your mom said you had to hire her cousin's unemployable slacker kid. There are occasional flashes of brilliance, but mostly the output seems like the result of both aggressively following directions and purposely inventing alternative facts.

Is it worth $200/month for the Pro plan? Not for Agent. At least not yet. Agent is unreliable and generally performs fairly poorly. In a year or so, I'm sure it will get better. But now? No. The only reason to spend $200 a month on it is to do what I'm doing: testing it to see where the technology is today.

Stay tuned, because despite all the inaccuracies and problem areas, this definitely shows where AI technology could go. Of course, if a web-browsing AI agent is the future, and all the content sites out there block it because AI is stealing our content, then we'll have a very interesting problem.

Also: I'm an AI tools expert, and these are the only two I pay for (plus three I'm considering)

It's early days, folks. Whether this is a technology that will be a boon to all humanity or a technology that destroys the internet and kills us in our sleep remains to be seen. But hey, in the meantime, I and the rest of the ZDNET team will be trying to make sense of it all for you. So keep coming back. We'll have more to tell you. I'll be tinkering with Agent, and I'm sure I'll have more to say as well.

Have you tried ChatGPT Agent yet? If so, did it follow your instructions accurately or veer off into its own interpretation of the task? Did it hallucinate or hit the mark? How do you feel about giving AI tools access to your files, accounts, or browser? Are you seeing more value in this kind of automation, or are you still waiting for it to become useful? Let us know in the comments below.

You can follow my day-to-day project updates on social media. Be sure to subscribe to my weekly update newsletter, and follow me on Twitter/X at @DavidGewirtz, on Facebook at Facebook.com/DavidGewirtz, on Instagram at Instagram.com/DavidGewirtz, on Bluesky at @DavidGewirtz.com, and on YouTube at YouTube.com/DavidGewirtzTV.