Apple has published three interesting studies that offer some insight into how AI-based development could improve workflows, quality, and productivity. Here are the details.

Software Defect Prediction using Autoencoder Transformer Model

In this study, Apple’s researchers present a new AI model that overcomes the limitations of today’s LLMs (such as “hallucinations, context-poor generation, and loss of critical business relationships during retrieval”) when analyzing large-scale codebases to detect and predict bugs.

The model, called ADE-QVAET, aims to improve the accuracy of bug prediction by combining four AI techniques: Adaptive Differential Evolution (ADE), Quantum Variational Autoencoder (QVAE), a Transformer layer, and Adaptive Noise Reduction and Augmentation (ANRA). In a nutshell, ADE adjusts how the model learns, while QVAE helps it capture deeper patterns in the data. Meanwhile, the Transformer layer keeps track of how those patterns relate to each other, and ANRA cleans and balances the data to keep results consistent.

Interestingly, this is not an LLM that analyzes the code directly. Instead, it works from metrics and data about the code, such as complexity, size, and structure, and searches for patterns that may indicate where bugs are likely to occur.

According to the researchers, these were the results when they measured the model’s performance on a Kaggle dataset made specifically for software bug prediction:

“During training with a 90% training percentage, ADE-QVAET achieves high accuracy, precision, recall, and F1-score of 98.08%, 92.45%, 94.67%, and 98.12%, respectively, when compared to the Differential Evolution (DE) ML model.”

In other words, the model was highly reliable overall and effective at flagging real bugs while avoiding false positives: precision reflects how many of the modules it flags are actually defective, while recall reflects how many of the real defects it manages to catch.

Read the full study on Apple’s Machine Learning Research blog

Agentic RAG for Software Testing with Hybrid Vector-Graph and Multi-Agent Orchestration

This study was conducted by four Apple researchers, three of whom also worked on the ADE-QVAET model. Here, they tackle a second time-consuming task faced by quality engineers: creating and maintaining detailed test plans and test cases for large software projects.

They developed a system that uses LLMs and autonomous AI agents to automatically generate and manage testing artifacts, ranging from test plans to validation reports, while keeping full traceability between requirements, business logic, and results. In other words, they built an AI system that can plan, write, and organize software tests on its own, which could help streamline the workflow of quality engineers, who “spend 30-40% of their time creating foundational testing artifacts, such as test plans, cases, and automation scripts.”

As with ADE-QVAET, the results here were pretty promising:

“The system achieves remarkable accuracy improvements from 65% to 94.8% while ensuring comprehensive document traceability throughout the quality engineering lifecycle. Experimental validation of enterprise Corporate Systems Engineering and SAP migration projects demonstrates an 85% reduction in testing timeline, an 85% improvement in test suite efficiency, and projected 35% cost savings, resulting in a 2-month acceleration of go-live.”

That said, the researchers also note that the framework has limitations, including the fact that their work focused only on “Employee Systems, Finance, and SAP environments,” which limits how well it generalizes. For a rough sense of the shape of such a pipeline, see the sketch below.

Read the full study on Apple’s Machine Learning Research blog
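To make the idea of agent-driven, traceable test generation a bit more concrete, here is a deliberately simplified Python sketch. It is not Apple’s implementation: the names (Requirement, TestCase, retrieve_context, plan_tests) are invented for illustration, the LLM agents are replaced by plain functions, and the hybrid vector-graph retrieval is stood in for by naive keyword matching. The one idea it preserves is that every generated test case carries an explicit link back to the requirement it covers.

```python
# Hypothetical sketch of an agentic test-generation pipeline with traceability.
# All names here are invented for illustration; the paper's system uses LLM
# agents and a hybrid vector-graph store, which are mocked with plain Python.
from dataclasses import dataclass, field


@dataclass
class Requirement:
    req_id: str
    text: str


@dataclass
class TestCase:
    case_id: str
    description: str
    traces_to: list[str] = field(default_factory=list)  # requirement IDs


def retrieve_context(query: str, requirements: list[Requirement]) -> list[Requirement]:
    """Stand-in for hybrid vector-graph retrieval: naive keyword overlap."""
    terms = set(query.lower().split())
    return [r for r in requirements if terms & set(r.text.lower().split())]


def plan_tests(feature: str, requirements: list[Requirement]) -> list[TestCase]:
    """Stand-in for a 'test planner' agent: one test case per relevant requirement."""
    relevant = retrieve_context(feature, requirements)
    return [
        TestCase(
            case_id=f"TC-{i + 1:03d}",
            description=f"Verify that: {req.text}",
            traces_to=[req.req_id],  # keep the requirement -> test link explicit
        )
        for i, req in enumerate(relevant)
    ]


if __name__ == "__main__":
    reqs = [
        Requirement("REQ-1", "payroll export completes within 5 minutes"),
        Requirement("REQ-2", "SAP migration preserves employee records"),
    ]
    for case in plan_tests("SAP migration of employee payroll records", reqs):
        print(case.case_id, "->", case.traces_to, ":", case.description)
```

In the study’s actual system, each of these steps is handled by a dedicated LLM agent and retrieval runs over a combined vector and graph store rather than keyword overlap, but the traceability bookkeeping works on the same principle.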
Training Software Engineering Agents and Verifiers with SWE-Gym

This is perhaps the most interesting and ambitious of the three studies. While the two previous studies focused on predicting where bugs are likely to appear and on automating how software is tested and validated, the idea behind SWE-Gym is to train AI agents that can actually fix bugs by learning to read, edit, and verify real code.

SWE-Gym was built from 2,438 real-world Python tasks drawn from 11 open-source repositories, each with an executable environment and test suite, so that agents can practice writing and debugging code in realistic conditions. The researchers also developed SWE-Gym Lite, a subset of 230 simpler and more self-contained tasks designed to make training and evaluation faster and less computationally expensive.

According to the study, agents trained with SWE-Gym correctly solved 72.5% of the tasks, outperforming previous benchmarks by more than 20 percentage points. Meanwhile, SWE-Gym Lite reduced training time by almost half compared to the full setup while delivering similar results. The trade-off is that the Lite variant includes far fewer and much simpler coding tasks, which makes it less useful for evaluating models on larger, more complex problems. The toy sketch at the end of this post illustrates the core loop behind this kind of execution-verified training.

Read the full study on Apple’s Machine Learning Research blog
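To illustrate what “an executable environment and test suite” contributes, here is a toy Python sketch of execution-based verification. In the real benchmark the agent edits a full repository and success is judged by the project’s own tests; here the repository is a single buggy function and the agent is a placeholder. Task, propose_patch, and run_tests are invented names for illustration, not part of SWE-Gym.

```python
# Hypothetical sketch of execution-based verification in a SWE-Gym-style setup.
# A real task ships a whole repository plus its test suite; here the "repo" is
# one buggy function and the "agent" is a placeholder returning a fixed patch.
from dataclasses import dataclass
from typing import Callable


@dataclass
class Task:
    task_id: str
    buggy_source: str             # code the agent must fix
    check: Callable[[dict], bool] # executable check, standing in for the test suite


def propose_patch(task: Task) -> str:
    """Placeholder for the trained agent: returns candidate replacement code."""
    return task.buggy_source.replace("a - b", "a + b")


def run_tests(source: str, task: Task) -> bool:
    """Execute the candidate code in a fresh namespace and run the task's check."""
    namespace: dict = {}
    exec(source, namespace)  # a real setup runs the repo's tests in an isolated env
    return task.check(namespace)


if __name__ == "__main__":
    task = Task(
        task_id="demo-001",
        buggy_source="def add(a, b):\n    return a - b\n",
        check=lambda ns: ns["add"](2, 3) == 5,
    )
    patched = propose_patch(task)
    print("resolved" if run_tests(patched, task) else "unresolved")
```

The point of the sketch is the shape of the loop: propose a change, execute it, and let the tests decide whether the task counts as resolved.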