Want smarter insights in your inbox? Sign up for our weekly newsletters to get only what matters to enterprise AI, data, and security leaders. Subscribe Now
AWS seeks to extend its market position with updates to SageMaker, its machine learning and AI model training and inference platform, adding new observability capabilities, connected coding environments and GPU cluster performance management.
However, AWS continues to face competition from Google and Microsoft, which also offer many features that help accelerate AI training and inference.
SageMaker, which transformed into a unified hub for integrating data sources and accessing machine learning tools in 2024, will add features that provide insight into why model performance slows and offer AWS customers more control over the amount of compute allocated for model development.
Other new features include connecting local integrated development environments (IDEs) to SageMaker, so locally written AI projects can be deployed on the platform.
SageMaker General Manager Ankur Mehrotra told VentureBeat that many of these new updates originated from customers themselves.
“One challenge that we’ve seen our customers face while developing Gen AI models is that when something goes wrong or when something is not working as per the expectation, it’s really hard to find what’s going on in that layer of the stack,” Mehrotra said.
SageMaker HyperPod observability enables engineers to examine the various layers of the stack, such as the compute layer or networking layer. If anything goes wrong or models become slower, SageMaker can alert them and publish metrics on a dashboard.
Mehrotra pointed to a real issue his own team faced while training new models, where training code began stressing GPUs, causing temperature fluctuations. He said that without the latest tools, developers would have taken weeks to identify the source of the issue and then fix it.
Connected IDEs
... continue reading