
The results for the Survey on Production MLOps are out: ethical.institute/state-of-ml-2025 🚀🚀🚀 As part of the release we have updated the interface, enabling real-time toggling between 2024 and 2025 data, and refreshed it with a cool new code-editor theme 😎 Check it out and share it around!

Further insights: In 2025, the top challenges organisations face in Production ML are:


  • ML System Monitoring 16% (+2% YoY)
  • Access to Data 14% (+2% YoY)
  • Data & ML Pipelines 13% (+2% YoY)
  • Training/Experimentation Env Parity 12% (+1% YoY)
  • ML Governance 8% (+2% YoY)
  • Security 8% (+6% YoY)
  • It is super interesting to see the big jump in ML Security from 2% all the way to 8%, as well as the continued rise of ML Monitoring as the number one challenge, which reflects the growing awareness and maturity of organisations in understanding the implications and risks of production ML systems. 
  • It is also interesting to see that Gaps in ML Tooling, which was second on last year's list at 12%, is down to 8% this year and out of the top five, which reflects how choice of tooling is no longer the blocker; instead, the challenge is cohesive integration and robust productionisation. 
  • Finally, Engineering Talent is also slowly declining, at 8% this year vs 10% in 2024, which again suggests the skills gap is slowly closing. 
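As a quick back-of-the-envelope check of the figures above, subtracting each YoY delta from the 2025 share recovers the implied 2024 share (e.g. Security: 8% - 6% = 2%, matching the jump from 2% mentioned above). A small sketch:

```python
# Implied 2024 shares from the 2025 survey figures: 2024 = 2025 share - YoY delta.
challenges_2025 = {  # challenge: (2025 share %, YoY delta %)
    "ML System Monitoring": (16, 2),
    "Access to Data": (14, 2),
    "Data & ML Pipelines": (13, 2),
    "Training/Experimentation Env Parity": (12, 1),
    "ML Governance": (8, 2),
    "Security": (8, 6),
}

shares_2024 = {name: share - delta for name, (share, delta) in challenges_2025.items()}

print(shares_2024["Security"])  # 2
```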


The top challenges in production ML in 2025 do seem to reflect some of the qualitative trends that we see in our day-to-day work. If you want to dive deeper, you can access the full results here: https://ethical.institute/state-of-ml-2025 🔥

Issue #365 🤖 


Thank you for being part of the over 70,000 ML professionals and enthusiasts who receive weekly articles & tutorials on Machine Learning & MLOps 🤖 You can join the newsletter here: https://bit.ly/state-of-ml-2025 ⭐

 

If you like the content please support the newsletter by sharing with your friends via ✉️ Email, 🐦 Twitter, 💼 LinkedIn and 📕 Facebook!


This week in ML Engineering:

 

  • PyTorch Hardware Acceleration in 2025
  • Anthropic MCP now in Linux Foundation
  • The State of MLOps 2025 Survey 🔥
  • OpenAI Enterprise AI Report
  • AI Eats the World 2025
  • Open Source ML Frameworks
  • Awesome AI Guidelines to check out this week
  • + more 🚀

PyTorch Hardware Acceleration in 2025


Next year may bring the critical shift from NVIDIA CUDA towards AMD ROCm and other GPUs for ML compute; the 2025 State of PyTorch Hardware Acceleration Report has some interesting insights. PyTorch has become so ubiquitous that it can provide a practical assessment of the maturity of hardware (GPU) accelerators. One interesting insight is that the key differentiator is not just FLOPs, but largely the maturity of the software platform support across the end-to-end PyTorch 2.x runtime stack. It is clear that NVIDIA CUDA (i.e. H100/Blackwell) is still the operational gold standard for production training and serving because it has the most mature compiler path, the widest kernel/operator coverage, and the least debugging/install friction (oh hi AMD!). AMD ROCm has improved substantially and is becoming a differentiator as it's cost-effective, but anyone who has used it knows how painful the setup actually is; getting a smooth user experience is often easier said than done, and NVIDIA has a significant advantage in both time/experience and developer ecosystem maturity. However, as we approach 2026 there are quite a few options arising, such as Google TPUs, Apple Silicon (MPS), and many others that are supported through low-level frameworks such as Vulkan (one of which we maintain under Vulkan Kompute!).
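In practical PyTorch code this multi-backend story usually shows up as a device-selection fallback. The helper below is our own illustrative sketch of that priority order (CUDA/ROCm first, then Apple MPS, then CPU); in real code the flags would come from torch.cuda.is_available() (which is true for both CUDA and ROCm builds) and torch.backends.mps.is_available():

```python
# Hypothetical sketch of the backend fallback order: CUDA/ROCm, then MPS, then CPU.
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Return the device string a PyTorch script would typically target."""
    if cuda_available:
        return "cuda"  # NVIDIA CUDA, or AMD ROCm (which reuses the 'cuda' device name)
    if mps_available:
        return "mps"   # Apple Silicon via Metal Performance Shaders
    return "cpu"       # portable fallback that always works

print(pick_device(cuda_available=False, mps_available=True))  # mps
```

The interesting point is that the ROCm build deliberately answers to the "cuda" device name, which is exactly the kind of software-compatibility work the report argues matters more than raw FLOPs.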

Anthropic MCP now in Linux Foundation

 

The need for a unified + scalable standard to bridge agentic systems is only becoming more critical, which is why I am quite excited to see Anthropic donate MCP to the Linux Foundation! We know that MCP as a protocol has quite a lot of flaws, not just in terms of functionality but also in terms of security, robustness, etc. However, I am a believer that, more often than not, a bad standard can be better than no standard. There are indeed many alternatives to MCP emerging, especially given how fast the domain is moving, but I do believe it is likely easier to evolve towards an MCP 2.0 protocol with the backing of the LF than to try to standardise on something new that is not as widely adopted as a base. I have also been following how the standards for the protocol have evolved through the proposal/review processes for new features, and I keep being surprised by the pace of innovation (some good, some not so good, but arguably in the right direction). This is certainly an interesting space to keep an eye on, as it is clear it will continue to evolve at an accelerated pace!
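For a sense of what is actually being standardised here: MCP is built on JSON-RPC 2.0, so a client invoking a server-side tool sends a request along the lines of the sketch below. The tool name and arguments are made up for illustration:

```python
# Rough sketch of an MCP-style tool invocation on the wire (JSON-RPC 2.0).
import json

request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",               # MCP method for invoking a server tool
    "params": {
        "name": "get_weather",            # hypothetical tool exposed by a server
        "arguments": {"city": "London"},  # tool-specific arguments
    },
}

wire = json.dumps(request)  # what actually travels over the transport (stdio, HTTP, ...)
print(json.loads(wire)["method"])  # tools/call
```

Much of the evolution debate is about what sits around messages like this: auth, capability negotiation, transports, and so on, rather than the envelope itself.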

The State of MLOps 2025 Survey 🔥

 

In 2025, the top challenges organisations face in Production ML are: ML System Monitoring 16% (+2% YoY), Access to Data 14% (+2% YoY), Data & ML Pipelines 13% (+2% YoY), Training/Experimentation Env Parity 12% (+1% YoY), ML Governance 8% (+2% YoY), and Security 8% (+6% YoY). It is super interesting to see the big jump in ML Security from 2% all the way to 8%, as well as the continued rise of ML Monitoring as the number one challenge, which reflects the growing awareness and maturity of organisations in understanding the implications and risks of production ML systems. It is also interesting to see that Gaps in ML Tooling, which was second on last year's list at 12%, is down to 8% this year and out of the top five, which reflects how choice of tooling is no longer the blocker; instead, the challenge is cohesive integration and robust productionisation. Finally, Engineering Talent is also slowly declining, at 8% this year vs 10% in 2024, which again suggests the skills gap is slowly closing. The top challenges in production ML in 2025 do seem to reflect some of the qualitative trends that we see in our day-to-day work. If you want to dive deeper, you can access the full results here: https://ethical.institute/state-of-ml-2025

OpenAI Enterprise AI Report

 

OpenAI released their 2025 State of Enterprise AI Report, bringing together responses from 9,000 workers across 100 companies to assess trends, challenges and opportunities. Some of the outcomes won't come as surprises: one of the main takeaways was that people are moving away from ad-hoc prompting towards scalable/repeatable production workflows. It is no surprise they have also seen a surge in ChatGPT Enterprise message volume, which grew 8x YoY, with usage of workflows via Projects/Custom GPTs rising ~19x YoY. One key question for next year will be the impact Google will have on these numbers, given their aggressive and accelerated momentum in the enterprise space. Another interesting insight: workers report measurable productivity gains - an OpenAI report likely would say so, but as we have seen there are contrasting views from domain to domain, so it will be interesting to explore as this matures in the year to come.

AI Eats the World 2025

 

Ben Evans has released the 2025 edition of his annual "AI Eats the World" keynote, with some interesting insights. We are currently seeing capex (compute) investments forecast at nearly $400 billion in 2025, which as of today is disproportionately larger than the ROI, so a major shift will be required next year. The landscape is also transitioning from model differentiation to model commoditization: performance on general benchmarks is converging, which means that the competitive "moat" moves towards traditional software advantages like distribution, proprietary data, and product UX. We can also assume that the "infinite interns" premise that LLMs theoretically provide will likely trigger a Jevons paradox, namely expanding the volume and complexity of work organizations can handle rather than merely replacing labor.


Upcoming MLOps Events

 

The MLOps ecosystem continues to grow at break-neck speed, making it ever harder for us as practitioners to stay up to date with relevant developments. A fantastic way to keep on top of relevant resources is through the great community and events that the MLOps and Production ML ecosystem offers. This is the reason why we have started curating a list of upcoming events in the space, which are outlined below.

 

Conferences for 2026 coming soon! For the meantime, in case you missed our talks:

 

  • The State of AI in 2025 - WeAreDevelopers 2025
  • Prod Generative AI in 2024 - KubeCon AI Day 2025
  • The State of AI in 2024 - WeAreDevelopers 2024
  • Responsible AI Workshop Keynote - NeurIPS 2021
  • Practical Guide to ML Explainability - PyCon London
  • ML Monitoring: Outliers, Drift, XAI - PyCon Keynote
  • Metadata for E2E MLOps - Kubecon NA 2022
  • ML Performance Evaluation at Scale - KubeCon Eur 2021
  • Industry Strength LLMs - PyData Global 2022
  • ML Security Workshop Keynote - NeurIPS 2022

Open Source MLOps Tools

 

Check out the fast-growing ecosystem of production ML tools & frameworks in the GitHub repository, which has reached over 10,000 GitHub stars ⭐. We are currently looking for more libraries to add - if you know of any that are not listed, please let us know or feel free to add a PR. Four featured libraries in the GPU acceleration space are outlined below.

 

  • Kompute - Blazing fast, lightweight and mobile-enabled GPU compute framework optimized for advanced data processing use cases.
  • CuPy - An implementation of a NumPy-compatible multi-dimensional array on CUDA. CuPy consists of the core multi-dimensional array class, cupy.ndarray, and many functions on it.
  • JAX - Composable transformations of Python+NumPy programs: differentiate, vectorize, JIT to GPU/TPU, and more.
  • cuDF - Built on the Apache Arrow columnar memory format, cuDF is a GPU DataFrame library for loading, joining, aggregating, filtering, and otherwise manipulating data.
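As a tiny taste of the "composable transformations" that JAX's blurb above describes, here is a minimal sketch (assuming `jax` is installed; it runs on CPU when no accelerator is present, and on GPU/TPU with no code changes):

```python
# Minimal JAX sketch: grad (differentiate), vmap (vectorize), jit (compile).
import jax
import jax.numpy as jnp

def loss(w):
    # toy quadratic "loss": f(w) = w^2
    return jnp.sum(w ** 2)

grad_loss = jax.grad(loss)     # df/dw = 2w
batched = jax.vmap(grad_loss)  # apply the gradient over a batch of inputs
fast = jax.jit(batched)        # JIT-compile for whichever backend is available

print(fast(jnp.array([1.0, 2.0, 3.0])))  # gradients: [2., 4., 6.]
```

The appeal for the hardware story discussed earlier is that the same three lines of transformations target CUDA, TPU, or CPU without modification.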

 
If you know of any open source tools that are not listed, do give us a heads up so we can add them!


OSS: Policy & Guidelines

 

As AI systems become more prevalent in society, we face bigger and tougher societal challenges. We have seen a large number of resources that aim to tackle these challenges in the form of AI Guidelines, Principles, Ethics Frameworks, etc; however, there are so many resources that it is hard to navigate them all. Because of this we started an Open Source initiative that aims to map the ecosystem to make it simpler to navigate. You can find multiple principles in the repo - some examples include the following:

 

  • MLSecOps Top 10 Vulnerabilities - This is an initiative that aims to further the field of machine learning security by identifying the top 10 most common vulnerabilities in the machine learning lifecycle, as well as best practices.
  • AI & Machine Learning 8 principles for Responsible ML - The Institute for Ethical AI & Machine Learning has put together 8 principles for responsible machine learning that are to be adopted by individuals and delivery teams designing, building and operating machine learning systems.
  • An Evaluation of Guidelines - "The Ethics of Ethics": a research paper that analyses multiple sets of ethics principles.
  • ACM's Code of Ethics and Professional Conduct - This is the code of ethics that was put together in 1992 by the Association for Computing Machinery and updated in 2018.


If you know of any guidelines that are not in the "Awesome AI Guidelines" list, please do give us a heads up or feel free to add a pull request!


About us


The Institute for Ethical AI & Machine Learning is a European research centre that carries out world-class research into responsible machine learning.

Check out our website

 

✉️ Email, 🐦 Twitter, 💼 LinkedIn


You received this email because you are registered with The Institute for Ethical AI & Machine Learning's newsletter "The Machine Learning Engineer"
 

Unsubscribe here

Sent by Brevo

© 2023 The Institute for Ethical AI & Machine Learning