Society's Backend · January 3
3 Key Principles for AI at Scale [Part 2]

 

This article takes a deep dive into the three pillars critical to machine learning at scale: scalability, efficiency, and velocity. It stresses that simply having enough compute is not sufficient; how effectively those resources are used matters just as much. Scalability ensures the machine learning experimentation process can actually be completed, efficiency focuses on maximizing resource utilization, and velocity aims to optimize experimental output. The three reinforce one another and together determine success in machine learning. The article also notes that companies that excel at all three will stand out in the AI race.

🚀 **Scalability**: This is the foundation of AI at scale: having enough compute and infrastructure to train many models and test multiple improvement hypotheses. It is not only about raw compute power, but also about being able to scale model experimentation, including scheduling training and running models through the data, training, and serving pipelines. Reliable model training is essential at scale, since failed experiments reduce the utilization of compute resources.

💡 **Efficiency**: After scaling up training, the next step is making the process as efficient as possible. The key is understanding where the system is inefficient and tracking metrics such as hardware utilization and failure rates so that improvements can be made. Resource utilization, such as hardware "uptime", is the first metric to track; optimizing resource allocation and model design reduces duplicated work. Methods such as knowledge distillation or transfer learning improve data efficiency, reducing resource consumption while improving model performance.

⏱️ **Velocity**: Velocity aims to maximize a system's experimental output: optimizing the time modelers spend on experiments and maximizing the number of valuable experiments run within a given period. The primary metrics include end-to-end experiment latency, the throughput of experiments completed within a period, and the number of experiments that can be handled in a given amount of time. These metrics reveal where the system is inefficient and support trade-offs that optimize overall system performance.

This is the second article in a two-part series and assumes the background covered in the first. If you haven’t read it yet, check out A Fundamental Overview of Machine Learning Experimentation [Part 1]. Subscribe so you don’t miss the next article in this series.

Subscribe now

Don’t forget to follow me on X and LinkedIn for more frequent tidbits about machine learning engineering and all things AI.


This is an article I’ve wanted to write for a looooong time because this is exactly what I work on at Google. Everyone knows compute is a necessity for effective machine learning, but too few people realize it’s the bare minimum.

I’m going to get into the three fundamental building blocks for successful machine learning at scale:

    Scalability

    Efficiency

    Velocity

These are all crucial aspects required for effective machine learning experimentation. Let’s jump in!

Scalability

This is the simplest principle and is easily understood after reading part 1. The first key principle for AI at scale is actually being able to complete the machine learning experimentation process.

This means having the compute and necessary infrastructure to train many models to test multiple hypotheses for model improvements. When most people discuss scaling up AI, they only mention having more compute. In reality, that is only the first step to achieve AI at scale.

The important question regarding scalability is: “Can I train enough models to experiment effectively?”

The first step in this process is getting enough compute and the second step is ensuring you can scale up model experimentation with the compute you have. This includes scheduling training and having the infrastructure to run models through the data, training, and serving pipelines.

It’s important to note that scalability also includes reliability. Reliable model training is required to scale up AI; otherwise, failed experiments will diminish the scale of experimentation for a given amount of compute.
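The cost of unreliability is easy to quantify. Here is a minimal sketch (the function name and the numbers are illustrative, not from the article):

```python
def experiment_yield(scheduled, failure_rate, hours_per_run):
    """Expected number of successful experiments, plus compute-hours lost to failures."""
    successes = scheduled * (1.0 - failure_rate)
    wasted_hours = scheduled * failure_rate * hours_per_run
    return successes, wasted_hours

# 200 scheduled runs at a 25% failure rate, 8 GPU-hours each:
# only 150 usable results, with 400 GPU-hours burned on failed runs.
print(experiment_yield(200, 0.25, 8.0))  # → (150.0, 400.0)
```

Cutting the failure rate recovers experimentation scale directly, without buying any additional compute.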

Efficiency

Now that you’ve scaled your training up enough to accommodate the ML experimentation process, the next step is making that process as efficient as possible. The question you can ask yourself regarding efficiency is: “How well am I using my resources?” or “How can we get more output per dollar spent?”

The key for maximizing efficiency is understanding how inefficient your system is in a way that allows for identifying improvements. You might have read that and thought, “Well, duh,” but there’s a huge difference between knowing a system is inefficient because training constantly fails and knowing a system is inefficient because you’re meticulously tracking hardware usage and failure rates—one allows for improvement and the other doesn’t. Put more succinctly: Metrics that aren’t tracked won’t be improved.

This is especially true for machine learning: because of the scale required, an efficient experimentation system involves a large number of trackable metrics.

The first metric to track is resource utilization. This tracks hardware “uptime”, or the amount of time compute is actually being used. Idle GPUs are wasted resources.
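As an illustration of what utilization tracking looks like, here is a minimal sketch that computes the busy fraction of a tracking window from job intervals. The data model is an assumption for the example; real schedulers expose much richer telemetry:

```python
from datetime import datetime, timedelta

def utilization(busy_intervals, window_start, window_end):
    """Fraction of the tracking window during which the hardware was busy.

    busy_intervals: list of (start, end) datetime pairs for completed jobs.
    """
    busy = sum((end - start for start, end in busy_intervals), timedelta())
    return busy / (window_end - window_start)

day = datetime(2025, 1, 3)
jobs = [(day, day.replace(hour=9)),                    # overnight training run
        (day.replace(hour=14), day.replace(hour=17))]  # afternoon sweep
print(utilization(jobs, day, day + timedelta(days=1)))  # → 0.5
```

In practice you would slice this per accelerator, per team, and per pipeline stage to find where idle time concentrates.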

Tracking resource utilization gives insight into where the system can be optimized.

One example of an efficiency improvement: track metrics to understand data efficiency, then use methods such as knowledge distillation (learn more about it here) or transfer learning (further training a pre-trained model to perform a different task) to reduce resource usage while improving model performance.
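To make the distillation idea concrete, here is a minimal, dependency-free sketch of the soft-label term used in knowledge distillation: the KL divergence between temperature-softened teacher and student distributions. The logits are made up, and in real training this term is typically scaled by T² and combined with the ordinary hard-label loss:

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-softened softmax over a list of logits."""
    exps = [math.exp(z / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) over temperature-softened distributions."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return sum(t * math.log(t / s) for t, s in zip(teacher, student))

# A student that matches its teacher exactly incurs zero loss.
print(distillation_loss([2.0, 0.5, -1.0], [2.0, 0.5, -1.0]))  # → 0.0
```

The smaller student trains against the teacher's full output distribution rather than one-hot labels, which is where the data- and compute-efficiency gains come from.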

The important thing to note about efficiency is it allows companies to compete. Companies won’t last if they aren’t able to scale and efficiently experiment.

I’ve said it many times: efficiency will be a huge talking point for AI in 2025. Doing more with less will be important not just for cost savings, but also for improving model performance. See my post on X about it.

Velocity

The third principle for machine learning at scale is velocity. It aims to maximize experimentation output from a system.

Examples of questions velocity metrics aim to answer are: How much of a modeler’s time is spent on experimentation? How many valuable experiments can we run in a given period?

There are so many factors that play into these questions. Solving them is difficult and open-ended. For example, there can be different answers to these questions based on the day of the week or based on team or company culture. These questions are also only examples and many velocity-based questions are team- or company-dependent.

Thus, velocity metrics and improvements are less well-defined than the other two principles. Tracking and improving velocity can require cooperation between everyone involved in the experimentation process because many of the questions relate to the people involved and situations that arise when experimenting.

This means there are many metrics that can be tracked to understand ML experimentation velocity, but those metrics depend on the specifics of the experimentation process. However, there are three primary metrics that should be defined and tracked to understand velocity regardless of situation: the end-to-end latency of an experiment, the throughput of experiments completed within a given period, and the number of experiments the system can handle in a given amount of time.

These three metrics sliced by the same factors mentioned in the efficiency section give further insight into system inefficiencies that may not be directly apparent when investigating resource utilization. These are especially helpful when combined with scalability and efficiency metrics from the above sections. They also provide a basis for making trade-offs (sacrificing performance in certain metrics to improve others) to optimize overall system performance.
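As a sketch of the latency and throughput side of these metrics, here is a minimal example computed from experiment start/end timestamps. The log format is an assumption made for illustration:

```python
from datetime import datetime, timedelta

def velocity_metrics(experiments, window):
    """Mean end-to-end latency per experiment and throughput per day.

    experiments: list of (start, end) datetime pairs; window: tracking period.
    """
    latencies = [end - start for start, end in experiments]
    mean_latency = sum(latencies, timedelta()) / len(latencies)
    throughput = len(experiments) / (window / timedelta(days=1))
    return mean_latency, throughput

t0 = datetime(2025, 1, 1)
runs = [(t0, t0 + timedelta(hours=6)),
        (t0 + timedelta(hours=2), t0 + timedelta(hours=12)),
        (t0 + timedelta(days=1), t0 + timedelta(days=1, hours=8))]
lat, tput = velocity_metrics(runs, window=timedelta(days=2))
print(lat, tput)  # mean latency of 8:00:00 and 1.5 experiments/day
```

Slicing these numbers by team, model family, or hardware pool is what surfaces the bottlenecks worth trading off against.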

An important note is velocity enables companies to outcompete. If a company can maximize its human resources (such as modelers) and optimize the entire experimentation process it’ll be able to develop AI faster than competitors.


That’s all for this article! I was going to tack on a part three about how the above principles can help you understand which companies will win the AI race, but I decided to post that on X instead. If you’d prefer a written post about it, let me know.

If you’re interested in machine learning engineering topics, don’t forget to subscribe to get these articles directly in your inbox.

Subscribe now

See you next time!

Always be (machine) learning,

Logan
