The GitHub Blog · December 18, 2024
Breaking down CPU speed: How utilization impacts performance

Through experimentation, the GitHub Performance Engineering team has observed the significant impact of CPU utilization on system performance. As CPU utilization rises, latency increases as well, which presents an opportunity to optimize system efficiency. The experiments show that different CPU models behave differently under different loads. Digging deeper, the team found that Intel's Turbo Boost and Hyper-Threading technologies come into play as CPU utilization changes. In addition, CPU utilization that is either too high or too low leads to wasted resources or degraded performance. It is therefore critical to find the "Golden Ratio" of CPU utilization: high enough to stay efficient while avoiding a significant drop in performance. Through data analysis, a mathematical model can be built to determine this threshold and optimize accordingly.

📈 Rising CPU utilization increases latency and degrades system performance. The team's experiments showed that different CPU models perform differently under different loads and call for targeted optimization.

⚡️ Intel's Turbo Boost raises CPU frequency at low CPU utilization, lowering latency; as utilization climbs, thermal limits force the frequency down and performance degrades. At high CPU utilization, Hyper-Threading also cannot make full use of the virtual cores, which further hurts performance.

🎯 Finding the "Golden Ratio" of CPU utilization is critical: utilization high enough to stay efficient without a significant drop in performance. Utilization that is either too high or too low wastes resources or degrades performance. Data analysis can be used to build a mathematical model that determines this threshold and drives the corresponding optimization.

Introduction ⛵

The GitHub Performance Engineering team regularly conducts experiments to observe how our systems perform under varying load conditions. A consistent pattern in these experiments is the significant impact of CPU utilization on system performance. We’ve observed that as CPU utilization rises, it can lead to increased latency, which provides an opportunity to optimize system efficiency. Addressing this challenge allows us to maintain performance levels while reducing the need for additional machines, ultimately preventing inefficiencies.

Although we recognized the correlation between higher CPU utilization and increased latency, we saw an opportunity to explore the specific thresholds and impacts at various stages in greater detail. With a diverse set of instance types powered by different CPU families, we focused on understanding the unique performance characteristics of each CPU model. This deeper insight empowered us to make smarter, data-driven decisions, enabling us to provision our infrastructure with greater efficiency and confidence.

With these goals in mind, we embarked on a new journey of exploration and experimentation to uncover these insights.

Experiment setup

Collecting accurate data for this type of experiment was no easy feat. We needed to gather data from workloads that were as close to our production as possible, while also capturing how the system behaves under different phases of load. Since CPU usage patterns vary across workloads, we focused primarily on our flagship workloads. However, increasing the load could introduce small performance discrepancies, so our goal was to minimize disruption for our users.

Fortunately, a year ago, the Performance Engineering team developed an environment designed to meet these requirements, codenamed Large Unicorn Collider (LUC). This environment operates within a small portion of our Kubernetes clusters, mirroring the same architecture and configuration as our flagship workloads. It also has the flexibility to be hosted on dedicated machines, preventing interference from or with other workloads. Typically, the LUC environment remains idle, but when needed, we can direct a small, adjustable amount of traffic towards it. Activating or deactivating this traffic takes only seconds, allowing us to react quickly if performance concerns arise.

To accurately assess the impact of CPU utilization, we first established a baseline by sending moderate production traffic to a LUC Kubernetes pod hosted on one of its dedicated machines. This provided us with a benchmark for comparison. Importantly, the number of requests handled by the LUC pods remained constant throughout the experiment, ensuring consistent CPU load over time.

Once the baseline was set, we gradually increased CPU utilization using a tool called “stress,” which artificially occupies a specified number of CPU cores by running random processing tasks. Each instance type has a different number of CPU cores, so we adjusted the steps accordingly. However, the common factor across all instances was the total CPU utilization.

Note: It’s important to recognize that this is not a direct 1:1 comparison to the load generated by actual production workloads. The stress tool continuously runs mathematical operations, while our production workloads involve I/O operations and interrupts, which place different demands on system resources. Nevertheless, this approach still offers valuable insights into how our CPUs perform under load.
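To make the loading step concrete, here is a minimal sketch of how a fixed number of cores can be kept busy with synthetic arithmetic, similar in spirit to what the stress tool's CPU workers do. The worker counts, step size, and durations below are illustrative parameters, not the values used in our experiments.

```python
# Minimal sketch: keep N cores busy with arithmetic, approximating what the
# "stress" tool does with its CPU workers. Worker counts and durations are
# illustrative only, not the experiment's settings.
import multiprocessing
import random
import time


def burn_cpu(seconds: float) -> None:
    """Run arithmetic in a tight loop for the given duration."""
    deadline = time.monotonic() + seconds
    x = random.random()
    while time.monotonic() < deadline:
        x = (x * 1664525 + 1013904223) % 2**32  # arbitrary busy work


def occupy_cores(n_cores: int, seconds: float) -> None:
    """Spawn one busy-loop process per core to raise total CPU utilization."""
    workers = [
        multiprocessing.Process(target=burn_cpu, args=(seconds,))
        for _ in range(n_cores)
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()


if __name__ == "__main__":
    # Step the load up gradually, e.g. two cores at a time (hypothetical step size).
    for n in range(2, multiprocessing.cpu_count() + 1, 2):
        occupy_cores(n, seconds=60)
```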

With the environment set up and our plan in place, we proceeded to collect as much data as possible to analyze the impact.

Results

With our experiment setup finalized, let’s examine the data we gathered. As previously mentioned, we repeated the process across different instance types. Each instance type showed unique behavior and varying thresholds where performance started to decline.

As anticipated, CPU time increased for all instance types as CPU utilization rose. The graph below illustrates the CPU time per request as CPU utilization increases.

CPU time per request vs CPU utilization

The latency differences between instance types are expected due to the variations in CPU models. Focusing on the percentage increase in latency may provide more meaningful insights.

Latency percentage increase vs CPU utilization

In both graphs, one line stands out by deviating more than the others. We’ll examine this case in detail shortly.

Turbo Boost effect

An interesting observation is how CPU frequency changes as utilization increases, which can be attributed to Intel’s Turbo Boost Technology. Since all the instances we used are equipped with Intel CPUs, the impact of Turbo Boost is noticeable across all of them. In the graph below, you can see how the CPU frequency decreases as CPU utilization increases; the red arrows indicate the CPU utilization level.

CPU Cores Frequency

When CPU utilization remains at lower levels (around 30% or below), we benefit from increased core frequencies, leading to faster CPU times and, consequently, lower overall latency. However, as the demand for more CPU cores rises and utilization increases, we are likely to reach the CPU’s thermal and power limits, causing frequencies to decrease. In essence, lower CPU utilization results in better performance, while higher utilization leads to a decline in performance. For instance, a workload running on a specific node with approximately 30% CPU utilization will report faster response times compared to the same workload on the same VM when CPU utilization exceeds 50%.
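One rough way to observe this effect on a Linux node is to sample the per-core frequency reported by the cpufreq sysfs interface while load changes. The sketch below assumes a Linux host exposing scaling_cur_freq under /sys; paths and availability depend on the cpufreq driver, and this is not necessarily how we collected our data.

```python
# Minimal sketch: sample per-core frequencies from the Linux cpufreq sysfs
# interface. Assumes /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq
# exists (this depends on the cpufreq driver in use).
import glob


def core_frequencies_mhz() -> dict:
    freqs = {}
    for path in sorted(glob.glob("/sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq")):
        core = path.split("/")[5]  # e.g. "cpu0"
        with open(path) as f:
            freqs[core] = int(f.read().strip()) / 1000  # kHz -> MHz
    return freqs


if __name__ == "__main__":
    for core, mhz in core_frequencies_mhz().items():
        print(f"{core}: {mhz:.0f} MHz")
```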

Hyper-Threading

Variations in CPU frequency are not the only factors influencing performance changes. All our nodes have Hyper-Threading enabled, an Intel technology that allows a single physical CPU core to operate as two virtual cores. Although there is only one physical core, the Linux kernel recognizes it as two virtual CPU cores. The kernel attempts to distribute the CPU load across these cores, aiming to keep only one hardware thread (virtual core) busy per physical core. This approach is effective until we reach a certain level of CPU utilization. Beyond this threshold, we cannot fully utilize both virtual CPU cores, resulting in reduced performance compared to normal operation.
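To see how logical (virtual) CPUs map onto physical cores on a Linux node, the CPU topology information in sysfs can be inspected. This is a minimal sketch assuming the standard /sys/devices/system/cpu layout; it simply groups sibling hardware threads, which is what the kernel's scheduler works with when spreading load.

```python
# Minimal sketch: group logical CPUs by physical core using the Linux CPU
# topology in sysfs. Logical CPUs grouped on the same line share one physical
# core (Hyper-Threading). Assumes the standard sysfs layout.
import glob
from collections import defaultdict


def siblings_by_core() -> dict:
    groups = defaultdict(list)
    for path in glob.glob("/sys/devices/system/cpu/cpu[0-9]*/topology/core_id"):
        cpu = int(path.split("/")[5].removeprefix("cpu"))
        with open(path) as f:
            core_id = int(f.read().strip())
        with open(path.replace("core_id", "physical_package_id")) as f:
            package_id = int(f.read().strip())
        groups[(package_id, core_id)].append(cpu)
    return groups


if __name__ == "__main__":
    for (pkg, core), cpus in sorted(siblings_by_core().items()):
        print(f"package {pkg} core {core}: logical CPUs {sorted(cpus)}")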

Finding the “Golden Ratio” of CPU utilization

Underutilized nodes lead to wasted resources, power, and space in our data centers, while nodes that are excessively utilized also create inefficiencies. As noted, higher CPU utilization results in decreased performance, which can give a misleading impression that additional resources are necessary, resulting in a cycle of over-provisioning. This issue is particularly pronounced with blocking workloads that do not follow an asynchronous model. As CPU performance deteriorates, each process can manage fewer tasks per second, making existing capacity inadequate. To achieve the optimal balance—the “Golden Ratio” of CPU utilization—we must identify a threshold where CPU utilization is sufficiently high to ensure efficiency without significantly impairing performance. Striving to keep our nodes near this threshold will enable us to utilize our current hardware more effectively alongside our existing software.

Since we already have experimental data demonstrating how CPU time increases with rising utilization, we can develop a mathematical model to identify this threshold. First, we need to determine what percentage of CPU time degradation is acceptable for our specific use case. This may depend on user expectations or performance Service Level Agreements (SLAs). Once we establish this threshold, it will help us select a level of CPU utilization that remains within acceptable limits.

We can plot CPU utilization against CPU time (latency) and find the point where the percentage increase in CPU time reaches the maximum degradation we consider acceptable.

A specific example derived from the data above can be illustrated in the following graph.

Percentage Increase in P50 Latency vs CPU Utilization

In this example, we aim to achieve less than 40% CPU time degradation, which would correspond to a CPU utilization of 61% on the specific instance.
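One simple way to turn measurements like these into a provisioning threshold is to interpolate along the measured (utilization, latency-increase) curve and solve for the utilization at the acceptable degradation budget. The sketch below uses hypothetical sample points shaped like the curves above; the 40% budget matches the example, but the data itself is illustrative, not measured.

```python
# Minimal sketch: given points of (CPU utilization %, % increase in CPU time
# per request), find the utilization where degradation hits an acceptable
# budget via linear interpolation. The sample points are illustrative only.
samples = [  # (utilization %, latency increase %)
    (30, 0.0),
    (40, 8.0),
    (50, 20.0),
    (60, 38.0),
    (70, 65.0),
    (80, 110.0),
]


def utilization_at_budget(points, budget_pct):
    """Interpolate the utilization where latency degradation reaches the budget."""
    for (u0, d0), (u1, d1) in zip(points, points[1:]):
        if d0 <= budget_pct <= d1:
            frac = (budget_pct - d0) / (d1 - d0)
            return u0 + frac * (u1 - u0)
    raise ValueError("budget outside the measured range")


if __name__ == "__main__":
    print(f"Target utilization: {utilization_at_budget(samples, 40.0):.0f}%")
```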

Outlier case

As previously mentioned, there was a specific instance that displayed some outlying data points. Our experiment confirmed an already recognized issue where certain instances were not achieving their advertised maximum Turbo Boost CPU frequency. Instead, we observed steady CPU frequencies that fell below the maximum advertised value under low CPU utilization. In the example below, you can see an instance from a CPU family that advertises Turbo Boost frequencies above 3 GHz, but it is only reporting a maximum CPU frequency of 2.8 GHz.

CPU cores frequency

This issue turned out to be caused by a disabled CPU C-state, which prevented the CPU cores from halting even when they were not in use. As a result, these cores were perceived as “busy” by the turbo driver, limiting our ability to take advantage of Turbo Boost benefits with higher CPU frequencies. By enabling the C-state and allowing for optimization and power reduction during idle mode, we observed the expected Turbo Boost behavior. This change had an immediate impact on the CPU time spent by our test workloads. The images below illustrate the prompt changes in CPU frequencies and latency reported following the C-state adjustment.

CPU cores frequency
P50 CPU time on a request
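On Linux, the cpuidle sysfs interface shows which C-states are exposed and whether any have been disabled, which is one way to spot a configuration like the one described above. A minimal sketch, assuming a cpuidle driver is loaded and the standard sysfs paths exist:

```python
# Minimal sketch: list the C-states exposed for cpu0 and whether each is
# disabled, using the Linux cpuidle sysfs interface. Assumes a cpuidle driver
# is loaded; state names and depths vary by platform.
import glob
import os


def cpu0_cstates() -> list:
    states = []
    for state_dir in sorted(glob.glob("/sys/devices/system/cpu/cpu0/cpuidle/state*")):
        with open(os.path.join(state_dir, "name")) as f:
            name = f.read().strip()
        with open(os.path.join(state_dir, "disable")) as f:
            disabled = f.read().strip() == "1"
        states.append((name, disabled))
    return states


if __name__ == "__main__":
    for name, disabled in cpu0_cstates():
        print(f"{name}: {'disabled' if disabled else 'enabled'}")
```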

Upon re-evaluating the percentage change in CPU time, we now observe similar behavior across all instances.

Percentage Increase in P50 Latency vs CPU Utilization

Wrap-up

Since we anticipated many of these insights, our objective was to validate our theories using data from our complex system. We confirmed that performance degrades as CPU utilization increases across different CPU families, and by identifying optimal CPU utilization thresholds we can achieve a better balance between performance and efficiency, ensuring that our infrastructure remains both cost-effective and high performing. Going forward, these insights will inform our resource provisioning strategies and help us maximize the effectiveness of our hardware investments.


Thank you for sticking with us until the end!! A special shout-out to @adrmike, @schlubbi, @terrorobe, the @github/compute-platform and finally the @github/performance-engineering team for their invaluable assistance throughout these experiments, data analysis, and for reviewing the content for accuracy and consistency. ❤️

