Unite.AI, April 22, 19:18
NVIDIA Issues Hotfix for GPU Driver’s Overheating Issue

 

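NVIDIA recently rushed out a hotfix for a problem introduced by an earlier driver update. That update caused systems to falsely report safe GPU temperatures, even as cooling demands quietly climbed toward potentially critical levels. After the affected driver, 576.02, was released, users reported a range of problems, including temperature-monitoring tools that stopped updating, abnormal fan curves, and GPUs overheating under standard loads. Although the VBIOS would usually prevent permanent GPU damage, sustained high temperatures can still degrade performance or damage other components. The issue particularly affected users running AI workflows, who often push their hardware to its thermal limits for long periods. Despite the hotfix, the affected driver is still available for download from NVIDIA's site.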

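🔥 NVIDIA released the 576.02 driver update, but it caused GPU temperatures to be misreported: monitoring tools stopped updating, fan control failed, and GPUs ultimately overheated.

🛠️ Users reported a range of problems on forums, such as tools like MSI Afterburner no longer updating GPU temperatures, and overheating during gaming and normal use. Restarting the monitoring software had no effect; only a full system reboot restored accurate readings.

⚠️ The update appears to have extended a behavior previously limited to Optimus systems, allowing the GPU to enter a low-power state while idle even on non-Optimus systems, which in turn disrupted temperature reporting in third-party tools.

💡 Although the VBIOS usually limits GPU temperature, sustained high temperatures can still degrade performance or damage other components, especially for users running AI workflows, who often push their hardware to its thermal limits for long periods.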

Yesterday NVIDIA rushed out a critical hotfix to contain the fallout from a prior driver release that had triggered alarm across AI and gaming communities by causing systems to falsely report safe GPU temperatures – even as cooling demands quietly climbed toward potentially critical levels.

In NVIDIA's official post around the hotfix release, though only third in the list of stated fixes, the issue is cited as ‘GPU monitoring utilities may stop reporting the GPU temperature after PC wakes from sleep'.

Shortly after the affected Game Ready driver 576.02 was rolled out, a pinned thread at the Stable Diffusion sub-Reddit, titled Read to Save Your GPU!, became a resource for anecdotal issues and user-reported updates concerning the new driver. From these, and other reports around the web, a rough timeline of the emerging problems can be established.

The first Reddit report of the bug seems to have occurred late Friday afternoon UTC, at the ZephyrusG14 subreddit, where the user fricy81 cited a post at the NVIDIA forums (archived).

The user at NVIDIA forums reported that after installing the driver update, tools like MSI Afterburner and in-game monitors such as the one in Call of Duty (which generally access native system readings, much as Task Manager's GPU panel does in Windows) stopped updating GPU temperature readings, freezing at around 35-36°C.

Restarting the monitoring software had no effect, the user stated, and only a full system reboot would restore accurate readings. Tools like HWInfo and NVIDIA's own monitoring app continued to report temperatures correctly. The user emphasized that the issue occurred during normal use, not just after waking the system from sleep.
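The symptom described above, a reading pinned at roughly 35-36°C while the GPU is clearly working, can in principle be caught by a simple plausibility check. The following is a minimal sketch rather than anything NVIDIA or the affected tools implement: the function name `looks_frozen`, the 30-sample window, the 50% load threshold and the 1°C spread are all assumptions chosen for illustration, and how the samples are collected is left open.

```python
def looks_frozen(samples, window=30, min_load_pct=50, max_spread_c=1):
    """Flag a temperature sensor that appears stuck.

    samples: sequence of (temp_c, load_pct) tuples, oldest first.
    Returns True when the last `window` samples all show a busy GPU
    (load >= min_load_pct) while the reported temperature barely moves
    (max - min <= max_spread_c). All thresholds are illustrative
    assumptions, not values taken from any vendor tool.
    """
    recent = list(samples)[-window:]
    if len(recent) < window:
        return False  # not enough history to judge
    temps = [temp for temp, _ in recent]
    busy = all(load >= min_load_pct for _, load in recent)
    return busy and (max(temps) - min(temps)) <= max_spread_c


# A GPU under heavy load whose reading never leaves 36°C is suspicious...
stuck = [(36, 95)] * 30
# ...whereas a reading that climbs under load is what one would expect.
warming = [(50 + i, 95) for i in range(30)]

print(looks_frozen(stuck))
print(looks_frozen(warming))
```

A flat reading at idle is normal, which is why the check only fires when the load is high; a second, independent temperature source would be a more robust cross-check where one is available.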

User feedback across various forums highlighted a general disruption of normal fan curve behavior and an alteration of core thermal regulation, resulting in graphics processing units idling at unexpectedly high temperatures, and alarmingly overheating under what would typically be considered standard operational loads, as detailed in this comment:

‘I could tell something was off. The weather outside was probably around 55°F / 12°C, but I was cooking alive in my room. My window was open, and yet I couldn’t feel any difference. All the fans were running at max, and temps looked fine at first—around 68°C to 72°C after gaming for a while.

‘At first, that seemed normal—until the next morning, when I realized those aren't idle temps, and the fans were still [kicking].

‘I had done some AI overclocking after fixing a few things lately, so I wasn’t sure if the values had just spiked too high. It’s happened once before after installing ASUS AI Suite 3 – the BIOS settings wouldn’t even work properly because of it.

‘Anyway, I went ahead and rolled back to an older driver for now.'

Sub-Optimal

The official release PDF for the 576.02 driver update offers some clues about changes that may have contributed to the new issues. In section 5.5, NVIDIA acknowledges that GPU temperature can be reported incorrectly on NVIDIA Optimus systems, specifically showing zero degrees when no applications are running.

Section 5.5 of the official 576.02 update notes addresses temperature-monitoring issues that seem to have affected a wider range of systems than Optimus systems alone. Source: https://us.download.nvidia.com/Windows/576.02/576.02-win11-win10-release-notes.pdf

The release states:

5.5 GPU Temperature Reported Incorrectly on Optimus Systems

5.5.1 Issue

On Optimus systems, temperature-reporting tools such as Speccy or GPU-Z report that the NVIDIA GPU temperature is zero when no applications are running.

5.5.2 Explanation

On Optimus systems, when the NVIDIA GPU is not being used then it is put into a low-power state. This causes temperature-reporting tools to return incorrect values. Waking up the GPU to query the temperature would result in meaningless measurements because the GPU temperature change as a result.

These tools will report accurate temperatures only when the GPU is awake and running.

NVIDIA Optimus is a GPU switching technology that toggles between integrated and discrete graphics based on application demands, automatically balancing performance against power consumption in order to conserve battery life. For tasks such as gaming or HD video playback, Optimus activates the discrete GPU for better performance; during lighter activities such as web browsing, it reverts to integrated (onboard) graphics.

The update appears to have extended a behavior previously limited to Optimus systems, allowing the affected GPU to enter a low-power state while idle, even when not hosted on an Optimus system, in turn disrupting temperature reporting in third-party tools.

Risk Adjustment

In most scenarios, it’s fair to say that the graphics card's VBIOS would likely have prevented permanent GPU damage. VBIOS enforces thermal and power limits at the firmware level, independently of the driver.

Therefore even if a driver were to cause improper fan behavior or misreport temperatures, the VBIOS should still throttle performance, ramp up fan activity, or else shut down the GPU to prevent hardware failure.

That doesn’t mean the risk was trivial: sustained high temperatures can degrade performance over time or stress adjacent components. Additionally, without a common understanding that an updated driver caused the problem (not least on systems where drivers update ‘silently'), an issue of this nature could mislead a large proportion of affected users, who might attempt remedies for non-existent problems, or even damage their systems by applying irrelevant ‘fixes'.

The errant behavior caused by update 576.02 was particularly alarming for those engaged in artificial intelligence workflows, where high-performance hardware is routinely pushed to its thermal limits for extended durations.

The problematic 576.02 driver inspired a broader rash of complaints after its release in mid-April, despite initial reports that it offered some beneficial performance improvements. Notwithstanding the provision of the hotfix, and the level of disruption that 576.02 seems to have caused, at the time of writing it remains available for download at NVIDIA's site.

Afterglow

The fallout from the faulty update includes several kinds of reported damage and inconvenience. User Frankie_T9000 reported that his GPU crashed on boot due to heat buildup under the faulty update, and only stabilized after undervolting. He commented: ‘looks like its not permanently harmed but need to repaste asap (I have pads coming wednesday) suspect the old thermal paste was aged more by the heat buildup so im putting new paste pads.'

Yesterday another user in the same thread stated: ‘Im using a custom fan curve wit msi afterburner, and it kept showing that my gpu temps were constantly at 27°C, so the fans didn't turn on, which led to overheating issues. I thought it was a me issue but after installing the previous driver it all worked out fine again. Also, the temps arent displayed correctly in taskmanager.'
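The failure mode this user describes follows directly from how a custom fan curve works: the fan speed is looked up from the reported temperature, so a reading stuck at 27°C keeps the fans off no matter how hot the GPU actually gets. The sketch below illustrates that with plain linear interpolation. The curve points and the `fan_speed_pct` helper are hypothetical, and this is not a description of how MSI Afterburner itself is implemented.

```python
def fan_speed_pct(temp_c, curve):
    """Linearly interpolate a fan duty cycle (percent) from a fan curve.

    curve: list of (temp_c, fan_pct) points sorted by temperature.
    Temperatures below the first point or above the last are clamped.
    """
    if temp_c <= curve[0][0]:
        return curve[0][1]
    for (t0, p0), (t1, p1) in zip(curve, curve[1:]):
        if temp_c <= t1:
            return p0 + (p1 - p0) * (temp_c - t0) / (t1 - t0)
    return curve[-1][1]


# Hypothetical curve: fans off below 40°C, 40% at 60°C, 100% at 80°C.
CURVE = [(40, 0), (60, 40), (80, 100)]

reported_temp = 27  # the frozen value the user saw
actual_temp = 85    # an assumed real temperature under load

print(fan_speed_pct(reported_temp, CURVE))  # what the fans were told to do
print(fan_speed_pct(actual_temp, CURVE))    # what they should have been doing
```

Because the control loop trusts its input, the curve is behaving exactly as configured; the fault lies entirely in the stale sensor value it is fed.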

Though NVIDIA (as it states persistently in each hotfix release) often provides hotfixes for particular video games or platforms, the risk of heat damage to or around a GPU is higher for AI practitioners than for gamers. Intensive machine learning processes such as training or sustained inference place a GPU under consistent long-term load – a condition likely to arise only periodically in a game, which may ‘spike' into high usage for a boss battle or a particularly demanding map section, but which is otherwise designed as a compromise between GPU exploitation and system stability.

 

Archive: https://archive.ph/ylVR1

First published Tuesday, April 22, 2025

The post NVIDIA Issues Hotfix for GPU Driver’s Overheating Issue appeared first on Unite.AI.
