arXiv:2507.03176v1 Announce Type: cross Abstract: Deep learning (DL)-based general circulation models (GCMs) are emerging as fast simulators, yet their ability to replicate extreme events outside their training range remains unknown. Here, we evaluate two such models -- the hybrid Neural General Circulation Model (NGCM) and purely data-driven Deep Learning Earth System Model (DL\textit{ESy}M) -- against a conventional high-resolution land-atmosphere model (HiRAM) in simulating land heatwaves and coldwaves. All models are forced with observed sea surface temperatures and sea ice over 1900-2020, focusing on the out-of-sample early-20th-century period (1900-1960). Both DL models generalize successfully to unseen climate conditions, broadly reproducing the frequency and spatial patterns of heatwave and cold wave events during 1900-1960 with skill comparable to HiRAM. An exception is over portions of North Asia and North America, where all models perform poorly during 1940-1960. Due to excessive temperature autocorrelation, DL\textit{ESy}M tends to overestimate heatwave and cold wave frequencies, whereas the physics-DL hybrid NGCM exhibits persistence more similar to HiRAM.