Huawei Debuts AI Inference Tech With China UnionPay, Promises 90% Cut in First-Token Latency

Huawei has launched a new AI inference technology called the UCM Inference Memory Data Manager, aimed at the high latency, high cost, and difficult commercialization that currently burden the inference stage of large AI models. By caching previously processed results and retrieving them from high-performance shared storage, the technology avoids redundant computation and significantly improves inference speed and efficiency. Huawei says UCM can cut "first-token" latency by up to 90%, increase tokens processed per second by up to 22 times in long-sequence scenarios, and lower per-token costs, all without large-scale hardware upgrades. The technology is already being piloted in China UnionPay's financial scenarios, with notable results.

💡 At its core, UCM optimizes the inference process by caching previously processed results and retrieving them from high-performance shared storage, avoiding redundant computation and thereby cutting latency and cost. The approach is especially suited to long-sequence scenarios that must process large volumes of data.

🚀 The technology is built from three key modules: connectors (which integrate with popular inference engines), an accelerator library (which manages the KV Cache), and an adapter (which accelerates access to professional shared storage). Working in concert, these modules target the industry pain points of "slow" and "expensive" AI inference.

💰 Huawei claims UCM can sharply reduce the cost of AI inference. In the pilot with China UnionPay, model inference time fell from 600 seconds to under 10 seconds, a 50-fold improvement, while classification accuracy rose from below 10% to 80%, showing substantial potential for cutting costs and improving performance.

🌐 With China's AI industry trailing leading international models on latency, output speed, and context-window size, Huawei plans to open-source UCM in September, making it compatible with multiple inference engines, storage systems, and hardware vendors, in the hope of pushing the industry toward a common standard for AI inference acceleration.

📈 Huawei notes that demand for AI inference computing now exceeds training demand, accounting for 58.5% of total workloads. UCM is aimed at meeting this growing inference demand and at strengthening the overall competitiveness of China's AI industry.


AsianFin — Huawei has unveiled a new AI inference technology designed to slash latency, lower costs and boost the commercial viability of large AI models, as demand shifts from training to inference workloads.

The system, called UCM Inference Memory Data Manager, aims to improve the speed and efficiency of AI by caching previously processed results and retrieving them from high-performance shared storage rather than recalculating from scratch. Huawei says the approach can cut “first-token” latency by up to 90%, increase tokens processed per second by as much as 22 times in long-sequence scenarios, and reduce per-token costs — all without major new hardware investments.
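The article does not describe UCM's internals, but the pattern it outlines (reusing cached results from shared storage instead of recomputing them) is a known inference optimization. The Python sketch below is a minimal, hypothetical illustration of that general idea; SharedKVStore, compute_kv, prefix_key, and prefill are invented names, and a real system would store attention key/value tensors rather than toy tuples.

```python
import hashlib

def compute_kv(tokens):
    # Stand-in for the model's real prefill pass over the prompt;
    # in practice this is the expensive full-attention computation.
    return [(t, t * 2) for t in tokens]  # fake (key, value) pairs

class SharedKVStore:
    """Toy stand-in for a high-performance shared storage tier."""
    def __init__(self):
        self._store = {}

    def get(self, key):
        return self._store.get(key)

    def put(self, key, kv_states):
        self._store[key] = kv_states

def prefix_key(tokens):
    # Hash the prompt so identical prefixes map to the same cache entry.
    return hashlib.sha256(" ".join(map(str, tokens)).encode()).hexdigest()

def prefill(tokens, store):
    """Return KV states for a prompt, reusing cached work when possible."""
    key = prefix_key(tokens)
    cached = store.get(key)
    if cached is not None:
        # Cache hit: the prefill pass is skipped entirely, which is what
        # shortens time to first token for repeated or shared prompts.
        return cached
    kv_states = compute_kv(tokens)
    store.put(key, kv_states)
    return kv_states

store = SharedKVStore()
prefill([101, 7, 42], store)   # first request: computed and cached
prefill([101, 7, 42], store)   # repeat request: served from the cache
```

On a repeat request the prefill pass is skipped, which is the general mechanism behind the first-token latency reduction Huawei describes.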

UCM consists of three key modules: connectors that integrate with popular inference engines, an accelerator library for hierarchical KV Cache management, and an adapter that speeds up access to professional shared storage. By coordinating inference frameworks, computing power, and storage, Huawei says the system addresses industry pain points of “slow” and “expensive” inference.
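As a rough sketch of what hierarchical KV Cache management can mean in practice, the toy class below keeps entries in fast memory first and demotes least recently used entries toward a shared-storage tier. The tier names, capacities, and LRU policy are assumptions for illustration, not details of Huawei's accelerator library or adapter.

```python
from collections import OrderedDict

class Tier:
    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity
        self.entries = OrderedDict()  # insertion order doubles as LRU order

class TieredKVCache:
    """Look up fast tiers first; evictions cascade toward shared storage."""
    def __init__(self):
        # Capacities are arbitrary toy numbers, not UCM parameters.
        self.tiers = [Tier("HBM", 2), Tier("DRAM", 8),
                      Tier("shared_storage", 1_000_000)]

    def get(self, key):
        for tier in self.tiers:
            if key in tier.entries:
                kv = tier.entries.pop(key)
                self.put(key, kv)  # promote hot entries back to the top
                return kv
        return None  # miss: caller must recompute (the case UCM avoids)

    def put(self, key, kv, level=0):
        tier = self.tiers[level]
        tier.entries[key] = kv
        tier.entries.move_to_end(key)  # mark as most recently used
        if len(tier.entries) > tier.capacity and level + 1 < len(self.tiers):
            # Demote the least recently used entry to the next, slower tier.
            old_key, old_kv = tier.entries.popitem(last=False)
            self.put(old_key, old_kv, level + 1)

cache = TieredKVCache()
for i in range(4):
    cache.put(f"prompt-{i}", object())    # overflow HBM, cascade to DRAM
assert cache.get("prompt-0") is not None  # still served from a lower tier
```

In a production system the bottom tier would sit behind the storage adapter, and hot entries would typically be prefetched back up before they are needed.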

Huawei is piloting UCM with China UnionPay in high-frequency financial scenarios, where response time and accuracy are critical. UnionPay reported that using UCM cut model inference times for customer-service classification from 600 seconds to under 10 seconds — a 50x improvement — while boosting classification accuracy from under 10% to 80%.

According to Huawei, demand for inference computing power now exceeds training demand, accounting for 58.5% of workloads. But China’s AI sector faces higher latency, slower output speeds and smaller context windows than leading overseas models, partly due to lower infrastructure investment and limited access to advanced chips.

Huawei plans to open source UCM in September, making it compatible with multiple inference engines, storage systems, and hardware vendors. The company says it hopes to rally industry players around common standards for AI inference acceleration.
