MarkTechPost@AI February 1
Light3R-SfM: A Scalable and Efficient Feed-Forward Approach to Structure-from-Motion

 

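Light3R-SfM is a new, fully learnable feed-forward structure-from-motion (SfM) model that estimates globally aligned camera poses from unordered image collections without time-consuming global optimization. Unlike traditional SfM techniques, it introduces an implicit global alignment module in the latent space that enables efficient multi-view feature sharing before pairwise 3D reconstruction. The method exchanges global information through a scalable attention mechanism, improving accuracy while reducing runtime. Compared with other methods, Light3R-SfM delivers significant gains in both reconstruction speed and accuracy, offering a more practical solution for SfM on large-scale datasets.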

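🖼️ Light3R-SfM uses a feed-forward network architecture to estimate camera poses directly from images, skipping the time-consuming iterative optimization of traditional SfM.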

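🔗 The method introduces an implicit global alignment module in the latent space, using self-attention and cross-attention to share features across views and improve reconstruction efficiency.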

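⏱️ Light3R-SfM builds its scene graph with a shortest path tree algorithm and merges point clouds with Procrustes alignment, significantly reducing computation and enabling fast reconstruction.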

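🚀 Experimental results show that Light3R-SfM outperforms other feed-forward methods in both speed and accuracy and is better suited to large-scale datasets.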
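
The Procrustes step mentioned above can be illustrated with the classic closed-form similarity-transform solution (Umeyama's method). This is a minimal sketch, not Light3R-SfM's code: it assumes the two point maps being merged contain pixel-aligned 3D points for one shared image, so correspondences are known, and it solves for a scale `s`, rotation `R`, and translation `t`. The function name `procrustes_align` is hypothetical.

```python
import numpy as np


def procrustes_align(src, dst):
    """Closed-form similarity transform (Umeyama, 1991).

    Finds scale s, rotation R and translation t minimizing
    sum_i ||s * R @ src[i] + t - dst[i]||^2 for corresponding (N, 3) point sets.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_src, dst - mu_dst

    # 3x3 cross-covariance between the centred point sets
    cov = dst_c.T @ src_c / len(src)
    U, S, Vt = np.linalg.svd(cov)

    # Flip the last axis if needed so R is a proper rotation (det = +1), not a reflection
    D = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        D[2, 2] = -1.0
    R = U @ D @ Vt

    var_src = (src_c ** 2).sum() / len(src)
    s = np.trace(np.diag(S) @ D) / var_src
    t = mu_dst - s * R @ mu_src
    return s, R, t


# Usage: recover a known transform, then map the whole source point map into the target frame
rng = np.random.default_rng(0)
src_points = rng.normal(size=(100, 3))
theta = 0.3
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta), np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([1.0, -2.0, 0.5])
dst_points = 2.0 * src_points @ R_true.T + t_true

s, R, t = procrustes_align(src_points, dst_points)
aligned = s * src_points @ R.T + t
```

The whole solve is one SVD of a 3×3 matrix plus work linear in the number of points, which is why this kind of alignment is far cheaper than iterative bundle adjustment.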

Structure-from-motion (SfM) focuses on recovering camera positions and building 3D scenes from multiple images. This process is important for tasks like 3D reconstruction and novel view synthesis. A major challenge is processing large image collections efficiently while maintaining accuracy. Many approaches jointly optimize camera poses and scene geometry, which substantially increases computational cost, and scaling SfM to large datasets remains difficult because speed, accuracy, and memory consumption must be traded off against one another.

Currently, SfM methods follow two main approaches: incremental and global. Incremental methods build 3D scenes step by step, starting from two images, while global methods align all cameras at once before reconstruction. Both rely on feature detection, matching, 3D triangulation, and optimization, leading to high computational costs and memory usage. Some learning-based methods improve accuracy but struggle when images have low visual overlap. Others reduce processing time by limiting pairwise comparisons, but their optimization-based alignment remains slow. Despite these advances, current techniques remain resource-intensive, making it difficult to scale SfM to large datasets or dynamic scenes.

To address these issues, researchers from NVIDIA, the Vector Institute, and the University of Toronto proposed Light3R-SfM, a fully learnable feed-forward Structure-from-Motion (SfM) model that estimates globally aligned camera poses from unordered image collections without computationally expensive global optimization. Unlike conventional SfM techniques, it incorporates an implicit global alignment module in the latent space, which shares features efficiently across views before pairwise 3D reconstruction is performed. It also differs from Spann3R, which uses an explicit memory bank for online reconstruction and can drift over time; Light3R-SfM instead targets offline reconstruction from unordered images. A scalable attention mechanism handles global information exchange, improving accuracy while reducing runtime. Compared with MASt3R-SfM, Light3R-SfM reconstructs a 200-image scene in 33 seconds instead of 27 minutes, a 49× speedup.
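
The latent global alignment idea can be sketched with plain scaled dot-product attention. This is a hypothetical, stripped-down illustration, not the paper's architecture: it assumes each image is summarized by one mean-pooled global token, and it omits learned query/key/value projections, multiple heads, normalization, and MLPs; all function names are made up. It shows why such a scheme can scale: full attention over all N·T image tokens costs O((NT)²), whereas here self-attention runs over only N global tokens and each image token attends to just those N tokens.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    """Single-head scaled dot-product attention: softmax(q k^T / sqrt(d)) v."""
    d = q.shape[-1]
    weights = softmax(q @ np.swapaxes(k, -1, -2) / np.sqrt(d))
    return weights @ v


def latent_global_alignment(tokens):
    """Hypothetical sketch. tokens: (N images, T tokens per image, d) encoder features."""
    # One global token per image (assumed: mean pooling), shape (N, d)
    global_tokens = tokens.mean(axis=1)
    # Self-attention among the N global tokens: O(N^2 d), with a residual connection
    global_tokens = global_tokens + attention(global_tokens, global_tokens, global_tokens)
    # Cross-attention: each of the N*T image tokens reads from the N global tokens: O(N^2 T d)
    g = global_tokens[None]  # (1, N, d), broadcast over the image axis
    return tokens + attention(tokens, g, g)


# Usage: 4 images, 16 tokens each, 8-dimensional features
rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 16, 8))
aligned_tokens = latent_global_alignment(tokens)
```

Every image's tokens come back with the same shape but now carry information pooled from all other images, which is the role the article attributes to the latent alignment module before pairwise decoding.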

The framework consists of five stages: encoding images into feature tokens, performing latent global alignment through self- and cross-attention, constructing a scene graph using the shortest path tree (SPT) algorithm, decoding pairwise point maps, and merging them into a globally aligned 3D reconstruction without traditional global optimization. The method reduces redundant computation by filtering low-overlap image pairs and aligns point maps using Procrustes alignment, which is computationally efficient compared to conventional bundle adjustment. 
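
A shortest path tree can be sketched with Dijkstra's algorithm over a complete image graph. The edge cost used here (one minus the cosine similarity of per-image global embeddings), the choice of image 0 as the root, and the function names are assumptions for illustration; the article does not say how Light3R-SfM weights its graph. What the sketch illustrates is the pruning: the tree keeps only N - 1 image pairs for pairwise decoding instead of all N(N - 1)/2, and emitting edges in the order Dijkstra settles nodes guarantees that each parent is placed before its children when point maps are merged.

```python
import heapq
import math


def cosine_cost(a, b):
    """Assumed edge cost: 1 - cosine similarity, clamped at 0 against round-off."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return max(0.0, 1.0 - dot / (na * nb))


def shortest_path_tree(embeddings, root=0):
    """Return (parent, child) edges of a shortest path tree over the complete image graph.

    Edges are listed in the order Dijkstra settles the child nodes,
    so every parent appears before any of its children.
    """
    n = len(embeddings)
    dist = [math.inf] * n
    parent = [None] * n
    settled = [False] * n
    dist[root] = 0.0
    heap = [(0.0, root)]
    order = []
    while heap:
        d, u = heapq.heappop(heap)
        if settled[u]:
            continue  # stale heap entry
        settled[u] = True
        if parent[u] is not None:
            order.append((parent[u], u))
        for v in range(n):
            if v == u or settled[v]:
                continue
            nd = d + cosine_cost(embeddings[u], embeddings[v])
            if nd < dist[v]:
                dist[v] = nd
                parent[v] = u
                heapq.heappush(heap, (nd, v))
    return order


# Usage: three embeddings at 0, 45 and 90 degrees
embeddings = [[1.0, 0.0], [math.cos(math.pi / 4), math.sin(math.pi / 4)], [0.0, 1.0]]
tree = shortest_path_tree(embeddings)  # → [(0, 1), (1, 2)]
```

Here the direct 0-2 edge costs 1.0 while the path through image 1 costs about 0.59, so the tree becomes the chain 0-1-2 rather than a star around the root.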

Researchers evaluated multi-view pose estimation on the Tanks and Temples dataset, comparing Light3R-SfM with optimization-based (OPT) and feed-forward (FFD) approaches under different view settings. Using relative rotation accuracy (RRA), relative translation accuracy (RTA), absolute translation error (ATE), registration rate, and runtime on an NVIDIA V100-32GB as metrics, they found that Light3R-SfM significantly outperformed Spann3R, the only other FFD method, achieving 145% higher RRA and 84% higher RTA while running nearly twice as fast. OPT methods such as COLMAP and GLOMAP achieved better accuracy through bundle adjustment but required up to 43× more runtime, which limits their scalability. Spann3R struggled with unordered images and incurred high computational costs from exhaustive pairwise comparisons, whereas Light3R-SfM proved both more efficient and more accurate, making it the more practical solution.
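
The pairwise metrics named above can be made concrete with one common definition: for every image pair, compare the ground-truth and predicted relative rotations by their geodesic angle, and the relative translations by the angle between their directions (so the unknown global scale of a feed-forward reconstruction does not matter). RRA and RTA at a threshold are then the fractions of pairs whose error falls below that many degrees. This is a sketch under assumptions: the world-to-camera pose convention, the function names, and the thresholds are not taken from the Light3R-SfM paper.

```python
import itertools

import numpy as np


def rotation_angle_deg(R):
    # Geodesic angle of a rotation matrix; clip guards against round-off outside [-1, 1]
    cos = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    return np.degrees(np.arccos(cos))


def vector_angle_deg(a, b, eps=1e-12):
    # Angle between translation directions, independent of their lengths
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps)
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))


def relative_pose(Ri, ti, Rj, tj):
    # Assumed convention x_cam = R @ x_world + t; returns the pose of camera j relative to camera i
    Rij = Rj @ Ri.T
    return Rij, tj - Rij @ ti


def rra_rta(gt, pred, tau_deg):
    """gt, pred: lists of (R, t) poses. Returns (RRA, RTA) as fractions of pairs under tau_deg."""
    rot_ok, trans_ok, pairs = 0, 0, 0
    for i, j in itertools.combinations(range(len(gt)), 2):
        Rg, tg = relative_pose(*gt[i], *gt[j])
        Rp, tp = relative_pose(*pred[i], *pred[j])
        rot_ok += int(rotation_angle_deg(Rg.T @ Rp) < tau_deg)
        trans_ok += int(vector_angle_deg(tg, tp) < tau_deg)
        pairs += 1
    return rot_ok / pairs, trans_ok / pairs
```

Scaling every predicted translation by the same factor leaves both scores unchanged, whereas rotating one of three cameras by 30° drops RRA at a 15° threshold to 1/3, because two of the three pairs involve that camera.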

In summary, the proposed method replaces traditional matching and global optimization with 3D foundation models and a scalable latent alignment module. This approach reduces runtime while maintaining competitive accuracy, offering a practical alternative to optimization-based methods. However, it still has limitations in scaling to large image collections and in accuracy at tight error thresholds, likely due to the low input image resolution. Despite these limitations, the method could serve as a foundation for future work, with likely directions including better scalability, higher accuracy, and more robust feature alignment.



The post Light3R-SfM: A Scalable and Efficient Feed-Forward Approach to Structure-from-Motion appeared first on MarkTechPost.
