Keywords: Master's Thesis Defense

Editor's Note

In May 2025, the Class of 2022 master's thesis defenses at the Center on Frontiers of Computing Studies (CFCS) were successfully completed. In accordance with the arrangements of the School of Computer Science, the Center organized the defenses in an orderly fashion, with Executive Director Prof. Xiaotie Deng, Deputy Director Prof. Yizhou Wang, Dr. Hao Dong, and Dr. He Wang serving on the defense committees.

Four Class of 2022 master's students from the School of Computer Science, Peking University, presented and successfully defended their theses. Congratulations!

Group photo from the defense

Master's Thesis | Class of 2022, School of Computer Science

Affordance Representation for Robot Manipulation: One-Shot Visual-Prompt Learning and Geometric Control

Defending student: Taewhan Kim

Advisor: Assistant Professor Hao Dong

Conventional affordance pipelines rely on dense 3D point-cloud sampling or repeated robot trials, making them expensive to annotate, difficult to scale across object categories, and fragile when models trained in simulation are deployed in the physical world. These methods also struggle with limited dataset diversity and a persistent sim-to-real gap that degrades performance under real-world lighting, occlusion, and sensory noise. As a result, robots often fail to identify actionable regions on articulated objects—such as drawer fronts or door flaps—without extensive task- or robot-specific tuning. 

This thesis reframes affordance perception as a one-shot 2D segmentation problem driven purely by visual prompts. A pre-trained Vision Transformer is lightly fine-tuned on a 9.9k-image custom dataset that fuses simulated and real scenes, allowing the model to predict manipulable regions from a single RGB image and a single annotated prompt, with no robot-specific data. The predicted mask is fused with surface-normal geometry; a gradient-filtered contact-point selector and a quaternion-corrected controller then convert perception into stable push, pull, or rotate motions on articulated objects. This modular pipeline achieves real-time inference, requires minimal supervision, and generalizes to unseen object categories in both simulation and real-robot trials, outperforming state-of-the-art baselines in segmentation quality and task success.
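To picture the perception-to-control handoff, the sketch below is a minimal, hypothetical illustration (not the thesis code): it assumes a predicted binary affordance mask and a per-pixel surface-normal map, filters candidate contact pixels by how quickly the normals vary (a simple stand-in for gradient filtering), and returns a contact pixel plus an approach direction along the negative surface normal.

```python
import numpy as np

def select_contact_point(mask, normals, grad_thresh=0.1):
    """Pick a contact pixel inside an affordance mask (illustrative sketch).

    mask:    (H, W) bool array from the one-shot segmentation model.
    normals: (H, W, 3) unit surface normals (e.g. estimated from depth).
    Returns (row, col) of the chosen contact pixel and an approach direction.
    """
    # Gradient filtering: discard pixels where the normal field changes
    # quickly (edges, rims), keeping locally flat regions of the mask.
    gy, gx = np.gradient(normals, axis=(0, 1))
    normal_variation = np.linalg.norm(gx, axis=-1) + np.linalg.norm(gy, axis=-1)
    candidates = mask & (normal_variation < grad_thresh)
    if not candidates.any():
        candidates = mask  # fall back to the raw mask

    # Choose the candidate closest to the mask centroid for stability.
    rows, cols = np.nonzero(candidates)
    centroid = np.array([rows.mean(), cols.mean()])
    idx = np.argmin((rows - centroid[0]) ** 2 + (cols - centroid[1]) ** 2)
    r, c = rows[idx], cols[idx]

    # Approach along the negative surface normal (push into the surface).
    approach_dir = -normals[r, c] / np.linalg.norm(normals[r, c])
    return (r, c), approach_dir
```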

The work makes three main contributions:
(1) Minimalist one-shot affordance segmentation framework. We propose a streamlined visual framework that predicts affordance masks using a fine-tuned foundation vision model and only one or two prompt images. This eliminates the need for large-scale robot-interaction datasets or extensive annotation and supports real-time inference on novel scenes.
(2) No robot-specific training or interaction required. Unlike many existing affordance-learning methods that depend on physical interaction data or robot-specific tuning, our framework operates solely on visual input, removing the need for robot-specific datasets or retraining and enabling rapid deployment across different platforms and environments.
(3) Synthetic dataset for scalable affordance learning. We curate and release a dataset of 9.9k affordance-labeled RGB images covering a wide range of common indoor objects. Despite its compact size, the dataset suffices for generalization to real-world scenes and objects, demonstrating that affordance understanding can emerge from scalable synthetic supervision.

Master's Thesis | Class of 2022, School of Computer Science

Sim-to-real RL for Generalizable Articulated Object Manipulation

Defending student: Do Tan Dung

Advisor: Assistant Professor He Wang

Within the robotics community, manipulation tasks such as grasping, pushing, and tossing are among the most challenging and widely studied problems. With the rapid progress of locomotion pipelines in recent years, state-of-the-art manipulation frameworks play a vital role in realizing complex loco-manipulation systems, which are the ultimate solution to many real-world problems in both household and industrial settings. Among manipulation tasks, articulated object manipulation poses a unique challenge compared to rigid object manipulation: the delicate skeletons of these objects require an understanding of their kinematic structure, together with compliant robot motions, to ensure safe and effective interaction.

Most state-of-the-art frameworks are either purely vision-based or reinforcement-learning-based. Purely vision-based methods typically learn to predict the object's kinematic structure and rely on heuristic rules to execute robot actions; they often ignore robot-object interaction during execution, which hinders adaptive behavior. Reinforcement-learning methods usually follow a state-to-vision distillation process: a state-based policy is first trained in simulation and then distilled, teacher to student, into a vision-based policy that can be transferred to the real world. However, this approach inherently introduces a large sim-to-real gap and limits the framework's inference frequency to that of the low-frequency vision backbone.
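To make the distillation pattern concrete, the sketch below is a hypothetical illustration rather than any particular thesis or baseline implementation; `teacher`, `student`, and the batch keys are placeholder names. It shows the usual step: a privileged state-based teacher labels observations with actions, and a vision-based student is regressed onto those labels.

```python
import torch
import torch.nn.functional as F

def distill_step(teacher, student, batch, optimizer):
    """One teacher-to-student distillation update (illustrative only).

    batch["state"]:  privileged simulator state, e.g. joint angles, object pose.
    batch["vision"]: the student's observation, e.g. rendered RGB-D frames.
    """
    with torch.no_grad():
        target_action = teacher(batch["state"])    # privileged teacher labels

    pred_action = student(batch["vision"])         # vision-only student policy
    loss = F.mse_loss(pred_action, target_action)  # behavior-cloning loss

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```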

To address these problems, we present a novel RL-based pipeline equipped with variable impedance control and observation-history-based motion adaptation for generalizable articulated object manipulation, focusing on smooth and dexterous motion during zero-shot sim-to-real transfer. To mitigate the sim-to-real gap, our pipeline reduces its reliance on vision: rather than feeding raw visual data (RGB-D images or point clouds) directly to the policy, it first extracts useful low-dimensional features via off-the-shelf modules. We further narrow the gap by inferring object motion and intrinsic object properties from the observation history and by using impedance control both in simulation and in the real world. In addition, we develop a carefully designed training setup with extensive randomization and a specialized task-aware and motion-aware reward system that enables multi-stage, end-to-end manipulation without heuristic motion planning. To the best of our knowledge, our policy is the first to report an 84% success rate in the real world, demonstrated through extensive experiments on a variety of unseen objects.
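A rough sketch of how such a policy's output could be turned into compliant joint torques, and how an observation history could be assembled, is given below. All names, shapes, and gains are illustrative assumptions, not the thesis implementation; the torque rule is the standard impedance-control law tau = Kp (q* - q) - Kd dq.

```python
import numpy as np

def impedance_torques(q, dq, policy_output, damping_ratio=1.0):
    """Convert a policy action into joint torques via impedance control.

    q, dq:         current joint positions and velocities, shape (n,).
    policy_output: dict with a target joint offset "delta_q" and per-joint
                   stiffness "kp", both shape (n,). Illustrative only.
    """
    q_target = q + policy_output["delta_q"]
    kp = np.clip(policy_output["kp"], 10.0, 400.0)  # keep stiffness in a safe range
    kd = 2.0 * damping_ratio * np.sqrt(kp)          # critically damped by default
    return kp * (q_target - q) - kd * dq            # compliant tracking of the target

def stack_history(obs_history, horizon=10):
    """Concatenate the last `horizon` low-dimensional observations
    (proprioception plus extracted object features) into one policy input."""
    recent = list(obs_history[-horizon:])
    if len(recent) < horizon:                       # pad at the start of an episode
        recent = [recent[0]] * (horizon - len(recent)) + recent
    return np.concatenate(recent)
```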

Master's Thesis | Class of 2022, School of Computer Science

Generalizable Quadrupedal Whole-body Grasping

Defending student: Javokhirbek Rajabov

Advisor: Assistant Professor He Wang

Accurate whole-body grasping in mobile robots remains a fundamental challenge due to the need for seamless coordination between locomotion and manipulation. This thesis presents a robust and generalizable whole-body loco-manipulation framework that tightly integrates reinforcement learning with model-based planning to leverage spatial awareness for real-world performance. Designed around a legged robot equipped with a single arm-mounted camera, the system features a modular architecture consisting of a robust low-level policy trained via reinforcement learning for 5D command execution, and a grasp-aware high-level planner guided by a novel metric: the Generalized Oriented Reachability Map (GORM). This approach enables dynamic task execution across a large vertical workspace—from floor level to above body height—including complex scenarios such as grasping transparent or irregularly shaped objects. The system achieves a state-of-the-art one-shot grasping accuracy of 89% in real-world experiments. Extensive evaluations in both simulation and hardware validate the framework's effectiveness, robustness, and adaptability, offering valuable insights for advancing general-purpose, real-world whole-body loco-manipulation in legged robotic systems.
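As a loose illustration of how a precomputed reachability map can guide base placement for grasping (the thesis's GORM metric is more general; the discretization, scoring, and names below are hypothetical), the sketch scores candidate base poses by looking up how reachable the target grasp pose is from each of them and returns the best candidate.

```python
import numpy as np

def score_base_poses(reach_map, grasp_pos, grasp_yaw, base_candidates):
    """Rank candidate base poses by reachability of a target grasp (sketch only).

    reach_map:       dict mapping a discretized relative grasp pose
                     (dx, dy, dz, dyaw) to a reachability score in [0, 1];
                     a hypothetical stand-in for an oriented reachability map.
    grasp_pos:       (3,) target grasp position in the world frame.
    grasp_yaw:       desired gripper yaw in the world frame.
    base_candidates: (N, 3) array of candidate base poses (x, y, yaw).
    """
    scores = []
    for bx, by, byaw in base_candidates:
        # Express the grasp target in the candidate base frame.
        c, s = np.cos(-byaw), np.sin(-byaw)
        dx = c * (grasp_pos[0] - bx) - s * (grasp_pos[1] - by)
        dy = s * (grasp_pos[0] - bx) + c * (grasp_pos[1] - by)
        key = (round(dx, 1), round(dy, 1), round(grasp_pos[2], 1),
               round(grasp_yaw - byaw, 1))
        scores.append(reach_map.get(key, 0.0))     # unreachable poses score 0
    return base_candidates[int(np.argmax(scores))]
```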

Master's Thesis | Class of 2022, School of Computer Science

A High-Resolution, Multi-Camera, Real-Time Motion Analysis System

Defending student: Yong Feng (冯涌)

Advisor: Professor Yizhou Wang

With the rapid development of computer vision and artificial intelligence, sports video analysis has become an important research direction for athletic training and instruction, health assessment, and smart-venue construction. By analyzing human motion and object movement in video, researchers can make progress across application scenarios such as exercise health assessment, sports training assistance, and smart venues. In competitive sports in particular, real-time motion capture and analysis not only helps athletes refine their technique but also provides recreational athletes with scientifically grounded guidance. Developing an efficient, comprehensive, and extensible motion analysis system therefore has significant research value and practical importance.

However, a single personal viewpoint cannot cover multi-player team sports, making venue-level global statistics and tactical analysis infeasible and limiting how deeply and broadly such systems can serve smart-venue construction. Moreover, for real-time video capture of high-speed motion and multi-person scenes, existing work still struggles to balance high accuracy with low latency, and most solutions lack generality, failing to maintain stable performance across different sports and network conditions. This thesis therefore proposes a real-time motion analysis system based on high-resolution, multi-camera capture. The platform consists of a mobile data-acquisition vehicle, industrial cameras, and high-performance servers, and combines low-latency 5G transmission with algorithms for multi-view keypoint matching, 3D human pose estimation, and object detection and tracking, enabling efficient and accurate analysis of human motion in complex, dynamic scenes.
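Multi-view keypoint matching and 3D pose estimation ultimately reduce to triangulating matched 2D keypoints from calibrated cameras. The sketch below is a standard direct linear transform (DLT) with hypothetical inputs rather than the system's actual code: it recovers one 3D joint position from its 2D detections and the cameras' projection matrices.

```python
import numpy as np

def triangulate_point(proj_mats, points_2d):
    """Triangulate one 3D point from 2D detections in several calibrated views.

    proj_mats: list of 3x4 camera projection matrices P_i = K_i [R_i | t_i].
    points_2d: list of (u, v) pixel detections of the same keypoint.
    Returns the 3D point in world coordinates (standard DLT; illustrative).
    """
    A = []
    for P, (u, v) in zip(proj_mats, points_2d):
        # Each view contributes two linear constraints on the homogeneous point X.
        A.append(u * P[2] - P[0])
        A.append(v * P[2] - P[1])
    A = np.stack(A)

    # The solution is the right singular vector with the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]
```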

In building the system, this thesis focuses on two core problems, real-time video transmission optimization and multi-view data fusion, and studies key techniques including device calibration, time synchronization, spatial alignment, and joint network-algorithm optimization from the perspectives of system design, algorithm development, and performance evaluation, thereby characterizing the system's performance envelope. Experiments on basketball, badminton, and other sports show that the system achieves strong pose-estimation accuracy and object-detection performance, and remains stable and accurate under sparse camera views and weak network conditions, demonstrating good scalability and application potential. By taking a hardware-software co-design perspective, this work provides a reliable, efficient, and extensible technical solution for digitizing physical-education instruction and for in-depth mining of professional sports data, laying a solid foundation for designing sports video capture and analysis systems in broader scenarios.
