MarkTechPost@AI, October 20, 2024
This AI Paper Explores If Human Visual Perception can Help Computer Vision Models Outperform in Generalized Tasks

This article examines the relationship between human visual perception and computer vision models. Through a range of experiments and analyses, the researchers find that models aligned with human visual perception perform well across many vision tasks, but also face issues such as overfitting and bias propagation.

🧠 Humans possess extraordinary perceptual judgment, and aligning computer vision models with it can improve model performance. Alignment makes models more sensitive to attributes such as scene layout and subject location, and thus more human-like.

📄 Researchers from MIT and UC Berkeley analyze this question in depth. Their paper studies how models aligned with human visual perception perform across a range of downstream vision tasks, finetuning and evaluating state-of-the-art models.

🎯 The authors use a dataset of image triplets annotated with human similarity judgments, formulate an objective function that captures spatial representations, and finetune several state-of-the-art Vision Transformer models on this data.

👍 Human-aligned models perform well on most vision tasks, outperforming base models on semantic segmentation, depth estimation, and object counting, but underperform on classification.

🔍 The study also examines the roles of training data and training method, finding that the NIGHTS dataset has the largest impact, while other datasets are less effective because they fail to capture the required mid-level perceptual features.

Human beings possess extraordinary innate perceptual judgment, and aligning computer vision models with it can substantially improve their performance. Attributes such as scene layout, subject location, camera pose, color, perspective, and semantics give us a clear picture of the world and the objects within it. Aligning vision models with human visual perception makes them sensitive to these attributes and more human-like. While it has been established that shaping vision models along the lines of human perception helps attain specific goals in certain contexts, such as image generation, their impact in general-purpose roles has yet to be ascertained. Findings so far are nuanced: naive incorporation of human perceptual judgments can badly harm models and distort their representations. It is also debated whether the model itself matters, or whether results depend mainly on the objective function and the training data. The sensitivity and implications of the labels complicate the puzzle further. All these factors make it harder to understand how human perceptual abilities bear on vision tasks.

Researchers from MIT and UC Berkeley analyze this question in depth. Their paper, “When Does Perceptual Alignment Benefit Vision Representations?”, investigates how models aligned with human visual perception perform on various downstream vision tasks. The authors finetuned state-of-the-art Vision Transformers (ViTs) on human similarity judgments over image triplets and evaluated them across standard vision benchmarks. They introduce the idea of a second pretraining stage, which aligns the feature representations of large vision models with human judgments before applying them to downstream tasks.
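
To make the setup concrete, a triplet record in such a dataset might look like the sketch below. This is a minimal Python illustration; the field names are hypothetical, not the NIGHTS dataset's actual schema.

```python
# Minimal sketch of a human-similarity triplet record (illustrative
# field names, not the NIGHTS dataset's actual schema).
from dataclasses import dataclass

@dataclass
class SimilarityTriplet:
    ref_path: str      # reference image
    img_a_path: str    # candidate image A
    img_b_path: str    # candidate image B
    human_choice: int  # 0 if annotators judged A closer to the reference, 1 for B
```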

To understand this further, consider the image triplets mentioned above. The authors used the well-known synthetic NIGHTS dataset, whose image triplets are annotated with forced-choice human similarity judgments: annotators chose which of two candidate images was more similar to a reference image. They formulate a patch-alignment objective to capture the spatial information present in patch tokens and to propagate visual attributes from global annotations; instead of computing the loss only between the global CLS tokens of the Vision Transformer, they use both the CLS token and pooled patch embeddings, optimizing local patch features jointly with the global image label. Various state-of-the-art Vision Transformer models, such as DINO and CLIP, were then finetuned on this data using Low-Rank Adaptation (LoRA). The authors also incorporated synthetic images in triplets with SynCLR to measure the performance delta.
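
The patch-alignment idea can be sketched in a few lines of PyTorch. This is a hedged illustration rather than the paper's exact objective: the token sequences are assumed to come from a ViT with the CLS token first, and the hinge margin and the simple sum of CLS and pooled patch embeddings are illustrative choices, not the authors' hyperparameters.

```python
import torch
import torch.nn.functional as F

def embed(tokens: torch.Tensor) -> torch.Tensor:
    """Combine the global CLS token with average-pooled patch tokens.

    `tokens` has shape (B, 1 + N, D) with CLS first; the sum is an
    illustrative way to optimize local and global features jointly.
    """
    cls_tok = tokens[:, 0]             # (B, D) global embedding
    patch_avg = tokens[:, 1:].mean(1)  # (B, D) pooled local patch embeddings
    return F.normalize(cls_tok + patch_avg, dim=-1)

def patch_alignment_loss(ref, a, b, choice, margin=0.05):
    """Hinge loss pushing the human-preferred image closer to the reference.

    ref, a, b: ViT token sequences; choice: tensor of 0s (A preferred) / 1s (B).
    """
    e_ref, e_a, e_b = embed(ref), embed(a), embed(b)
    sim_a = (e_ref * e_a).sum(-1)  # cosine similarity, reference vs. A
    sim_b = (e_ref * e_b).sum(-1)  # cosine similarity, reference vs. B
    # Signed gap is positive when the preferred image is already closer.
    gap = torch.where(choice == 0, sim_a - sim_b, sim_b - sim_a)
    return F.relu(margin - gap).mean()
```

In a LoRA setup, as the article describes, only low-rank adapter weights would receive gradients from this loss while the pretrained backbone stays frozen.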

These models performed better on vision tasks than the base Vision Transformers. On dense prediction tasks, human-aligned models outperformed base models in over 75% of cases for both semantic segmentation and depth estimation. In the realm of generative vision and LLMs, retrieval-augmented generation was evaluated with a vision-language model; here too, the results favored prompts retrieved by human-aligned models, which boosted classification accuracy across domains. In object counting, the aligned models outperformed the base models in more than 95% of cases, and a similar trend holds for instance retrieval. The aligned models underperformed only on classification tasks, which hinge on high-level semantic understanding.
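
The retrieval-style evaluations reduce to a nearest-neighbor lookup in the aligned embedding space. A minimal sketch, assuming L2-normalized embeddings from a function like `embed` above (the variable names are illustrative):

```python
import torch

def retrieve_topk(query_emb: torch.Tensor,
                  support_embs: torch.Tensor,
                  k: int = 5) -> torch.Tensor:
    """Return indices of the k support images nearest to the query.

    With L2-normalized embeddings, the dot product is cosine similarity.
    The retrieved images' labels or captions can then condition a
    vision-language model's prompt, as in the RAG evaluation.
    """
    sims = support_embs @ query_emb  # (N,) similarity to each support image
    return sims.topk(k).indices
```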

The authors also addressed whether the training data matters more than the training method. For this purpose, additional image-triplet datasets were considered. The results were striking: the NIGHTS dataset had by far the largest impact, while the others barely moved the needle. The perceptual cues captured in NIGHTS, such as style, pose, color, and object count, play a crucial role here; the other datasets fell short because they could not capture the required mid-level perceptual features.

Overall, human-aligned vision models performed well in most cases. However, these models are prone to overfitting and bias propagation. If the quality and diversity of human annotations are ensured, visual intelligence could be taken a notch higher.


Check out the Paper, GitHub, and Project. All credit for this research goes to the researchers of this project.


