AIhub · May 7, 18:29
#AAAI2025 outstanding paper – DivShift: Exploring domain-specific distribution shift in large-scale, volunteer-collected biodiversity datasets

This article examines how the various biases in citizen science data (from platforms such as iNaturalist) affect the performance of deep learning models. The study proposes the DivShift framework to quantify spatial, temporal, taxonomic, observer-behavior, and sociopolitical biases. Using the DivShift-NAWC dataset, the researchers analyze how these biases affect plant species identification models and offer targeted recommendations for improving model training, increasing the models' value for biodiversity conservation.

🌱 Citizen science data carry multiple biases: spatial, temporal, taxonomic, observer-behavior, and sociopolitical. These biases distort AI model training and reduce accuracy across regions and species.

💡 The DivShift framework quantifies the impact of bias: it partitions the data into subsets and uses the Jensen-Shannon divergence (JSD) to measure the differences between them. Comparing these differences with model performance reveals how data bias affects the models.

🏞️ Construction of the DivShift-NAWC dataset: the dataset contains nearly 7.5 million plant images from iNaturalist, partitioned by spatial, temporal, taxonomic, observer-engagement, and sociopolitical factors. This lets the researchers study in depth how each bias affects model performance.

🔍 Findings on how bias affects model performance: models perform better in urban areas but worse in wilderness areas; sustained observation data benefit model performance more than data from short-term events; balancing the dataset improves detection of rare species at the cost of common ones; contributions from frequent observers raise model accuracy, while contributions from occasional users hurt it.

✅ Recommendations for each bias: supplement citizen science data in wilderness areas, encourage sustained data collection, choose data curation strategies according to conservation goals, prioritize contributions from frequent observers, and apply additional validation in regions with sparse data.

Citizen science platforms like iNaturalist have increased in popularity, fueling the rapid development of biodiversity foundation models. However, such data are inherently biased, and are collected in an opportunistic manner that often skews toward certain locations, times, species, observer experience levels, and states.

Our work, titled “DivShift: Exploring Domain-Specific Distribution Shifts in Large-Scale, Volunteer-Collected Biodiversity Datasets,” tackles the challenge of quantifying the impacts of these biases on deep learning model performance.

Biases in citizen science data

Biases present in biodiversity data include spatial bias, temporal bias, taxonomic bias, observer behavior bias, and sociopolitical bias.

AI models typically assume training data to be independent and identically distributed (i.i.d.). But biodiversity data collected by volunteers defies these assumptions. For instance, urban areas or charismatic species like blooming flowers receive disproportionate attention. These biases have been well-documented in the ecology literature but are often overlooked or ignored when deep learning models are developed from these data. Such uneven sampling can limit a model’s accuracy in less-visited regions and for less charismatic species. Furthermore, without addressing these biases, models trained on citizen science data risk misleading conservation efforts and failing where they’re most needed.

The DivShift framework and DivShift-NAWC dataset

The Diversity Shift (DivShift) framework (a) quantifies the impacts of domain-specific biases by first partitioning the data into partitions P_A and P_B using expert-verified types of bias. Bias impacts are then quantified by measuring the accuracy of models trained on P_A^train using P_A^test and P_B^test, which is further compared to (b) the distribution shift from the labels in P_A^train to the labels in P_A^test and P_B^test, measured with the Jensen-Shannon divergence (JSD).

We introduce DivShift, a framework that casts biases as “domain-specific distribution shifts” to analyze their impact on AI model performance. Volunteer observers sample the true distribution of biodiversity, J, through a biased collection process, producing our dataset D. We partition D into P_A and P_B according to a known bias in the data and measure the difference between the partitions via the Jensen-Shannon divergence (in this case, over the species label distributions). We then compare this measure to deep learning model performance. The framework gives insight into how models generalize from data-rich to data-sparse regions and from in-domain to out-of-domain applications, and helps quantify how sampling bias skews accuracy, especially across sensitive taxa or geographic subgroups.
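The label-distribution comparison above can be sketched as a small function. This is a minimal illustration of the Jensen-Shannon divergence on discrete species-label distributions (base-2 logs, so values fall in [0, 1]), not the paper's implementation:

```python
import numpy as np

def jensen_shannon_divergence(p, q, eps=1e-12):
    """JSD between two discrete label distributions (base-2 logs, so JSD in [0, 1])."""
    p = np.asarray(p, dtype=float) / np.sum(p)
    q = np.asarray(q, dtype=float) / np.sum(q)
    m = 0.5 * (p + q)  # mixture distribution
    kl = lambda a, b: np.sum(a * np.log2((a + eps) / (b + eps)))  # KL divergence
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Identical species-label distributions -> divergence ~0
print(jensen_shannon_divergence([0.5, 0.5], [0.5, 0.5]))
# Disjoint label supports -> maximal divergence ~1 (base 2)
print(jensen_shannon_divergence([1.0, 0.0], [0.0, 1.0]))
```

A large JSD between a training partition and a test partition flags a label-distribution shift that is likely to show up as a drop in model accuracy.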

(a) Density plot of DivShift-NAWC’s iNaturalist observations. Observations are skewed toward U.S. and coastal states. (b) DivShift-NAWC spans a diverse set of habitats and ecosystems, (c) along with climates. (d) DivShift-NAWC observations are concentrated in human-modified areas [2].

We pair this framework with the DivShift-North American West Coast (NAWC) dataset, which comprises nearly 7.5 million plant images from iNaturalist spanning the North American West Coast. Critically, DivShift-NAWC partitions these images along five axes of documented bias:

    Spatial: Urban vs. wilderness areas using the Human Footprint Index
    Temporal: Year-round collection vs. episodic events (e.g., bioblitzes) using the City Nature Challenge
    Taxonomic: Long-tailed vs. balanced training data
    Observer engagement: Engaged vs. casual observers using observation quantities
    Sociopolitical: State boundaries

This careful partitioning allows a detailed exploration of biases across key ecological and societal dimensions.
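As a rough illustration of the spatial axis, one might split observation records on a Human Footprint Index threshold. The record fields and the cutoff value below are hypothetical, not the dataset's actual schema or thresholds:

```python
# Illustrative partitioning of observation records along the spatial axis.
# Field names and the HFI cutoff are hypothetical, not the paper's exact values.
observations = [
    {"species": "Quercus agrifolia",        "hfi": 38.0},  # human-modified area
    {"species": "Pinus contorta",           "hfi": 1.5},   # wilderness
    {"species": "Eschscholzia californica", "hfi": 22.4},  # human-modified area
]

HFI_URBAN_THRESHOLD = 4.0  # illustrative cutoff between wilderness and modified land

# P_A: urban/human-modified partition; P_B: wilderness partition
p_a = [o for o in observations if o["hfi"] >= HFI_URBAN_THRESHOLD]
p_b = [o for o in observations if o["hfi"] < HFI_URBAN_THRESHOLD]

print(len(p_a), len(p_b))  # 2 1
```

The other four axes follow the same pattern, swapping the predicate: observation date for the temporal axis, species frequency for the taxonomic axis, per-observer counts for engagement, and state membership for the sociopolitical axis.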

Key insights from our case study

We apply the DivShift framework to the DivShift-NAWC dataset for fine-grained visual classification of plant species. We run supervised learning on a ResNet-18 with standard training parameters and measure Top-1 accuracy for eight different categories.
(a) Performance skews to inhabited areas, but (b) the wildest species are left behind.
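Top-1 accuracy, the metric used above, can be computed directly from model logits. This is a generic NumPy sketch, not the authors' evaluation code:

```python
import numpy as np

def top1_accuracy(logits, labels):
    """Fraction of samples whose highest-scoring class matches the true label."""
    preds = np.argmax(logits, axis=1)
    return float(np.mean(preds == np.asarray(labels)))

# Three samples over four hypothetical plant classes; two predicted correctly.
logits = np.array([
    [2.0, 0.1, 0.3, 0.1],  # argmax 0, label 0 -> correct
    [0.2, 0.1, 1.5, 0.4],  # argmax 2, label 2 -> correct
    [0.9, 0.8, 0.1, 0.2],  # argmax 0, label 3 -> wrong
])
print(top1_accuracy(logits, [0, 2, 3]))  # 0.666...
```

Computing this metric separately on each partition's test split (e.g., urban vs. wilderness) is what exposes the per-bias performance gaps reported below.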

Spatial Bias: Models performed best in urban, densely-observed environments but struggled significantly in wilderness areas due to lack of training data.

Temporal Bias: Continuous year-round observations led to stronger model performance than data from focused, short-term bioblitzes.

Taxonomic Bias: Balancing datasets improved detection of rare species but degraded performance for common species, highlighting a trade-off depending on conservation goals.

Observer Engagement: Data from highly engaged, frequent observers increased model accuracy. Conversely, contributions from occasional users negatively affected performance.

Sociopolitical Bias: Geographic and ecological distance between states correlated with accuracy. However, the strength of this correlation varied with the training state and its data density.

Recommendations

Our findings translate into recommendations for training biodiversity models:

    Supplement citizen science data in wilderness areas
    Encourage sustained, year-round data collection over episodic events
    Choose data curation strategies according to conservation goals
    Prioritize contributions from highly engaged observers
    Apply additional validation in regions with sparse data

Limitations, future directions, and conclusion

DivShift-NAWC does not capture all possible biases (such as fine-scale geographic or intersectional biases), and it does not dive deeply into confounding variables within the bias partitions. Future work aims to incorporate unsupervised learning and additional environmental metadata to further refine bias detection and mitigation strategies, as well as to address label availability and quality.

With biodiversity loss accelerating globally, accurate and equitable AI tools are urgently needed. Our DivShift framework offers a practical way to understand and counteract biases in volunteer-collected biodiversity datasets, enabling more reliable ecological models. We believe these insights and tools will help maximize the enormous potential of citizen science and machine learning in biodiversity conservation.

AAAI proceedings | Extended paper | Dataset | Code


This work won the AAAI-25 outstanding paper award – AI for social alignment track.
