Dan Rose AI | Applied AI Blog, 26 November 2024
Top 5 ways to make better AI with less data

 

This article explores five techniques and strategies in machine learning that reduce data requirements: transfer learning, active learning, better data quality, generative adversarial networks, and probabilistic programming. These methods cut the amount of data needed for training by reusing existing knowledge, optimising the data-collection strategy, raising data quality, or generating data with a model, thereby improving model efficiency and lowering cost. The article illustrates each method's strengths and applicable scope with concrete cases, such as using transfer learning to reduce the number of images needed to recognise dog breeds, focusing active learning on the samples a model struggles with, and using probabilistic programming to encode domain knowledge into a model.

🐶**Transfer learning**: Reuse a large pretrained neural network so that the general patterns it has already learned transfer to a new, specific problem, reducing the data the model must learn from scratch. For example, to recognise dog breeds, transfer learning adapts a model trained on general images to the breed-recognition task, cutting the number of images needed.

🔄**Active learning**: Label the data that would benefit model training the most, making data collection more efficient. For example, if a model struggles to identify German Shepherds, an active learning strategy automatically selects those samples for labelling, improving the model's recognition ability.

📈**Better data quality**: Reducing noise and wrong signals in the data lowers the amount of data needed for training. For example, make sure the training set contains only correctly labelled samples, and avoid wrong or ambiguous ones.

🤖**Generative Adversarial Networks (GANs)**: Generate new training data with a GAN to reduce the need for large labelled datasets. A GAN pairs a generator, which produces new data, with a discriminator, which judges whether data is real or fake; the two networks compete until the generator produces high-quality synthetic data.

📚**Probabilistic programming**: Encode domain knowledge into the model to reduce the data it must learn from scratch. For example, in document data extraction, probabilistic programming can tell the model to expect the value to the right of a keyword, cutting the training data required.

1. Transfer Learning

Transfer learning is used a lot in machine learning now since the benefits are big. The general idea is simple: you train a big neural network for a general purpose with a lot of data and a lot of training. When you then have a specific problem, you sort of “cut the end off” the big network and train a few new layers with your own data. The big network already understands a lot of general patterns, so with transfer learning you don't have to teach the network those patterns all over again.

A good example is training a network to recognize images of different dog breeds. Without transfer learning you need a lot of data, maybe 100,000 images of different breeds, since the network has to learn everything from scratch. If you train a new model with transfer learning, you might only need 50 images of every breed.
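To make the “cut the end off” idea concrete, here is a minimal sketch (not the author's code). A frozen random projection stands in for the pretrained backbone, and only a small new classification head is trained on a tiny synthetic dataset; all weights and data are illustrative stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pretrained backbone: frozen weights mapping raw
# inputs to general-purpose features. In a real setup this would be
# e.g. a large image model with its final layer removed; here it is
# a fixed random projection purely for illustration.
W_backbone = rng.normal(size=(8, 16)) / 3.0

def extract_features(x):
    # Frozen forward pass -- the backbone is never updated.
    return np.tanh(x @ W_backbone)

# Tiny labelled set for the new task (think: 50 images per breed).
X = rng.normal(size=(100, 8))
y = (X[:, 0] > 0).astype(float)          # toy binary "breed" label

# Only the small new head is trained -- far fewer parameters than
# the backbone, so far less data is needed.
w_head = np.zeros(16)
feats = extract_features(X)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-feats @ w_head))   # sigmoid
    w_head -= 0.5 * feats.T @ (p - y) / len(y)  # logistic-loss gradient step

accuracy = ((feats @ w_head > 0) == (y == 1)).mean()
```

In a real project the backbone would come from a model zoo (e.g. an ImageNet-pretrained network) rather than random weights, but the division of labour is the same: frozen general features, small trainable head.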

You can read more about Transfer Learning here.

2. Active Learning

Active learning is a data collection strategy that lets you pick the data your AI models would benefit the most from during training. Let's stick with the dog breed example. You have trained a model that can differentiate between breeds, but for some reason the model always has trouble identifying the German Shepherds. With an active learning strategy you would, automatically or at least through an established process, pick out these images and send them for labelling.
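The selection step above is often done by uncertainty sampling. Here is a minimal sketch under assumed synthetic model outputs: score an unlabelled pool by the model's confidence and send the least-confident examples for labelling:

```python
import numpy as np

rng = np.random.default_rng(1)

# Predicted class probabilities from the current model over an
# unlabelled pool (e.g. softmax outputs for 200 dog images across
# 5 breeds). Random stand-ins here.
pool_probs = rng.dirichlet(np.ones(5), size=200)

# Uncertainty sampling: the model is least sure where its top
# predicted probability is lowest (e.g. the images it keeps
# confusing with German Shepherds).
uncertainty = 1.0 - pool_probs.max(axis=1)

# Send the 10 most uncertain examples to human labellers.
to_label = np.argsort(uncertainty)[-10:]
```

Other selection criteria (entropy, margin between the top two classes, disagreement between committee models) plug into the same loop: predict on the pool, rank, label the top of the ranking, retrain.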

I made a longer post about how active learning works here.

3. Better Data

I’ve put in a strategy here that might sound obvious but is sometimes overlooked. With better quality data you often need far less data, since the AI does not have to train through the same amount of noise and wrong signals. In the media, AI is often talked about as “with a lot of data you can do anything”. But in many cases, making an extra effort to get rid of bad data and ensure that only correctly labeled data is used for training makes more sense than going for more data.
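One simple, mechanical cleaning step is removing duplicates and conflicting labels. This toy sketch (invented data, not from the article) keeps one copy of each consistently labelled example and drops anything labelled two different ways, a likely annotation error:

```python
from collections import defaultdict

# Toy labelled dataset: (features, label) pairs.
dataset = [
    (("floppy ears", "large"), "german_shepherd"),
    (("floppy ears", "large"), "german_shepherd"),   # exact duplicate
    (("pointy ears", "small"), "chihuahua"),
    (("pointy ears", "small"), "german_shepherd"),   # conflicting label: noise
]

# Group all labels seen for identical feature sets.
by_features = defaultdict(set)
for features, label in dataset:
    by_features[features].add(label)

# Keep one copy of each consistently labelled example; drop
# anything with contradictory labels.
cleaned = [(f, next(iter(labels)))
           for f, labels in by_features.items()
           if len(labels) == 1]
```

Real pipelines go further, e.g. flagging examples where a model trained on the rest of the data confidently disagrees with the given label, but even exact-match checks like this catch a surprising amount of noise.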

4. GANs

GANs, or Generative Adversarial Networks, are a way to build neural networks that sounds almost futuristic in its design. This kind of neural network is built by having two networks compete against each other in a game: one network creates new fake training examples from the data set, and the other tries to guess which data is fake and which is real. The network building fake data is called the generator, and the network trying to tell fake from real is called the discriminator. This is a deep learning approach, and both networks keep improving during the game. When the generator is so good at generating fake data that the discriminator consistently has trouble separating fake from real, we have a finished model.

For GANs you still need a lot of data, but you don’t need as much labelled data, and since the labelling is usually the costly part, you can save time and money on your data with this approach.
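The generator-versus-discriminator game can be shown at its absolute smallest: a toy 1-D GAN, with a two-parameter generator and a logistic-regression discriminator, written out with explicit gradients. This is a sketch of the training dynamic, not a practical GAN (real ones use deep networks and a framework's autograd):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Real (unlabelled!) data: samples from the distribution we want to
# imitate -- a 1-D Gaussian standing in for e.g. images.
real = rng.normal(loc=3.0, scale=1.0, size=256)

mu, sigma = 0.0, 1.0   # generator: g(z) = mu + sigma * z
a, b = 0.0, 0.0        # discriminator: D(x) = sigmoid(a * x + b)
lr = 0.05

for step in range(3000):
    z = rng.normal(size=256)
    fake = mu + sigma * z

    # --- Discriminator step: learn to tell real from fake ---
    d_real, d_fake = sigmoid(a * real + b), sigmoid(a * fake + b)
    a -= lr * (-np.mean((1 - d_real) * real) + np.mean(d_fake * fake))
    b -= lr * (-np.mean(1 - d_real) + np.mean(d_fake))

    # --- Generator step: adjust mu, sigma to fool the discriminator ---
    d_fake = sigmoid(a * fake + b)
    mu -= lr * (-np.mean(1 - d_fake) * a)
    sigma -= lr * (-np.mean((1 - d_fake) * z) * a)
```

As the game runs, `mu` is pushed toward the real data's mean: the discriminator only needs the unlabelled real samples, which is why GAN-style data generation sidesteps most of the labelling cost.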

5. Probabilistic Programming

One of my very favorite technologies. Probabilistic programming has a lot of benefits, and one of them is that you can often get away with using less data. The reason is simply that you build “priors” into your models. That means you can code your domain knowledge into the model and let the data take it from there. In many other machine learning approaches, everything has to be learned by the model from scratch, no matter how obvious it is.

A good example here is document data capture models. In many cases, the data we are looking for is made obvious by the keyword to the left of it; “ID number: #number#” is a common format. With probabilistic programming you can tell the model before training that you expect the data to be to the right of the keyword. Many neural networks have to learn this from scratch, which requires more data.
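The prior-plus-data mechanism can be sketched with plain Bayes' rule by enumeration. The hypotheses, prior, and likelihood values below are invented for illustration: a strong prior that the value sits to the right of the keyword combines with weak evidence (e.g. "the token at that position looks digit-like"), so little data is needed:

```python
# Hypotheses: where does the value sit relative to "ID number:"?
positions = ["left", "right", "below"]

# Prior encoding domain knowledge: the value is almost always
# directly to the RIGHT of the keyword.
prior = {"left": 0.05, "right": 0.90, "below": 0.05}

# Likelihood of the observed (noisy) evidence under each hypothesis,
# e.g. how digit-like the token at that position looks. Toy numbers.
likelihood = {"left": 0.2, "right": 0.6, "below": 0.3}

# Bayes' rule by enumeration: posterior is proportional to
# prior times likelihood.
unnorm = {p: prior[p] * likelihood[p] for p in positions}
total = sum(unnorm.values())
posterior = {p: v / total for p, v in unnorm.items()}
```

Even with mediocre evidence, the posterior concentrates on "right" because the prior carries most of the weight. Real probabilistic programming frameworks (e.g. PyMC, Stan, Pyro) automate exactly this prior-times-likelihood computation for much richer models.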

You can also read more about probabilistic programming here: https://www.danrose.ai/blog/63qy8s3vwq8p9mogsblddlti4yojon

