Dan Rose AI | Applied AI Blog, 26 November 2024
What is synthetic data for artificial intelligence?

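This article explores the enormous potential of synthetic data in AI: it can improve privacy, reduce bias and increase model accuracy. It introduces the types of synthetic data, their use cases, and their advantages and disadvantages.

💻 Synthetic data does not come from actual observation; it is created by humans or algorithms, with the goal of representing the world in which the AI is supposed to function.

📜 Synthetic text can be used for language and text AI, with a language model generating text that resembles real-world text.

🖼 Synthetic images can be generated from text prompts, for example with DALL-E Mini, and can be used to train AI models.

📊 Synthetic tabular data is popular in healthcare, where it can address data issues and protect privacy.

🚗 Synthetic models of the world can be used to experiment with AI solutions, such as in the development of self-driving cars.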

This article is an excerpt from my forthcoming book, which you can sign up for here: https://www.danrose.ai/book.

Synthetic data in AI is probably the subject I think about the most at the moment. It has enormous potential to improve privacy, lower bias and improve model accuracy all at once, in what could be a giant technological leap in the coming years. Gartner even stated, "By 2024, 60% of the data used for the development of AI and analytics projects will be synthetically generated." That is a game-changer considering that many people working with AI today haven't even started to adopt this technology.

Synthetic data is data that does not come from actual observations of the world. It is fake data, created either by humans or by algorithms. It is created artificially, but the goal is the same as for real data: to represent the world in which the AI is supposed to function. Even so, training data that accurately represents the world is only a means to an end. Ultimately, the goal of building AI is models that predict accurately enough to provide a good user experience.

Types of synthetic data

There are different approaches and use cases depending on the data type: text, images or tabular data.

Synthetic texts

For language and text AI, you can generate synthetic texts that look like those you would find in the real world. They might even look like gibberish to a human, but if they do the job of representing the world when used as training data, that's good enough.

I have implemented that approach before in a text classification case. I chose it because the data could only be stored for three months, which made it hard to keep up with seasonal signals. I fed the real data to a language model and fine-tuned it so that it could produce data similar to the real data. We could then generate unlimited data for each label, without personal data, to train the AI models.
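To make the idea of "learn from the real texts per label, then generate as many new ones as you like" concrete, here is a minimal sketch. It is not the setup described above: instead of a fine-tuned language model it uses a toy bigram (Markov chain) generator as a stand-in, and the labels and example sentences are hypothetical.

```python
import random
from collections import defaultdict

def train_bigram_model(texts):
    """Map each word to the list of words that followed it in the training texts."""
    transitions = defaultdict(list)
    for text in texts:
        words = ["<s>"] + text.split() + ["</s>"]
        for current, following in zip(words, words[1:]):
            transitions[current].append(following)
    return transitions

def generate_text(transitions, rng, max_words=20):
    """Walk the chain from the start marker until the end marker or the length cap."""
    word, output = "<s>", []
    while len(output) < max_words:
        word = rng.choice(transitions[word])
        if word == "</s>":
            break
        output.append(word)
    return " ".join(output)

# Hypothetical labelled examples standing in for the real, short-lived data.
real_data = {
    "billing": [
        "my invoice is wrong",
        "the invoice shows the wrong amount",
        "I was charged twice this month",
    ],
    "delivery": [
        "my parcel is late",
        "the parcel never arrived",
        "where is my order",
    ],
}

rng = random.Random(0)
models = {label: train_bigram_model(texts) for label, texts in real_data.items()}
# One model per label, so every generated text comes with its label for free.
synthetic = {
    label: [generate_text(model, rng) for _ in range(5)]
    for label, model in models.items()
}
```

A model this small will often reproduce the original sentences word for word, so on its own it is no privacy guarantee. The sketch only shows the overall shape: one generator per label, trained on real data and then sampled freely.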

Synthetic images

For images, it's possible to use a text-to-image model that creates synthetic images simply by being prompted with a text. The most famous example is OpenAI's DALL-E 2 model, which produces amazingly realistic pictures. An open-source model inspired by DALL-E, called DALL-E Mini and available on Hugging Face, can be tried for free here: https://huggingface.co/spaces/dalle-mini/dalle-mini. You can prompt the model with a short text like "squared strawberry", and you get nine attempts from the model to produce an image of a squared strawberry.

As the model is open-source, you can also download the model and use it for your projects.

The images produced by DALL-E Mini might not be photorealistic, but they are still good enough to train AI models.

You can try it yourself. Go to DALL-E Mini and prompt the model to make images of bananas and apples. Use sentences such as "Banana on table" or "Banana on random background". Do the same with apples until you have 30 or so images of each. You can now upload these images to Teachable Machine to make a banana vs apple recogniser. I promise it will work. If it does not impress you just a tiny bit that you can build AI to recognise objects from purely synthetic images, then I don't know what will.

The use cases here are many. You can synthetically create objects you expect to encounter but that are missing from the training data. You can also place ordinary objects on random backgrounds to make sure you cover unknown scenarios. That will also increase the quality of the models, as a change in environment will matter less.
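The random-background idea can be sketched without any image library. In this toy version an "image" is just a grid of grayscale values and the object is a hypothetical 3x3 plus sign. With real photos you would do the same compositing in an imaging library of your choice.

```python
import random

def composite_on_random_background(obj, height, width, rng):
    """Paste obj (2D list, None = transparent pixel) at a random spot on a noise background."""
    image = [[rng.randint(0, 255) for _ in range(width)] for _ in range(height)]
    top = rng.randint(0, height - len(obj))
    left = rng.randint(0, width - len(obj[0]))
    for r, row in enumerate(obj):
        for c, pixel in enumerate(row):
            if pixel is not None:
                image[top + r][left + c] = pixel
    return image, (top, left)

# Hypothetical 3x3 "object": a bright plus sign with transparent corners.
plus_sign = [
    [None, 255, None],
    [255, 255, 255],
    [None, 255, None],
]

rng = random.Random(1)
# 30 synthetic training images, each with the object's position as a free annotation.
dataset = [composite_on_random_background(plus_sign, 8, 8, rng) for _ in range(30)]
```

Because the generator decides where the object goes, every synthetic image comes with its position already known, so no manual annotation is needed.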

Synthetic tabular data

Tabular data can also be generated synthetically. That is popular in healthcare, as healthcare is very vulnerable to data issues. Besides the endless combinations of scenarios with different diseases and medicines interacting, there's also the privacy issue. One patient's history of diagnoses and medication can be so unique that it identifies the individual. Generating synthetic versions of the actual data can extend it to cover rare scenarios better and anonymise it at the same time. That makes the data easier to share between researchers and medical experts.
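As a rough illustration of what "a synthetic version of a table" can mean, the sketch below uses the simplest possible generator: it samples every column independently from the values observed in the real data. The column names and values are placeholders, not real patient data, and serious tools model the relationships between columns instead of ignoring them.

```python
import random

# Hypothetical records; column names and values are placeholders.
real_rows = [
    {"age_band": "30-39", "diagnosis": "condition_A", "medication": "drug_X"},
    {"age_band": "30-39", "diagnosis": "condition_A", "medication": "drug_Y"},
    {"age_band": "60-69", "diagnosis": "condition_B", "medication": "drug_X"},
    {"age_band": "70-79", "diagnosis": "condition_C", "medication": "drug_Z"},
]

def fit_marginals(rows):
    """Collect the observed values of each column, keeping duplicates so frequencies are preserved."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns

def sample_rows(columns, n, rng):
    """Build n rows by drawing each column independently from its observed values."""
    return [
        {name: rng.choice(values) for name, values in columns.items()}
        for _ in range(n)
    ]

columns = fit_marginals(real_rows)
synthetic_rows = sample_rows(columns, 100, random.Random(7))
```

Independent sampling shows the trade-off in miniature. It breaks the link between a row and a single patient, but it also throws away correlations such as which medication goes with which diagnosis, and a synthetic row can still match a real one by chance.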

Models of the world

With synthetic models of the world, we can also experiment with AI solutions before we release them and teach them to become better at a fraction of the cost. Self-driving cars are a perfect use case for this. They can be developed faster and more safely by building a synthetic model of the world that is close to the real one, with physics and random scenarios. Many companies building self-driving cars today use models built in the Unity engine, which was originally intended for computer game development. In a virtual world, cars can try, crash and improve millions of times with no humans at risk before being released.
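The try, crash and improve loop can be shown with a deliberately tiny "world". The sketch below is a toy under stated assumptions, not how a real driving simulator works: the only physics is a textbook stopping-distance formula, and the deceleration, reaction time, obstacle range and 1% crash-rate target are made-up numbers. The only thing the "car" can improve is its cruising speed.

```python
import random

DECELERATION = 8.0   # m/s^2, assumed braking capability
REACTION_TIME = 0.5  # seconds, assumed delay before braking starts

def crashes(speed, obstacle_distance):
    """True if the car cannot come to a stop before reaching the obstacle."""
    stopping_distance = speed * REACTION_TIME + speed ** 2 / (2 * DECELERATION)
    return stopping_distance > obstacle_distance

def crash_rate(speed, rng, trials=10_000):
    """Fraction of random scenarios (obstacle 10 to 60 m ahead) that end in a crash."""
    hits = sum(crashes(speed, rng.uniform(10.0, 60.0)) for _ in range(trials))
    return hits / trials

rng = random.Random(42)
speed = 30.0  # m/s, deliberately too fast to begin with
history = []
while True:
    rate = crash_rate(speed, rng)
    history.append((speed, rate))
    # Stop once crashes are rare enough, or if the speed gets implausibly low.
    if rate <= 0.01 or speed <= 1.0:
        break
    speed -= 1.0  # "improve": slow down and try the virtual world again
```

Each pass through the loop is 10,000 virtual crashes or near misses that cost nothing and endanger no one, which is the whole point of experimenting in a synthetic world first.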

The good and the bad of synthetic data

The benefits of applying synthetic data to your solutions are many. It can provide more data at a lower price to improve the accuracy of models. It can reduce bias by evening out the data, adding examples of otherwise rare features or labels whose scarcity would disadvantage some groups. It can improve the privacy of people whose personal data might be part of the training data. It can also let us test known and unknown scenarios.
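"Evening out the data" starts with knowing how uneven it is. As a small sketch, the helper below counts how many synthetic examples each label would need to catch up with the most common one. The function name and the labels are hypothetical, and matching the largest class is just one possible target.

```python
from collections import Counter

def synthetic_needed(labels):
    """Return how many synthetic examples each label needs to match the most common label."""
    counts = Counter(labels)
    target = max(counts.values())
    return {label: target - count for label, count in counts.items()}

# Hypothetical, heavily skewed label distribution.
labels = ["common"] * 90 + ["rare"] * 10
deficits = synthetic_needed(labels)
```

Here `deficits` tells you to generate 80 extra "rare" examples and none for "common".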

But is it all good? No. Synthetic data is not a silver bullet. It comes with the risk of adding to bias or moving the data further away from the world it is meant to represent. The challenge is that the cause of such bias is difficult to identify, as synthetic data is often used precisely where real data is in short supply and is therefore, by definition, hard to reality-check. Synthetic data is a promising solution to many problems, but use it with care. As very few people have experience with synthetic data in AI, we are still unaware of many of the challenges that await.

For more tips, sign up for the book here: https://www.danrose.ai/book.
