Dan Rose AI | Applied AI Blog | November 26, 2024
What is data operations (DataOps)?

This article explores the importance of data operations for AI, emphasizing its central role in AI solutions. Data operations covers obtaining, cleaning, storing and delivering data, and is the underlying supply chain for many big data and AI solutions. The author argues that, compared with algorithm development, data operations is the key to competitive advantage in AI: high-quality data significantly improves AI performance, while low-cost data acquisition reduces the burden on the business. Using Paperflow as an example, the article discusses the practical challenges and solutions of data operations, including data collection strategy, cost control, data quality assurance and DevOps challenges, and recommends involving DevOps early in an AI project to build a scalable system architecture.

🤔**Data operations is the key competitive advantage in AI:** The author argues that data operations matters more than algorithm development when competing in AI, because most AI algorithms are fairly standard, while high-quality data significantly improves AI performance and lowers business costs.

💰**Controlling data acquisition costs is critical:** Especially in AI applications that need continuously updated data, acquiring new data cheaply is key. The Paperflow case illustrates the cost difference between in-house and customer-driven data collection, and how optimizing the collection process and systems can bring costs down.

🔍**Data quality assurance and targeted collection:** By collecting data in-house, Paperflow ensured data quality, and by optimizing its systems it made data entry faster. Focusing collection on data the AI is uncertain about made the process more targeted and reduced the risk of errors.

🚀**The importance of DevOps in data operations:** Rapid data growth can strain a system, so involving DevOps early in an AI project to build a scalable architecture is essential to avoid scaling bottlenecks later.

💡**Applying active learning:** Collecting the data on which the AI's predictions are most uncertain improves data quality and lowers collection costs, a strategy worth borrowing.

When I write about AI I very often refer to data operations and how important a foundation it is for most AI solutions. Without proper data operations you can easily reach a point where handling the necessary data becomes too difficult and costly for the AI business case to make sense. So to clarify, I want to give you some insight into what it really means.

Data operations is the process of obtaining, cleaning, storing and delivering data in a secure and cost-effective manner. It's a mix of business strategy, DevOps and data science, and it is the underlying supply chain for many big data and AI solutions.

The term data operations was originally coined in the big data space but has become more broadly used in recent years.

Data operations is the most important competitive advantage

As I have mentioned in many previous posts, I see data operations as a higher priority than algorithm development when it comes to beating the competition. In most AI cases, the algorithms used are standard ones from standard frameworks that are fed data, trained and tuned a little before being deployed. Since the underlying algorithms are largely the same, the real difference is in the data. The work it takes to get good results from high-quality data is almost nothing compared to the work required with mediocre data. Getting data at a lower cost than the competition is also a really important factor, especially in AI cases that require a continuous flow of new data. In those cases, constantly acquiring new data can become an economic burden that weighs down the business.

Data operations Paperflow example

To make it more concrete, I want to use the AI company I co-founded, Paperflow, as an example. Paperflow is an AI company that receives invoices and other financial documents and captures data such as the invoice date, amounts and invoice lines. Since invoices can look very different and their layouts change over time, we need a lot of data, and we need new data all the time. So to make Paperflow a good business, we needed good data operations.

To be honest, we weren't that aware of the importance when we made these initial decisions, but luckily we got it right. Our first major data operations decision was to collect all data in-house and build our own system for collecting it. That's a costly choice, with both a high upfront investment in system development and a high recurring cost for the employees whose job is to enter data from invoices into the system. The competition had chosen another strategy: they had their customers enter the invoice data whenever their AI failed to make the right prediction. That's a much cheaper approach that can provide a lot of data. The only problem is that customers have just one thing in mind, solving their own problem, regardless of whether the result is correct as training data.

So at Paperflow we found a way to get better data. But how do you get the costs down?

Part of the solution was investing heavily in the system used for entering data, trying to make it as fast to use as possible. It was real trial and error, and it took a lot of work. Without having the actual numbers, I would guess we invested more in the data operations systems than in the AI itself.

Another part of the solution was making sure we only collected the data we actually needed. This is a common challenge in data operations, since it's very difficult to know what data you are going to need in the future. Our solution was to start by collecting a lot of data (too much, in fact) and then slowly narrow down what we collected. Going the other way is difficult: if we had suddenly started collecting more data on each invoice, we would essentially have had to start over and discard all previously validated invoices.

We also worked a lot on understanding a very important metric: when is an AI guess so reliable that we can trust it and skip validating that part of the data? We achieved that with a variety of tricks and technologies, one of them being probabilistic programming. Probabilistic programming has the advantage of delivering an uncertainty distribution instead of the single probability score most machine learning algorithms produce. Knowing how sure you are significantly lowers the risk of making mistakes.
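As a rough illustration (not Paperflow's actual system), an auto-accept rule based on an uncertainty distribution could be sketched like this. Here `posterior_samples` stands in for draws from a probabilistic model's distribution over per-field accuracy; the names and thresholds are made up for the example:

```python
def should_auto_accept(posterior_samples, threshold=0.98, min_mass=0.95):
    """Auto-accept a predicted field only if at least `min_mass` of the
    posterior probability mass lies at or above the accuracy `threshold`.

    A single point estimate (e.g. "99% confident") can hide how spread
    out the model's belief is; requiring most of the *distribution* to
    sit above the threshold is a stricter, safer rule.
    """
    mass_above = sum(1 for p in posterior_samples if p >= threshold)
    return mass_above / len(posterior_samples) >= min_mass

# Two models with a similar average confidence but different spread:
confident = [0.990, 0.992, 0.995, 0.993, 0.991] * 20  # tightly clustered
uncertain = [0.990, 0.950, 0.900, 0.850, 0.700] * 20  # widely spread

print(should_auto_accept(confident))  # accepted: nearly all mass >= 0.98
print(should_auto_accept(uncertain))  # rejected: send to human validation
```

The point of the sketch is only the shape of the decision: with a full distribution you can ask "how much of my belief sits above the bar?" rather than trusting a single score.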

The strategy of collecting only the data you need most, by choosing the cases where your AI is most uncertain, is also known as active learning. If you are working on data operations for AI, you should definitely look into it.
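A minimal sketch of the simplest active learning strategy, uncertainty sampling: rank unlabeled documents by the model's top class probability and send the least confident ones to human validation first. The document names and probabilities below are invented for the example:

```python
def pick_for_validation(predictions, k):
    """Uncertainty sampling: select the k documents whose highest class
    probability is lowest, i.e. where the model is least confident."""
    ranked = sorted(predictions.items(), key=lambda item: max(item[1]))
    return [doc_id for doc_id, _ in ranked[:k]]

# Per-document class probabilities from some classifier (hypothetical):
preds = {
    "invoice_a": [0.98, 0.01, 0.01],  # confident -> safe to auto-accept
    "invoice_b": [0.40, 0.35, 0.25],  # very unsure -> validate first
    "invoice_c": [0.70, 0.20, 0.10],
}

print(pick_for_validation(preds, 1))  # ['invoice_b']
```

Each human label then goes where it teaches the model the most, which is exactly how this strategy lowers collection costs.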

DevOps data operation challenges

On the more tech-heavy side, storing data effectively brings its own challenges. I'm not a DevOps expert, but I have seen in real life the problem of data suddenly growing faster than expected. That can be critical, because the system's ability to scale quickly comes under pressure. If I could give one piece of advice here, it would be to involve a DevOps engineer early in the architecture work. Building on a scalable foundation is much more fun than constantly hunting for short-term fixes.
