Society's Backend, February 6
How AI companies get around data regulation

As global AI competition intensifies, data localization regulations are multiplying, restricting the cross-border use of personal data. These regulations pose a challenge for AI companies training models, especially when restricted data from multiple regions is needed. Federated learning offers a solution: it allows models to be trained without data ever leaving its source, working legally within data localization rules. By sending a model to multiple machines for local training and then sending the model updates to a central node for reconciliation, federated learning enables cross-region model training while preserving data privacy. It comes with challenges around latency, bandwidth, and data quality, but as data privacy regulation tightens, its advantages will become increasingly clear.

🛡️ Data localization regulations aim to protect citizens' data from misuse by foreign entities and to improve a country's position in the global AI race. They have a very real impact on how multinational companies use data and train models.

🌍 Companies that want to train models on personal data across borders must comply with data localization regulations or face legal risk. Large companies can work within these rules by training models in regions where the data can be legally accessed, but smaller companies may not be able to afford compliance, limiting their ability to offer services in certain regions.

🤝 Federated learning lets multiple machines train a model on their local data without that data ever leaving its source. Model updates are sent to a central node for reconciliation, enabling cross-region training while complying with data localization laws. This is especially useful for sensitive data such as medical records and intellectual property.

⚠️ Federated learning is not without drawbacks: it depends on the latency, bandwidth, and availability of many machines, adds infrastructure complexity, and can introduce data quality issues. Still, as AI and data privacy regulation tightens, its advantages will become increasingly prominent.

Many AI applications rely on personalization to be useful. With rising AI competition around the globe, we’ve seen many countries enact data localization privacy regulations. These regulations restrict access to the personal information of a country’s citizens, either by requiring that data to be stored geographically within the country’s borders or by imposing strict limits on data exportation.

Notable regions and countries that have enacted policies like this include the European Union, China, Russia, and India, all of which have a massive potential user base for AI products and can be large players in AI at the global scale.

We’re going to go over why a country enacts this regulation, what it means for companies creating AI products, and what federated machine learning is and how it circumvents regulatory restrictions legally.

Why this regulation?

There are two primary reasons countries enact this type of regulation:

    To protect their citizens’ data from being misused by a foreign entity. Once a person’s data has geographically left the country, it’s impossible to ensure its safety.

    To improve their position in the global AI race. Personal data is very valuable for training AI models and countries that can keep their competitors from having access to their citizens’ data are at an advantage.

This regulation has a very real two-fold impact: one fold affects companies and the other affects the citizens within those countries.

What does this regulation impact?

Companies wanting to train models on personal data across borders where localization laws are in play can’t just download that data, transfer it to their data center, and start training with the rest of their data; that would be illegal. This is especially tricky when data from multiple areas with localization laws is needed. This creates situations where training is either logistically complex or seemingly impossible.

Large companies with data centers in many regions can more easily get around these regulations (legally, of course) by training models in regions where the data can be legally accessed. Larger companies also tend to have enough data to fully train models without relying on data across multiple regions. Even in these cases, a great deal of money, time, and care is spent by these companies ensuring regulatory compliance.

Unfortunately, the same can’t be said for startups. Without the time and resources to ensure compliance, startups are forced to rely on a larger company’s infrastructure (think Google, Amazon, or Microsoft’s cloud offerings) to train their models. Even with the proper infrastructure, smaller companies may not have the data necessary to train in regions with data localization.

This means startups may not be able to offer their services or products to users in those regions. This is the second fold of the impact of localization laws: users miss out on AI products or get them on a delayed schedule while compliance is ensured. This is something we’ve seen occur many times in the European Union.

What is federated machine learning and how does it help?

There is a possible solution: federated machine learning. Many people are familiar with federation due to federated social media platforms and understand it as another term for distribution. It’s important to note that when it comes to machine learning, federated training isn’t the same as distributed training. It’s a subset of it.

Distributed training is defined as training models across multiple machines. This is the norm for modern AI, as large-scale models require numerous chips split across multiple machines to train. A data center is really just a building housing many computers connected together on a local network. LLMs and other modern machine learning applications use data centers for fast, efficient training.

Federated training takes place across multiple machines but instead of being defined by the location of training, it’s defined by the location of data. An overview of federated learning looks something like this:

    A model is sent to multiple machines across the internet.

    The model is trained on the separate machines using the data on that machine.

    When training is done, the model updates are sent to a central node.

    Model updates are reconciled across all machines on that central node.

    The model is updated and the updated model is sent back to the machines to continue training.

Instead of training across machines on the same local network, federated learning leverages training machines across the internet to train a single model on sensitive data without that data having to leave the premises. Only the model updates are sent back to the central server for reconciliation.
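
To make this concrete, here is a minimal sketch of the client side of one federated round, using a toy linear-regression model and hypothetical local datasets. The function names, model, and data shapes are illustrative only, not any particular framework’s API; each machine trains on its own data and sends back only a weight update plus the number of examples it used.

```python
import numpy as np

def local_train(weights, X, y, lr=0.1, epochs=5):
    """Train a toy linear-regression model on one machine's local data.

    Only the resulting weight update (delta) and a sample count leave the
    machine; the raw data in X and y never does.
    """
    w = weights.copy()
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)   # mean-squared-error gradient
        w -= lr * grad
    return w - weights, len(y)                  # (update, local sample count)

# Hypothetical local datasets held by three machines in different regions.
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
clients = []
for n in (50, 80, 30):
    X = rng.normal(size=(n, 2))
    y = X @ true_w + rng.normal(scale=0.1, size=n)
    clients.append((X, y))

global_w = np.zeros(2)   # model held by the central node

# Local-training portion of one federated round. In a real deployment these
# updates would be sent over the network to the central node for reconciliation.
updates = [local_train(global_w, X, y) for X, y in clients]
```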

This reconciliation process generally uses federated averaging. Simply put, this updates the central model based on the average of the separate model updates. More sophisticated algorithms would weight the different training jobs based on their quality and relevance to the training task.
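
Here is a hedged sketch of what that federated-averaging step could look like: each client’s update is weighted by how many local examples produced it, which is the basic FedAvg idea. The specific updates and sample counts below are made up for illustration.

```python
import numpy as np

def federated_average(global_w, updates):
    """FedAvg-style reconciliation: weight each client's update by its
    local sample count, then apply the combined update to the global model."""
    total = sum(n for _, n in updates)
    combined = sum(n * delta for delta, n in updates) / total
    return global_w + combined

# Hypothetical (update, sample count) pairs returned by three clients.
global_w = np.array([0.5, -0.2])
updates = [
    (np.array([0.10, 0.05]), 50),
    (np.array([0.06, 0.02]), 80),
    (np.array([0.20, 0.08]), 30),
]
print(federated_average(global_w, updates))
```

In practice this loop repeats: the reconciled model is sent back out to the clients for another round of local training, as described in the steps above.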

The most important takeaway about federated learning is that training data never leaves its origin. This allows models to be trained across regions while complying with data localization laws. This is especially useful for sensitive data such as medical records, intellectual property, or anything with personally identifiable information. Another benefit of federated training is that it reduces the risk of the large-scale data breach that comes with hosting a large dataset on a single machine.

Of course, federated learning doesn’t come without its compromises. It:

    Depends on the latency, bandwidth, and availability of many machines spread across the internet.

    Adds infrastructure complexity compared to training inside a single data center.

    Can introduce data quality issues, since the central node never sees the raw data the model is trained on.

As we see more AI and data privacy regulation introduced, the benefits of federated training will likely begin to outweigh the compromises.

Let me know what you think and feel free to ask any questions in the comments.


Always be (machine) learning,

Logan
