Unite.AI, March 2
Is There a Clear Solution to the Privacy Risks Posed by Generative AI?

The article examines the many privacy risks posed by generative AI, including data leakage and misuse of personal information, outlines several approaches to mitigating those risks, and stresses the importance of public awareness of data privacy.

Generative AI produces new content; its training involves vast amounts of data and carries privacy risks

Personal information can be leaked or misused during LLM pre-training and fine-tuning

During RLHF, user interactions can be viewed and shared, and users are often unaware of the risk

Data brokers make personal data easier to collect and disseminate, compounding the privacy risk

The privacy risks posed by generative AI are very real. From increased surveillance and exposure to more effective phishing and vishing campaigns than ever, generative AI erodes privacy en masse, indiscriminately, while providing bad actors, whether criminal, state-sponsored or government, with the tools they need to target individuals and groups.

The clearest solution to this problem involves consumers and users collectively turning their backs on AI hype, demanding transparency from those who develop or implement so-called AI features, and effective regulation from the government bodies that oversee their operations. Although worth striving for, this isn’t likely to happen anytime soon.

What remains are reasonable, even if necessarily incomplete, approaches to mitigating generative AI privacy risks. The long-term, sure-fire, yet boring prediction is that the more educated the public becomes about data privacy in general, the lesser the privacy risks posed by the mass adoption of generative AI.

Do We All Get the Concept of Generative AI Right?

The hype around AI is so ubiquitous that a survey of what people mean by generative AI is hardly necessary. Of course, none of these “AI” features, functionalities, and products actually represent examples of true artificial intelligence, whatever that would look like. Rather, they’re mostly examples of machine learning (ML), deep learning (DL), and large language models (LLMs).

Generative AI, as the name suggests, can generate new content – whether text (including programming languages), audio (including music and human-like voices), or videos (with sound, dialogue, cuts, and camera changes). All this is achieved by training LLMs to identify, match, and reproduce patterns in human-generated content.

Let’s take ChatGPT as an example. Like many LLMs, it’s trained in three broad stages:

1. Pre-training on a massive corpus of pre-gathered text
2. Supervised instruction fine-tuning on curated examples
3. Reinforcement learning from human feedback (RLHF), based on real user interactions

All three stages of the training process involve data, whether massive stores of pre-gathered data (like those used in pre-training) or data gathered and processed almost in real time (like that used in RLHF). It’s that data that carries the lion’s share of the privacy risks stemming from generative AI.
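
To make the role of data at each stage concrete, here is a minimal Python sketch. Everything in it (the ToyModel class, its methods, and the sample data) is a hypothetical placeholder rather than a real training pipeline; it only shows which kind of data each stage consumes.

```python
# Hypothetical sketch only: ToyModel and its methods are illustrative
# stand-ins, not a real training pipeline or any real library's API.

class ToyModel:
    """Stand-in for an LLM; real training is vastly more complex."""

    def update(self, document: str) -> None:
        pass  # stage 1: absorb general language patterns from raw text

    def update_towards(self, prompt: str, target: str) -> None:
        pass  # stage 2: learn to follow instructions from curated pairs

    def reinforce(self, conversation: str, rating: float) -> None:
        pass  # stage 3: adjust behaviour based on rated user conversations


def train(model: ToyModel) -> ToyModel:
    # Stage 1: pre-training on massive stores of pre-gathered text, which
    # may incidentally contain scraped personal data.
    for document in ["scraped web page ...", "public forum post ..."]:
        model.update(document)

    # Stage 2: supervised instruction fine-tuning on curated examples.
    for prompt, target in [("Summarise this article", "The article says ...")]:
        model.update_towards(prompt, target)

    # Stage 3: RLHF -- real user conversations are logged, rated, and fed
    # back into training; this is where users' own interactions become data.
    for conversation, rating in [("user: my home address is ...", 0.9)]:
        model.reinforce(conversation, rating)

    return model


if __name__ == "__main__":
    train(ToyModel())
```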

What Are the Privacy Risks Posed by Generative AI?

Privacy is compromised when personal information concerning an individual (the data subject) is made available to other individuals or entities without the data subject’s consent. LLMs are pre-trained and fine-tuned on an extremely wide range of data that can and often does include personal data. This data is typically scraped from publicly available sources, but not always.

Even when that data is taken from publicly available sources, having it aggregated and processed by an LLM and then essentially made searchable through the LLM’s interface could be argued to be a further violation of privacy.

The reinforcement learning from human feedback (RLHF) stage complicates things. At this training stage, real interactions with human users are used to iteratively correct and refine the LLM’s responses. This means that a user’s interactions with an LLM can be viewed, shared, and disseminated by anyone with access to the training data.

In most cases, this isn’t a privacy violation, given that most LLM developers include privacy policies and terms of service that require users to consent before interacting with the LLM. The privacy risk here lies rather in the fact that many users are not aware that they’ve agreed to such data collection and use. Such users are likely to reveal private and sensitive information during their interactions with these systems, not realizing that these interactions are neither confidential nor private.

In this way, we arrive at the three main ways in which generative AI poses privacy risks:

1. Personal data ending up in pre-training and fine-tuning datasets and resurfacing in model outputs
2. Publicly available personal data being aggregated, processed, and made effectively searchable through an LLM’s interface
3. Users’ own interactions being collected during RLHF and then viewed, shared, or disseminated without users fully realizing it

These are all risks to users’ privacy, but the chances of personally identifiable information (PII) ending up in the wrong hands still seem fairly low. That is, at least, until data brokers enter the picture. These companies specialize in sniffing out PII, then collecting, aggregating, and disseminating it, if not broadcasting it outright.

With PII and other personal data having become something of a commodity and the data-broker industry springing up to profit from this, any personal data that gets “out there” is all too likely to be scooped up by data brokers and spread far and wide.

The Privacy Risks of Generative AI in Context

Before looking at the risks generative AI poses to users’ privacy in the context of specific products, services, and corporate partnerships, let’s step back and take a more structured look at the full palette of generative AI risks. Writing for the IAPP, Moraes and Previtali took a data-driven approach to refining Solove’s 2006 “A Taxonomy of Privacy”, reducing the 16 privacy risks described therein to 12 AI-specific privacy risks.

These are the 12 privacy risks included in Moraes and Previtali’s revised taxonomy:

This makes for some fairly alarming reading. It’s important to note that this taxonomy, to its credit, takes into account generative AI’s tendency to hallucinate – to generate and confidently present factually inaccurate information. This phenomenon, even though it rarely reveals real information, is also a privacy risk. The dissemination of false and misleading information affects the subject’s privacy in ways that are more subtle than in the case of accurate information, but it affects it nonetheless.

Let’s drill down to some concrete examples of how these privacy risks come into play in the context of actual AI products.

Direct Interactions with Text-Based Generative AI Systems

The simplest case is the one that involves a user interacting directly with a generative AI system, like ChatGPT, Midjourney, or Gemini. The user’s interactions with many of these products are logged, stored, and used for RLHF (reinforcement learning from human feedback), supervised instruction fine-tuning, and even the pre-training of other LLMs.

An analysis of the privacy policies of many services like these also reveals other data-sharing activities underpinned by very different purposes, like marketing and data brokerage. This is a whole other type of privacy risk posed by generative AI: these systems can be characterized as huge data funnels, collecting data provided by users as well as that which is generated through their interactions with the underlying LLM.

Interactions with Embedded Generative AI Systems

Some users might be interacting with generative AI interfaces that are embedded in whatever product they’re ostensibly using. The user may know that they’re using an “AI” feature, but they’re less likely to know what that entails in terms of data privacy risks. What comes to the fore with embedded systems is this lack of appreciation of the fact that personal data shared with the LLM could end up in the hands of developers and data brokers.

There are two degrees of unawareness here: some users realize they’re interacting with a generative AI product but not what that means for their data; others believe they’re simply using whatever product the generative AI is built into or accessed through. In either case, the user may well have technically consented, and probably did, to the terms and conditions associated with their interactions with the embedded system.

Other Partnerships That Expose Users to Generative AI Systems

Some companies embed or otherwise include generative AI interfaces in their software in ways that are less obvious, leaving users interacting – and sharing information – with third parties without realizing it. Luckily, “AI” has become such an effective selling point that it’s unlikely that a company would keep such implementations secret.

Another phenomenon in this context is the growing backlash that such companies have experienced after trying to share user or customer data with generative AI companies such as OpenAI. The data removal company Optery, for example, recently reversed a decision to share user data with OpenAI on an opt-out basis, meaning that users were enrolled in the program by default.

Not only were customers quick to voice their disappointment, but the company’s data-removal service was promptly delisted from Privacy Guides’ list of recommended data-removal services. To Optery’s credit, it quickly and transparently reversed its decision, but it’s the general backlash that’s significant here: people are starting to appreciate the risks of sharing data with “AI” companies.

The Optery case makes for a good example here because its users are, in some sense, at the vanguard of the growing skepticism surrounding so-called AI implementations. The kinds of people who opt for a data-removal service are also, typically, those who will pay attention to changes in terms of service and privacy policies.

Evidence of a Burgeoning Backlash Against Generative AI Data Use

Privacy-conscious consumers haven’t been the only ones to raise concerns about generative AI systems and their associated data privacy risks. At the legislative level, the EU’s Artificial Intelligence Act categorizes risks according to their severity, with data privacy being the explicitly or implicitly stated criterion for ascribing severity in most cases. The Act also addresses the issues of informed consent we discussed earlier.

The US, notoriously slow to adopt comprehensive, federal data privacy legislation, has at least some guardrails in place thanks to Executive Order 14110. Again, data privacy concerns are at the forefront of the purposes given for the Order: “irresponsible use [of AI technologies] could exacerbate societal harms such as fraud, discrimination, bias, and disinformation” – all related to the availability and dissemination of personal data.

Returning to the consumer level, it’s not just particularly privacy-conscious consumers that have balked at privacy-invasive generative AI implementations. Microsoft’s now-infamous “AI-powered” Recall feature, destined for its Windows 11 operating system, is a prime example. Once the extent of privacy and security risks was revealed, the backlash was enough to cause the tech giant to backpedal. Unfortunately, Microsoft seems not to have given up on the idea, but the initial public reaction is nonetheless heartening.

Staying with Microsoft, its Copilot program has been widely criticized for both data privacy and data security problems. As Copilot was trained on GitHub data (mostly source code), controversy also arose around Microsoft’s alleged violations of programmers’ and developers’ software licensing agreements. It’s in cases like this that the lines between data privacy and intellectual property rights begin to blur, granting the former a monetary value – something that’s not easily done.

Perhaps the greatest indication that AI is becoming a red flag in consumers’ eyes is the lukewarm, if not outright wary, public response to Apple’s initial AI launch, specifically with regard to its data-sharing agreements with OpenAI.

The Piecemeal Solutions

There are steps legislators, developers, and companies can take to ameliorate some of the risks posed by generative AI. These are specialized solutions to specific aspects of the overarching problem; no one of them is expected to be enough, but all of them, working together, could make a real difference.

All of these approaches to the problem are valid and necessary, but none is sufficient. They all require legislative support to come into meaningful effect, meaning that they’re doomed to be behind the times as this dynamic field continues to evolve.

The Clear Solution

The solution to the privacy risks posed by generative AI is neither revolutionary nor exciting, but taken to its logical conclusion, its results could be both. The clear solution involves everyday consumers becoming aware of the value of their data to companies and the pricelessness of data privacy to themselves.

Consumers are the sources and engines behind the private information that powers what’s called the modern surveillance economy. Once a critical mass of consumers starts to stem the flow of private data into the public sphere and starts demanding accountability from the companies that deal in personal data, the system will have to self-correct.

The encouraging thing about generative AI is that, unlike current advertising and marketing models, it need not involve personal information at any stage. Pre-training and fine-tuning data need not include PII or other personal data, and users need not expose such data during their interactions with generative AI systems.
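
To illustrate the first half of that claim, here is a minimal sketch of what scrubbing obvious identifiers from training documents could look like. The regex patterns, placeholder labels, and scrub function are assumptions made for illustration; production pipelines rely on far more sophisticated PII detection, such as trained named-entity recognizers.

```python
import re

# Illustrative only: crude regex-based scrubbing of obvious identifiers
# before text enters a pre-training or fine-tuning corpus.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def scrub(document: str) -> str:
    """Replace matches of simple PII patterns with neutral placeholders."""
    for label, pattern in PII_PATTERNS.items():
        document = pattern.sub(f"[{label}]", document)
    return document


if __name__ == "__main__":
    raw = "Contact Jane at jane.doe@example.com or +1 (555) 123-4567."
    print(scrub(raw))  # -> "Contact Jane at [EMAIL] or [PHONE]."
```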

To remove their personal information from training data, people can go right to the source and remove their profiles from the various data brokers (including people search sites) that aggregate public records, bringing them into circulation on the open market. Personal data removal services automate the process, making it quick and easy. Of course, removing personal data from these companies’ databases has many other benefits and no downsides.

People also generate personal data when interacting with software, including generative AI. To stem the flow of this data, users will have to be more mindful that their interactions are being recorded, reviewed, analyzed, and shared. Their options for avoiding this boil down to restricting what they reveal to online systems and using on-device, open-source LLMs wherever possible. People, on the whole, already do a good job of modulating what they discuss in public – we just need to extend these instincts into the realm of generative AI.
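
On the on-device point, a minimal sketch of running an open-weight model locally with the open-source Hugging Face transformers library is shown below; the specific model name is only an example, and any locally hosted model serves the same purpose, since prompts never leave the machine.

```python
# Minimal sketch, assuming the `transformers` library is installed and the
# example model is available; the model name is illustrative, not a recommendation.
from transformers import pipeline

# Weights are downloaded once; after that, generation runs entirely on local
# hardware -- no third-party prompt logging, nothing fed into someone else's RLHF.
generator = pipeline(
    "text-generation",
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # example open-weight model
)

prompt = "Draft a polite reply declining a meeting on Friday."
result = generator(prompt, max_new_tokens=80, do_sample=False)
print(result[0]["generated_text"])
```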

The post Is There a Clear Solution to the Privacy Risks Posed by Generative AI? appeared first on Unite.AI.
