When I use GenAI products like ChatGPT, can they collect my data?

Probably yes.

When you use GenAI products, you generate data that companies can use to further enhance their products. This is a longstanding practice with most tech products – tech companies often design their products to maximize the amount and variety of your data that they can collect because the business insights these data can produce are incredibly valuable. The data broker industry that trades in user data is valued at roughly $260 billion.

This industry-wide thirst for your data has created the “Big Data” phenomenon of the past decade. In turn, this Big Data forms the core of GenAI: GenAI products are the latest application of advanced machine learning techniques for processing vast amounts of data to train algorithms to produce novel outputs. This means that the data privacy issues raised by GenAI products aren’t new; they echo the issues raised by earlier machine learning technologies, and they compound past failures to pass meaningful data regulations that could have protected your personal data long before GenAI products existed.

But what does “your data” mean in this context? There are a lot of different types of information about you that GenAI products collect and process when you interact with them, and the specifics depend on which product you are using. Generally, there are two buckets of “user data”:

  • Inputs: Data that you input into the product, including the content of your written prompts, uploaded images, voice recordings, or other inputs. For example, if you input the prompt, “If my period is lasting longer than usual, and I am a 45 year old woman, am I perimenopausal? What are the signs of perimenopause in women with endometriosis?” into ChatGPT, your data includes personal information about your health, your gender identity, and your age. If you input your selfies into a text-to-image product like Stable Diffusion, your data includes any information that the product can ingest about your facial identity, as well as objects, people, or places discernible in the background. Inputs like selfies and facial images also include biometric data, or information about people’s unique biological features (here, facial features).
  • Metadata: Data that captures how you access and use the product, including things like the type of device you are using, your geolocation data, the name and/or phone number you used to create an account, and any transaction history and/or account payment information if you purchase a subscription or upgrades to the product. In addition to the data that companies collect directly through their own products, they can also collect your data when you interact with a separate website that has integrated their GenAI product plug-in. If you have a ChatGPT subscription and use Expedia or Instacart, for example, you can use each website’s individual chat feature to plan your next trip or generate new recipes. Websites with these plug-ins send data to OpenAI about how you interact with the product, which OpenAI can use to train ChatGPT further to improve the relevance of its outputs.

Because we lack comprehensive federal data protection laws in the U.S. that constrain companies’ ability to collect your data, companies independently decide how much of your data they will collect and for which purposes. We rely on them to disclose their user data practices in privacy policies that can be ambiguous and even contradictory, but that also give companies legal cover when users “consent” to privacy policies in order to access and use GenAI products.

For example, to use ChatGPT, users agree to OpenAI’s privacy policy, which lists a variety of personal information OpenAI collects from users, including their name, contact details, device identifiers, geolocation data, and more. The policy explains that OpenAI can disclose this information “to [their] affiliates, vendors and service providers, law enforcement, and parties involved in [business] Transactions,” without providing more details about which parties it does business with, who its affiliates are, and under what circumstances it interacts with law enforcement.

Similarly, Lensa app developer Prisma Labs explains in its privacy policy that it collects personal data directly from users as inputs into Lensa, as metadata about how users are interacting with the app (including device information, app and country information, in-app purchases, and app usage), and from “third parties, for example, our service providers, partners and vendors.” When you use Lensa, Prisma collects the photos and/or videos of yourself that you upload to the app, information regarding your gender identity, any account information (for example, if you link your Apple ID or Google Account email address to Lensa), and any contact information you provide to the company (including your name, postal address, email addresses, and phone numbers). Even if you delete your account, Prisma may still share your personal data with third parties, including cloud providers Amazon Web Services and Google Cloud Platform and email delivery providers; the personal data Prisma can disclose to them may include your facial images and image data, defined as including “information about your facial position, orientation and topology”.

However, a handful of recent lawsuits challenge these companies’ ability to collect data while you use GenAI products. P.M. v. OpenAI and A.T. v. OpenAI, for example, both accuse OpenAI of violating wiretapping and anti-hacking laws. They argue that OpenAI effectively hacks users’ platforms, exceeding the access that users authorized for OpenAI when using ChatGPT, and intercepts their private information (i.e., user data) without their knowledge or consent.

These claims are likely to fail. Although the plaintiffs claim that OpenAI is unlawfully intercepting these communications by collecting them without users’ consent, ChatGPT users do technically consent to OpenAI’s Privacy Policy and data collection practices when they use ChatGPT, and OpenAI’s Policy is explicit about the types of user data the company may collect when users interact with its product.

While the plaintiffs claim not to have received notice from third-party websites that have integrated ChatGPT plugins, these websites generally have their own privacy policies that explicitly describe the companies’ ability to share user data with “business partners” and “service providers,” categories that would presumably include OpenAI.

For example, Kayak’s privacy policy states explicitly that the company shares user data with “other business partners,” in order to provide travel booking services; this is written broadly enough to encompass partners like OpenAI, whose ChatGPT plugin feature presumably helps Kayak provide travel booking services to users.

GenAI companies like OpenAI are a far cry from traditional wiretapping violators who intercept private communications like phone calls without the consent of one or both speakers, nor are they akin to computer hackers (or free culture activists with technical chops) who target rich databases prone to compromise. In order to use ChatGPT and other GenAI products, users constructively consent to the collection and sharing of their data by clicking “I Accept” to the terms of use and privacy policies unilaterally drafted by providers. Although these notice-and-consent regimes largely fail to protect user data, they have been accepted by courts as legally valid and enforceable.

The Special Case of Biometric Data

Biometric data may be harder for companies to collect, because doing so lawfully requires complying with biometric privacy laws. In Flora v. Prisma Labs and P.M. v. OpenAI, the user-plaintiffs allege both companies violated several requirements under the Illinois Biometric Information Privacy Act (BIPA) that protect Illinois residents’ sensitive biometric data, specifically their facial image data in these cases.

Biometric data can include information about what you look like (e.g., your face, your irises, your thumbprints or palmprints), genetic data (which some laws include under the “biometric data” umbrella), as well as other types of personally identifiable information (PII) (under which some laws include biometric and genetic data). These are highly sensitive forms of data that many federal and state laws regulate in a variety of ways, because once your unique biometric data is compromised, it is nearly impossible to fix that breach (at least without undergoing extreme physical or genetic adjustments).

BIPA is a prime example of such regulations. It requires companies that collect biometric data to comply with several procedures before, during, and after collecting users’ biometric data. Companies must: (1) have a written policy establishing a schedule for destroying biometric information within a set timeframe (§ 15(a)); (2) first inform users in writing of data collection, the specific purpose and length of the collection, and receive a written release from users before they “collect, capture, purchase, receive through trade, or otherwise obtain” biometric data (§ 15(b)); (3) refrain from selling, leasing, trading, or profiting from users’ biometric data (§ 15(c)); (4) refrain from disclosing or disseminating user biometric data without user consent, among other requirements (§ 15(d)); and (5) store, transmit, and protect biometric data from disclosure using “the reasonable standard of care” within that company’s industry or in a way that is consistent with how the company treats other sensitive data it possesses (§ 15(e)). Plaintiffs need only show that a company has failed to meet any one of these requirements to make a strong BIPA claim.

The BIPA claims in these two cases are important because BIPA has proven to be an effective, roundabout way to chastise companies for their invasive biometric data collection practices. Several powerful tech companies have chosen to settle BIPA lawsuits alleging they similarly collected and relied on users’ biometric data without following BIPA’s procedures: Clearview AI settled with the ACLU over its use of facial biometric data; Facebook settled over its scanning of facial geometry in users’ photos; Google also settled a similar suit, as did Snap (the company that owns Snapchat) and, most recently, TikTok. These were all multimillion-dollar settlements, except for the Clearview settlement, which instead severely restricted Clearview’s U.S. client base to law enforcement agents only.

In Flora, the plaintiffs allege Prisma failed to meet all of BIPA’s procedural requirements before collecting Lensa users’ selfies, pulling their facial geometry information from those images, and training the model on those geometries to output personalized avatars. Importantly, the Flora plaintiffs rely heavily on the version of Prisma’s Privacy Policy that was in place when they downloaded the app in early December 2022. According to them, that policy was vague in the terms it used to describe the facial information Lensa collects, never using the terms “facial geometry” or “biometric” despite using the term “face data” in a separate part of the Policy (after claiming the company does not collect such data).

In P.M., a subset of Illinois-based plaintiffs also allege that OpenAI violated BIPA’s procedural requirements. They claim that OpenAI collected and relied on their facial images from photos scraped from the internet to train image diffusion products like DALL-E, which can generate realistic depictions of human (and human-like) faces. The plaintiffs allege that OpenAI, like Prisma, did not have a public written policy concerning its use of facial data, did not receive a written release to collect and use this data from their images, and did not comply with several other BIPA requirements in developing and publishing DALL-E. If the case moves forward into discovery, OpenAI may finally have to divulge the internet sources of its training data beyond what has been uncovered by others.

LAST UPDATED 12/04/2023