If you read the leaked memo from Google, "We Have No Moat, and Neither Does OpenAI," you’ll find a revealing admission about AI development: “data quality scales better than data size.” This short phrase points to a sea change in thinking, one with significant implications for generative AI in particular. The last decade of AI has been marked by a relentless pursuit of scale: more data, and more computational resources to process that data. Training datasets went from being relatively small, with some human curation, to being massive, indiscriminate dragnets of the internet with little to no curation at all.
Back in 2003, Caltech 101 had under 10,000 images. By 2010, ImageNet was approaching 14 million images. In 2022, LAION-5B – used to create systems such as Stable Diffusion – was released with more than five billion images scraped from the web, along with their corresponding text captions. In April 2023, CommonPool launched with 12.8 billion image-text pairs. We are reaching a point where the entire territory of the internet has become the map of AI.
For example, if you looked at each image in ImageNet’s collection for ten seconds, it would take you four and a half years – difficult but doable. But even such a cursory examination of the contents of LAION-5B would take you 1,584 years, and a dip in the CommonPool would take 4,000 years. That’s fifty lifetimes.
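For readers who want to check the arithmetic, the figures above follow from a simple back-of-the-envelope calculation – assuming ten seconds per image, 365.25-day years, and an 80-year lifetime:

```python
# Back-of-the-envelope check of the viewing times quoted above.
# Assumptions: ten seconds per image, 365.25-day years, 80-year lifetimes.
SECONDS_PER_IMAGE = 10
SECONDS_PER_YEAR = 60 * 60 * 24 * 365.25
YEARS_PER_LIFETIME = 80

datasets = {
    "ImageNet": 14_000_000,        # ~14 million images
    "LAION-5B": 5_000_000_000,     # "more than five billion"
    "CommonPool": 12_800_000_000,  # 12.8 billion image-text pairs
}

for name, n_images in datasets.items():
    years = n_images * SECONDS_PER_IMAGE / SECONDS_PER_YEAR
    lifetimes = years / YEARS_PER_LIFETIME
    print(f"{name}: ~{years:,.0f} years of viewing (~{lifetimes:,.1f} lifetimes)")
```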
There’s a widely held misapprehension that because this approach appears to be working well, understanding what’s in training data doesn’t matter: that it’s an amorphous, vast mass of “stuff,” where only the size matters, not what it represents or how. It’s just grist to the mills of generative AI models. But training datasets are more important than ever. They determine the boundaries of what is known and unknown, they mark out the perimeter of the intelligible, and they encode worldviews. That is why it’s critical to understand how data is being used in generative AI systems.
The pretraining approach used in generative AI systems is unsupervised. Unlike supervised models, which learn from labeled data, unsupervised models learn the underlying patterns and distributions of the training data without explicit labels – for example, by predicting the next word in a sequence, or the correlations between words and images. The emphasis on scale, and the lack of attention to context, has created a kind of Engineering X-Games in which the winners are the ones who can release new models the fastest, based on the largest possible dataset – regardless of where it’s from, what it is, or who made it.
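To make the point concrete, here is a deliberately toy sketch – a simple bigram counter, nothing like a production model – of what learning “without explicit labels” means in practice: the training text itself supplies the prediction target, in this case the next word.

```python
from collections import Counter, defaultdict

# Toy illustration only: the training text itself supplies the prediction
# target (the next word), so no human-provided labels are needed.
corpus = "the cat sat on the mat and the dog sat on the rug".split()

# Count how often each word is followed by each other word.
next_word_counts = defaultdict(Counter)
for current, following in zip(corpus, corpus[1:]):
    next_word_counts[current][following] += 1

def predict_next(word):
    """Return the word most often seen after `word` in the corpus."""
    followers = next_word_counts[word]
    return followers.most_common(1)[0][0] if followers else None

print(predict_next("sat"))  # -> "on", learned purely from word co-occurrence
print(predict_next("the"))  # -> "cat" here; ties fall back to first-seen order
```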
We've seen the results of this thinking. Generative AI systems like Stable Diffusion produce extreme racial and gender stereotypes; creative producers from illustrators to photographers to programmers are suing AI companies for exploiting their work; and low-paid crowdworkers in places like Kenya, China, and Colombia are enlisted to perform traumatic post-hoc clean-up on toxic results. These problems could be mitigated, or prevented entirely, with greater attention to how AI systems are trained in the first place.
Back in 2017, I began looking through hundreds of training sets with the artist Trevor Paglen. We gathered influential datasets of text, images, and video. Few of these training datasets had ever been closely examined or assessed. How would we begin? In the early days, our work was humblingly manual: scanning through endless Excel sheets, sifting through thousands of images, developing a sense of the classificatory approaches at work. We mapped out their contents, taxonomies, and origin stories, with the twin aims of academic research and curating an exhibition. When working with ImageNet's vast collection, we used automated tools built by the technical researcher Leif Ryge. Ultimately, this included training models like ImageNet Roulette, so that people could see its racialized, gendered, and dehumanizing classificatory logics for themselves.
The Google memo points to the dawning realization that improvements in AI will require putting a lot more care and thought into how data is collected and curated. Even OpenAI, which relies on gargantuan datasets to make its products, is now pointing to this issue. A close engagement with datasets has been deeply undervalued in the AI field, and this neglect has had serious consequences downstream, from technical failures to human rights violations.
This is why investigating datasets is so important: not because companies want an edge in the current AI wars, but because we need to understand the ideologies, viewpoints, and harms that are being ingested, concentrated, and reproduced by AI systems. The new internet-scale datasets require new investigative methods and new research questions. What political and cultural inflections are baked into training sets? Who and what is represented? What is rendered invisible and unintelligible? Who profits from all this data, and at whose expense? What legal issues does the mass extraction of data raise for copyright, privacy, moral rights, and the right to publicity? What about the people whose creative work and livelihoods are impacted? How could these practices change? And as the accelerating machines of scrape-generate-publish-repeat begin to ingest their own material, what logics, perspectives, and aesthetics will be reinforced in this recursive loop?
The non-human scale of training data also necessitates new tools. At the Knowing Machines Project, we've been developing resources to help people work critically with datasets, and building new tools to see into them. Christo Buschek, whose background is in data investigations, designed our see:set tool to break these chaotic collections up into human-readable galleries. We have studied biodiversity datasets, object detection datasets, news datasets, and face recognition datasets, among others. We're publishing the results of this work and sharing the foundational issues we're finding from technical, legal, ethical, and epistemological perspectives.
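The essays here don't describe see:set's internals, so the following is only a rough sketch of the general idea behind such gallery tools – sampling an unwieldy image-text dataset down to something a person can actually look at – and not of the tool itself. The CSV columns ("url", "caption") and file names are hypothetical.

```python
import csv
import html
import random

# Illustrative sketch only, not the see:set tool: sample a web-scraped
# image-text dataset and render it as a browsable HTML gallery.
# Assumes a hypothetical CSV with "url" and "caption" columns.
def build_gallery(csv_path, out_path="gallery.html", sample_size=200, seed=0):
    with open(csv_path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))

    random.seed(seed)
    sample = random.sample(rows, min(sample_size, len(rows)))

    cells = [
        f'<figure><img src="{html.escape(r["url"])}" loading="lazy" width="200">'
        f"<figcaption>{html.escape(r['caption'])}</figcaption></figure>"
        for r in sample
    ]
    page = (
        "<!doctype html><meta charset='utf-8'>"
        "<body style='display:flex;flex-wrap:wrap;gap:8px'>" + "".join(cells)
    )

    with open(out_path, "w", encoding="utf-8") as f:
        f.write(page)

# Example usage (with a hypothetical file):
# build_gallery("image_text_pairs.csv", sample_size=100)
```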
See:set offers new ways of seeing data, and the essays in this collection use it to show what’s at stake in the practice of trying to capture the world in a dataset. To know that datasets encode politics is the necessary first move. The next step is to discover how, where, and to what ends.