9 Ways To See A Dataset

Artificial intelligence systems are often considered enigmatic or unknowable – complex neural networks whose parameters can number in the trillions and whose possible array of outputs is greater still. Yet no matter how convoluted or complex these systems become, the data used to train them remains one of the most important sources of evidence we can use to trace the histories, practices, and politics of how these systems interpret the world.

To further the understanding of training data, the Knowing Machines Project developed see:set, an investigative tool for examining the training datasets for AI. Here you will find nine essays from individual members of our team. Each one uses see:set to explore a key AI dataset and its role in the construction of “ground truth.” We invite you to use them to further interrogate the ways these systems structure knowledge, make predictions, represent reality, and intervene in the world.


9 Ways To See A Dataset: What’s at stake in examining datasets?

Kate Crawford
If you read the leaked memo from Google, “We Have No Moat, and Neither Does OpenAI”, you’ll find a revealing admission about AI development: “data quality scales better than data size.” This short phrase points to a sea change in thinking that has significant implications for generative AI in particular.

9 Ways To See A Dataset: Datasets as Institutions — The New York Times Annotated Corpus

Mike Ananny
The NYT Annotated Corpus (NYTAC) shows how datasets are institutional achievements.

9 Ways To See A Dataset: Investigating Datasets

Christo Buschek
While research and investigations share many similarities, there are also key differences that set them apart. An investigation aims to scrutinize the investigative subject, to discover something hidden or secret, and then tell a story about it.

9 Ways To See A Dataset: NABirds And The Instability Of Categories

Hamsini Sridharan
How does a machine learning algorithm “recognize” a bird? Generic computer vision datasets, composed of images scraped unceremoniously from sites like Flickr, are most useful for general object recognition, training algorithms to distinguish a raven from a writing desk (or, at any rate, “bird” from “furniture”). But what species of raven is it? Answering that question requires access to images broken into much narrower, more precise categories.

9 Ways To See A Dataset: Investigating ImageNet

Sasha Luccioni
ImageNet is one of the first datasets AI researchers are exposed to when learning and experimenting with computer vision approaches. Since it was first released in 2009, it has been used to train and evaluate nearly every AI model for the object recognition task, and improvement upon state-of-the-art performance on the dataset can translate into getting accepted into top AI conferences and appearing on leaderboards.

9 Ways To See A Dataset: Datasets as sociotechnical artifacts — The case of 'Colossal Cleaned Common Crawl' (C4)

Will Orr
The case of the Colossal Cleaned Common Crawl (C4) dataset underscores the importance of uncovering the sociotechnical dynamics that contribute to the formation of datasets.

9 Ways To See A Dataset: What Can LAION Teach Us About Copyright Law?

Jason Schultz
Looking through millions of dataset images can be a disorienting experience. Each one says something. But what does a sea of them say about the current status of copyright law?

9 Ways To See A Dataset: Consider The Iceland Gull

Jer Thorp
What makes a bird hard to see? It may be a small bird, or a well camouflaged one. It may be rare, or it could live in a place that’s hard to get to. Or, it might be hiding in plain sight.

9 Ways To See A Dataset: Some Blobs Are Human, Too

Jer Thorp
If you’ve found a featureless blob and you’d like to identify it, your first stop should be iNaturalist.