How should we study datasets in machine learning? As machine learning (ML) increasingly becomes a site of sociotechnical inquiry, invoking numerous social, political, legal, and ethical issues, datasets are a crucial part of that inquiry because they are the core material used to train models. Inspired by Tarleton Gillespie and Nick Seaver’s Critical Algorithm Studies reading list, this collection is meant to serve as an entry point to the growing literature on ML datasets across the fields of computer science, human-computer interaction, science and technology studies, media studies, and histories of technology, among others. We compiled this list primarily as a resource for researchers seeking to understand, from a variety of perspectives, how ML datasets work, do work, and are worked upon. We hope it will also be of use to technology practitioners and students seeking to build ML systems.

We limit our scope to works that focus on datasets deployed in the training and testing of ML systems, and despite some overlap, this list is not a primer for the field of critical technology studies more generally. Entries are sorted into sections with the intention of providing readers with a preliminary structure that will help them follow their specific interests. We acknowledge that classificatory practice is always subjective, that many of these titles could fit appropriately under multiple sections, and that the sections could be named in different ways. The current iteration reflects our own ideas and what we find helpful as a way to organize the emerging literature that we are working with. There are certainly other ways to structure this reading list, and we are open to suggestions that expand its range and improve usability. Our focus is primarily on academic publications, but for those who are more interested in understanding how datasets have been discussed in the press as of July 2022, we offer a selection of examples at the end of the reading list.

This list is also not meant to be exhaustive. We see the list as a living resource and invite readers to make suggestions and contributions via this form if there are key titles that they think should be included. Please note that while all links are functional as of July 2022, we are unable to continuously monitor for updated versions of papers or fix broken links.

Despite these limitations, we hope this reading list might serve as a useful resource for scholars and practitioners investigating ML datasets as sociotechnical assemblages that shape and are shaped by social worlds.

STARTING POINTS

CONTEXTUALIZING THE STUDY OF DATASETS

PUBLIC SOURCES OF DATASETS

STUDYING DATASET PRODUCTION

ANALYSES OF TRAINING DATASETS

RESPONSES TO DATASET PROBLEMS

DATASET DOCUMENTATION PRACTICES

CONFERENCES FOCUSED ON DATASETS

PRESS TREATMENT OF DATASETS