Knowing Machines

Knowing Machines is a research project tracing the histories, practices, and politics of how machine learning systems are trained to interpret the world.

We are developing critical methodologies and tools for understanding, analyzing, and investigating training datasets, and studying their role in the construction of “ground truth” for machine learning. Our research addresses how datasets index the world, make predictions, and structure knowledge cultures. Working with an international team, we aim to support the emerging field of critical data studies by contributing research, reading lists, and research tools, and by supporting communities of inquiry focused on the foundational epistemologies of machine learning.

Knowing Machines is sponsored by the Alfred P. Sloan Foundation.

Publications

Visual Story

Models all the Way Down

LAION-5B is an open-source foundation dataset used to train AI models such as Stable Diffusion. It contains 5.8 billion image and text pairs—a size too large to make sense of. In this visual investigation, we follow the construction of the dataset to better understand its contents, implications and entanglements.

Collection

Synthetic Media Media

This project traces how media systems are using, interpreting, and anticipating Generative AI to create public life. We’re studying how the news industry frames Generative AI, when and why journalists are using it in their work, which policies and guidelines organizations are creating to regulate its use, and how people and infrastructures have the power to make Generative AI a public problem.

Exhibition

Calculating Empires

Calculating Empires is a new exhibition by Kate Crawford and Vladan Joler that opens on November 23, 2023 at Fondazione Prada's Osservatorio in Milan. Joler and Crawford contextualize the current explosion of artificial intelligence by asking how we got here and considering where we might be going. Multiple works of critical cartography span the two floors and invite visitors to experience the longue durée of how technology and power have been intertwined since 1500.

Collection

Understanding the Work of Dataset Creators

The work of the people who make datasets is crucial. They build the architectures of ground truth that shape AI systems. Yet very little research has focused on dataset creators or listened to what they have to say. In this project, we speak with 18 dataset creators in a series of interviews that reveal the messy and contingent realities of dataset preparation. We hear about their practices and the shared challenges they face. We offer a set of actionable recommendations that would improve the practice of dataset creation while also building a more responsible AI ecosystem.

Collection

Bird in hand

What can birding teach us about machine learning? And how is AI shaping how we interact with nature? Projects at the intersection of nature observation, citizen science, and machine learning offer useful case studies for examining systems of dataset production, model training and human feedback. They also present an alternative model to the extractive and exploitative “Big Data” approach to training machine learning algorithms, offering many possibilities as well as unique challenges for thinking through how we relate to AI systems.

Explainer

Generative AI Legal Explainer

Generative AI raises a host of legal questions and concerns. Some of these questions will challenge existing legal rules and require new laws and policy frameworks. Others have answers that are quite well settled, notwithstanding the new AI context bringing attention to them.

Collection

Knowing Legal Machines

Many of the social questions raised by artificial intelligence are mediated through the legal system. Policymakers explore new rules to govern the technology, courts work to apply existing legal frameworks to new situations, and advocates propose entirely new approaches to deal with novel problems (or old problems with new prominence).

Collection

9 Ways to See a Dataset

To further the understanding of training data, the Knowing Machines Project developed SeeSet, an investigative tool for examining the training datasets for AI. Here you will find nine essays from individual members of our team. Each one uses SeeSet to explore a key AI dataset and its role in the construction of 'ground truth.'

Guide

A CRITICAL FIELD GUIDE FOR WORKING WITH MACHINE LEARNING DATASETS

Maybe you’re an engineer creating a new machine vision system to track birds. You might be a journalist using social media data to research Costa Rican households. You could be a researcher who stumbled upon your university’s archive of handwritten census cards from 1939. Or a designer creating a chatbot that relies on large language models like GPT-3. Perhaps you’re an artist experimenting with visual style combinations using DALLE-2. Or maybe you’re an activist with an urgent story that needs telling, and you’re searching for the right dataset to tell it.

Reading List

CRITICAL DATASET STUDIES

This collection provides a curated reading list for researchers, practitioners, and students seeking to understand how machine learning datasets work, are utilised, and are influenced by various social, political, and ethical issues. The list is organised into sections to help readers follow their specific interests and is primarily focused on academic publications. It is not meant to be exhaustive; we see it as a living resource and invite readers to make suggestions and contributions. We hope this reading list might serve as a useful resource for scholars and practitioners investigating ML datasets as sociotechnical assemblages that shape and are shaped by social worlds.