Summary of an unpublished article.
The field of critical dataset studies examines how datasets structure knowledge production through machine learning (ML). “Big Data” approaches to ML datasets emphasize mass data scraping, “low skill” annotation labor, and generic computational problems. Here, we study an alternative approach: the Visipedia computer vision project. Visipedia draws on amateur naturalists and professional scientists to construct the datasets used to train and evaluate species identification models. Through interviews and analysis of technical publications, we explore the perspectives on classification, labor, and expertise evinced in Visipedia’s datasets and tools, including eBird, Merlin, iNaturalist, and NABirds.
Building on literature from critical dataset studies and science and technology studies, we argue that logics of hybridity—of communities of practice, of humans and machines, and of living things vis-à-vis taxonomy—weave throughout Visipedia’s efforts. These logics shape how its datasets structure knowledge production in ways that differ from the “Big Data” paradigm, and they illuminate the possibilities and challenges of Visipedia’s approach to ML. On the one hand, Visipedia’s approach, rooted in citizen science, shows how communities of practice can come together across domains (biology and computer science) and levels of expertise (amateur and professional) to assemble ML datasets and tools that advance the goals of both computer vision and species identification for public enjoyment, scientific research, and conservation. It offers a direct counterpoint to the hidden and exploited labor of microworkers undergirding the appearance of machinic objectivity in datasets such as ImageNet. On the other hand, the hybridity lens shows how geographic, racial, and other disparities in the composition of naturalist communities are inherited by datasets; nature observation becomes more technologically dependent; and biological taxonomies fail to contain the behavior of living organisms (“odd ducks”).
We conclude that, given ML’s dependence on processes of classification, hybridity may be a useful lens for critical studies of ML datasets more broadly, showing how such classifications are constructed, and where they fall apart.