Critical Dataset Studies Reading List

How should we study datasets in machine learning? As machine learning (ML) increasingly becomes a site of sociotechnical inquiry, invoking numerous social, political, legal, and ethical issues, datasets are a crucial component as they are core material used to train models. Inspired by Tarleton Gillespie and Nick Seaver’s Critical Algorithm Studies reading list, this collection is meant to serve as an entry point to the growing literature on ML datasets across the fields of computer science, human-computer interaction, science and technology studies, media studies, and histories of technology, among others. We compiled this list primarily as a resource for researchers seeking to understand—from a variety of perspectives—how ML datasets work, do work, and are worked upon. We hope it will also be of use to technology practitioners and students seeking to build ML systems.

We limit our scope to works that focus on datasets deployed in the training and testing of ML systems, and despite some overlap, this list is not a primer for the field of critical technology studies more generally. Entries are sorted into various sections with the intention of providing readers a preliminary structure that will help them follow their specific interests. We acknowledge that classificatory practice is always subjective and that many of these titles can fit appropriately under multiple sections or be named in different ways. The current iteration is a reflection of our own ideas and what we find helpful as a way to organize the emerging literature that we are working with. There are certainly other ways to structure this reading list, and we are open to suggestions that expand its range and improve usability. Our focus is primarily on academic publications, but for those who are more interested in understanding how datasets have been discussed in the press as of July 2022, we offer a selection of examples at the end of the reading list.

This list is also not meant to be exhaustive. We see the list as a living resource and invite readers to make suggestions and contributions via this form if there are key titles that they think should be included. Please note that while all links are functional as of July 2022, we are unable to continuously monitor for updated versions of papers or fix broken links.

Despite these limitations, we hope this reading list might serve as a useful resource for scholars and practitioners investigating ML datasets as sociotechnical assemblages that shape and are shaped by social worlds.

STARTING POINTS

This section contains a broad set of introductory texts and locales to ground the study of training data. Resources included in this section cover the politics, possibilities, and pitfalls of ML training data and offer early provocations for thinking about particular aspects of training data, such as privacy or bias.

Barocas, S., & Selbst, A. D. (2016). Big Data’s Disparate Impact. California Law Review, 104(3), 671–732. https://www.californialawreview.org/wp-content/uploads/2016/06/2Barocas-Selbst.pdf
Crawford, K. (2021). Atlas of AI: Power, Politics and the Planetary Costs of Artificial Intelligence, see ‘Data’ chapter (pp. 89-122). New Haven, CT: Yale University Press.
Crawford, K., & Paglen, T. (2019). Excavating AI: The Politics of Images in Machine Learning Training Sets. https://excavating.ai
Denton, E., Hanna, A., Amironesei, R., Smart, A., Nicole, H., & Scheuerman, M. K. (2020). Bringing the People Back In: Contesting Benchmark Machine Learning Datasets. ArXiv. https://arxiv.org/abs/2007.07399
Harvey, A. (2021). Exposing.ai: Face and Biometric Image Datasets. https://exposing.ai/datasets/
MacKenzie, A., & Munster, A. (2019). Platform Seeing: Image Ensembles and Their Invisualities. Theory, Culture & Society, 36(5), 3–22. https://doi.org/10.1177/0263276419847508
Miceli, M., Posada, J., & Yang, T. (2022). Studying Up Machine Learning Data: Why Talk About Bias When We Mean Power? Proceedings of the ACM on Human-Computer Interaction, 6(GROUP), 1–14. https://doi.org/10.1145/3492853
Paullada, A., Raji, I. D., Bender, E. M., Denton, E., & Hanna, A. (2020). Data and Its (Dis)Contents: A Survey of Dataset Development and Use in Machine Learning Research. ArXiv. https://arxiv.org/abs/2012.05345v1
Roberge, J., & Castelle, M. (Eds.). (2020). The Cultural Life of Machine Learning: An Incursion into Critical AI Studies (1st ed. 2021 edition). Palgrave Macmillan.
Srinivasan, R., & Chander, A. (2021). Biases in AI Systems: A Survey for Practitioners. Queue, 19(2), 45-64. https://doi.org/10.1145/3466132.3466134
Suresh, H., & Guttag, J. V. (2021). A Framework for Understanding Sources of Harm throughout the Machine Learning Life Cycle. ArXiv. http://arxiv.org/abs/1901.10002
Thylstrup, N. B. (2022). The Ethics and Politics of Data Sets in the Age of Machine Learning: Deleting Traces and Encountering Remains. Media, Culture & Society. https://doi.org/10.1177/01634437211060226

CONTEXTUALIZING THE STUDY OF DATASETS

This section consists of broader foundational readings that don’t all necessarily deal specifically with machine learning datasets, but which the authors of this list have found useful to contextualize their study. We acknowledge that the titles below do not form an exhaustive index of all foundational readings, but point to them as particularly helpful ones for thinking about the ontological and epistemological complexities of the “dataset” as an object/genre of analysis.

a. Politics of Classification

This subsection focuses on classification as a practice of not only world-ordering, but also world-making, and how its logics underlie the ways in which datasets are conceived and built.

Boutyline, A., & Soter, L. K. (2021). Cultural Schemas: What They Are, How to Find Them, and What to Do Once You’ve Caught One. American Sociological Review, 86(4), 728–758. https://doi.org/10.1177/00031224211024525
Bechmann, A., & Bowker, G. C. (2019). Unsupervised by any other name: Hidden layers of knowledge production in artificial intelligence on social media. Big Data & Society, 6(1). https://doi.org/10.1177/2053951718819569
Bowker, G. C., & Star, S. L. (2000). Sorting Things Out: Classification and Its Consequences. Cambridge, MA: MIT Press.
Crawford, K. (2021). Atlas of AI: Power, Politics and the Planetary Costs of Artificial Intelligence, see ‘Classification’ chapter (pp. 123-150). New Haven, CT: Yale University Press.
Fourcade, M., & Healy, K. (2013). Classification Situations: Life-Chances in the Neoliberal Era. Accounting, Organizations and Society, 38(8), 559-572. https://doi.org/10.1016/j.aos.2013.11.002
Goodwin, C. (2000). Practices of Color Classification. Mind, Culture, and Activity, 7(1&2), 19-36. https://doi.org/10.1080/10749039.2000.9677646
Rieder, B. (2017). Scrutinizing an Algorithmic Technique: The Bayes Classifier as Interested Reading of Reality. Information, Communication & Society, 20(1), 100-117. https://doi.org/10.1080/1369118X.2016.1181195
Sadre-Orafai, S. (2020). Typologies, Typifications, and Types. Annual Review of Anthropology, 49(1), 193-208. https://doi.org/10.1146/annurev-anthro-102218-011235

b. Critical Data Studies

Here, we introduce a few titles from the emerging field of Critical Data Studies which we believe are especially useful for the purposes of acquiring a nuanced and interdisciplinary understanding of datasets.

Andrejevic, M. (2019). Automated Media (1st edition). Routledge.
Beer, D. (2018). The Data Gaze. London, UK: SAGE.
Cheney-Lippold, J. (2017). We Are Data: Algorithms and the Making of our Digital Selves. New York, NY: NYU Press.
Chun, W. (2021). Discriminating Data. Cambridge, MA: MIT Press.
Cifor, M., Garcia, P., Cowan, T. L., Rault, J., Sutherland, T., Chan, A., . . . Nakamura, L. (2019). Feminist Data Manifest-No. Retrieved from https://www.manifestno.com/
Couldry, N., & Mejias, U. A. (2019). The Costs of Connection: How Data Is Colonizing Human Life and Appropriating It for Capitalism. Stanford, CA: Stanford University Press.
D’Ignazio, C., & Klein, L. F. (2020). Data Feminism. MIT Press.
Gitelman, L. (2013). “Raw Data” Is an Oxymoron. MIT Press.
Hansson, K., & Dahlgren, A. (2022). Open research data repositories: Practices, norms, and metadata for sharing images. Journal of the Association for Information Science and Technology, 73(2), 303-316. https://doi.org/10.1002/asi.24571
Iliadis, A., & Russo, F. (2016). Critical data studies: An introduction. Big Data & Society, 3(2), 1-7. https://doi.org/10.1177/2053951716674238
Jaton, F. (2021). The Constitution of Algorithms: Ground-Truthing, Programming, Formulating. Cambridge, MA: MIT Press.
Kitchin, R. (2021). Data Lives. Bristol, UK: Bristol University Press.
Koopman, C. (2019). How We Became Our Data: A Genealogy of the Informational Person. Chicago, IL: University of Chicago Press.
Thorp, J. (2021). Living in Data: A Citizen's Guide to a Better Information Future. New York, NY: MCD.

c. Methodologies for Reading Data

This final subsection includes texts that deal more specifically with the different conceptualizations and methodologies through which datasets can be studied/read/analyzed.

boyd, d., & Crawford, K. (2012). Critical Questions for Big Data: Provocations for a Cultural, Technological, and Scholarly Phenomenon. Information, Communication & Society, 15(5), 662-679. https://doi.org/10.1080/1369118X.2012.678878
Brock, A. (2015). Deeper Data: A Response to boyd and Crawford. Media, Culture & Society, 37(7), 1084-1088. https://doi.org/10.1177/0163443715594105
Driscoll, K., & Walker, S. (2014). Working Within a Black Box: Transparency in the Collection and Production of Big Twitter Data. International Journal of Communication, 8, 1745–1764. https://ijoc.org/index.php/ijoc/article/view/2171/1159
Kitchin, R. (2014). The Data Revolution: Big Data, Open Data, Data Infrastructures and Their Consequences. London, UK: SAGE.
Leonelli, S., & Tempini, N. (Eds.). (2020). Data Journeys in the Sciences. Springer International Publishing.
Malevé, N. (2020). On the Data Set’s Ruins. AI & Society, 36, 1117–1131. https://doi.org/10.1007/s00146-020-01093-w
Metcalf, J., & Crawford, K. (2016). Where Are Human Subjects in Big Data Research? The Emerging Ethics Divide. Big Data & Society, 3(1), 1-14. https://doi.org/10.1177/2053951716650211
Munk, A. K., Olesen, A. G., & Jacomy, M. (2022). The Thick Machine: Anthropological AI Between Explanation and Explication. Big Data & Society, 9(1), 1-14. https://doi.org/10.1177/20539517211069891
Pasquale, F. (2021). Licensure as Data Governance. Knight First Amendment Institute. https://knightcolumbia.org/content/licensure-as-data-governance
Poirier, L. (2021). Reading Datasets: Strategies for Interpreting the Politics of Data Signification. Big Data & Society, 8(2), 1-19. https://doi.org/10.1177/20539517211029322
Suchman, L., & Trigg, R. H. (1993). Artificial Intelligence as Craftwork. In S. Chaiklin & J. Lave (Eds.), Understanding Practice (pp. 144-178). New York, NY: Cambridge University Press.
Zook, M., Barocas, S., boyd, d., Crawford, K., Keller, E., Gangadharan, S. P., Goodman, A., Hollander, R., Koenig, B. A., Metcalf, J., Narayanan, A., Nelson, A., & Pasquale, F. (2017). Ten Simple Rules for Responsible Big Data Research. PLOS Computational Biology, 13(3), e1005399. https://doi.org/10.1371/journal.pcbi.1005399

PUBLIC SOURCES OF DATASETS

While some datasets lie behind proprietary company walls, numerous datasets are available for public download. This section lists technical papers that accompany major public dataset releases, as well as popular repositories where disparate datasets are organized and made available to the broader public.

a. Source Papers for Noteworthy Datasets

New training datasets are typically accompanied by technical papers explaining the composition of the dataset and its potential applications. These papers often also include analyses of models using the new dataset and comparisons to similar existing datasets. There are far more dataset source papers than can be included on this list; below is a sampling of the most highly cited and broadly influential releases.

Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., & Schiele, B. (2016). The Cityscapes Dataset for Semantic Urban Scene Understanding. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3213–3223. https://openaccess.thecvf.com/content_cvpr_2016/html/Cordts_The_Cityscapes_Dataset_CVPR_2016_paper.html
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). ImageNet: A Large-Scale Hierarchical Image Database. 2009 IEEE Conference on Computer Vision and Pattern Recognition, 248–255. https://doi.org/10.1109/CVPR.2009.5206848
Geiger, A., Lenz, P., & Urtasun, R. (2012). Are We Ready for Autonomous Driving? The KITTI Vision Benchmark Suite. 2012 IEEE Conference on Computer Vision and Pattern Recognition, 3354–3361. https://doi.org/10.1109/CVPR.2012.6248074
Huang, G. B., Mattar, M., Berg, T., & Learned-Miller, E. (2008). Labeled Faces in the Wild: A Database for Studying Face Recognition in Unconstrained Environments. Workshop on Faces in “Real-Life” Images: Detection, Alignment, and Recognition. https://hal.inria.fr/inria-00321923
Krizhevsky, A. (2009). Learning Multiple Layers of Features from Tiny Images. https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf
LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-Based Learning Applied to Document Recognition. Proceedings of the IEEE, 86(11), 2278–2324. http://yann.lecun.com/exdb/publis/index.html#lecun-98
Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C. L. (2014). Microsoft COCO: Common Objects in Context. In D. Fleet, T. Pajdla, B. Schiele, & T. Tuytelaars (Eds.), Computer Vision – ECCV 2014 (pp. 740–755). Springer International Publishing. https://doi.org/10.1007/978-3-319-10602-1_48
Marcus, M., Santorini, B., & Marcinkiewicz, M. A. (1993). Building a Large Annotated Corpus of English: The Penn Treebank. Technical Reports (CIS). https://repository.upenn.edu/cis_reports/237
Martin, D., Fowlkes, C., Tal, D., & Malik, J. (2001). A Database of Human Segmented Natural Images and Its Application to Evaluating Segmentation Algorithms and Measuring Ecological Statistics. Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001, 2, 416–423. https://doi.org/10.1109/ICCV.2001.937655
Miller, G. A. (1995). WordNet: A Lexical Database for English. Communications of the ACM, 38(11), 39–41. https://doi.org/10.1145/219717.219748
Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C. D., Ng, A. Y., & Potts, C. (2013). Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, 1631–1642. https://nlp.stanford.edu/~socherr/EMNLP2013_RNTN.pdf
Thomee, B., Shamma, D. A., Friedland, G., Elizalde, B., Ni, K., Poland, D., Borth, D., & Li, L.-J. (2016). YFCC100M: The New Data in Multimedia Research. Communications of the ACM, 59(2), 64–73. https://doi.org/10.1145/2812802

b. Dataset Repositories

These sites provide infrastructure for organizing, finding, and downloading a wide variety of datasets.

Papers with Code: https://paperswithcode.com/datasets
UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets.php
Kaggle: https://www.kaggle.com/datasets
Hugging Face: https://huggingface.co/datasets
Google Dataset Search: https://datasetsearch.research.google.com/

STUDYING DATASET PRODUCTION

Training data requires significant human and computational effort to create. It is through this process of production that many of the effects of training data come to be shaped, from the processes of collection to labeling, deployment to deprecation. Texts in this section provide glimpses into the work behind datasets from varying angles, whether examining these production processes from a critical lens or describing the overall workflow of training data production from a technical standpoint.

a. Sociotechnical / Critical Approaches to Labor of Training Data

These texts draw on approaches and frameworks from science and technology studies, political economy, and labor studies to examine the production of training data through a critical lens, with attention to how power relations are at work in this process.

Famularo, J., Hensellek, B., & Walsh, P. (2021). Data Stewardship: A Letter to Computer Vision from Cultural Heritage Studies. CVPR 2021. https://www.academia.edu/49423941/Data_Stewardship_A_Letter_to_Computer_Vision_from_Cultural_Heritage_Studies?auto=citations&from=cover_page
Gray, M. L., & Suri, S. (2019). Ghost Work: How to Stop Silicon Valley from Building a New Global Underclass, see ‘Introduction: Ghosts in the Machine’ (pp. ix-xxxi) and ‘1. Humans in the Loop’ (pp. 1-38). Houghton Mifflin Harcourt.
Goetze, T. S., & Abramson, D. (2021). Bigger Isn’t Better: The Ethical and Scientific Vices of Extra-Large Datasets in Language Models. WebSci, pp. 69-75. https://doi.org/10.1145/3462741.3466809
Iliadis, A. (2019). The Tower of Babel problem: Making data make sense with Basic Formal Ontology. Online Information Review, 43(6), 1021–1045. https://doi.org/10.1108/OIR-07-2018-0210
Jones, P. (2021, September 22). Refugees Help Power Machine Learning Advances at Microsoft, Facebook, and Amazon. Rest of World. https://restofworld.org/2021/refugees-machine-learning-big-tech/
Miceli, M., Schuessler, M., & Yang, T. (2020). Between Subjectivity and Imposition: Power Dynamics in Data Annotation for Computer Vision. Proceedings of the ACM on Human-Computer Interaction, 4(CSCW2), 1-25. https://doi.org/10.1145/3415186
Newlands, G. (2021). Lifting the Curtain: Strategic Visibility of Human Labour in AI-as-a-Service. Big Data & Society, 8(1), 1-14. https://doi.org/10.1177/20539517211016026
Sachs, S. E. (2020). The Algorithm At Work? Explanation and Repair in the Enactment of Similarity in Art Data. Information, Communication & Society, 23(11), 1689–1705. https://doi.org/10.1080/1369118X.2019.1612933
Sambasivan, N. (2021). Seeing Like a Dataset from the Global South. Interactions, 28(4), 76–78. https://doi.org/10.1145/3466160
Sap, M., Card, D., Gabriel, S., Choi, Y., & Smith, N. A. (2019). The Risk of Racial Bias in Hate Speech Detection. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 1668–1678. https://doi.org/10.18653/v1/P19-1163

b. Organizational Workflows in Dataset Production

Texts included here look to training data production from a practitioner-oriented lens. They survey either the entire workflow of training data production or specific stages within this process to identify challenges and suggest best practices.

Amershi, S., Begel, A., Bird, C., DeLine, R., Gall, H., Kamar, E., Nagappan, N., Nushi, B., & Zimmermann, T. (2019). Software Engineering for Machine Learning: A Case Study. 2019 IEEE/ACM 41st International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP), 291–300. https://doi.org/10.1109/ICSE-SEIP.2019.00042
Ashmore, R., Calinescu, R., & Paterson, C. (2019). Assuring the Machine Learning Lifecycle: Desiderata, Methods, and Challenges. ArXiv. http://arxiv.org/abs/1905.04223
Barclay, I., Taylor, H., Preece, A., Taylor, I., Verma, D., & de Mel, G. (2020). A Framework for Fostering Transparency in Shared Artificial Intelligence Models by Increasing Visibility of Contributions. Concurrency and Computation: Practice and Experience, 33(19), e6129. https://doi.org/10.1002/cpe.6129
Bhardwaj, A., Bhattacherjee, S., Chavan, A., Deshpande, A., Elmore, A. J., Madden, S., & Parameswaran, A. G. (2014). DataHub: Collaborative Data Science & Dataset Version Management at Scale. ArXiv. http://arxiv.org/abs/1409.0798
Chandrabose, A., & Chakravarthi, B. R. (2021). An Overview of Fairness in Data – Illuminating the Bias in Data Pipeline. LTEDI. https://aclanthology.org/2021.ltedi-1.5
Dong, W., & Fu, W.-T. (2010). Cultural Difference in Image Tagging. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 981–984. https://doi.org/10.1145/1753326.1753472
Hanley, M., Khandelwal, A., Averbuch-Elor, H., Snavely, N., & Nissenbaum, H. (2020). An Ethical Highlighter for People-Centric Dataset Creation. ArXiv. http://arxiv.org/abs/2011.13583
Hutchinson, B., Smart, A., Hanna, A., Denton, E., Greer, C., Kjartansson, O., Barnes, P., & Mitchell, M. (2021). Towards Accountability for Machine Learning Datasets: Practices from Software Engineering and Infrastructure. ArXiv. http://arxiv.org/abs/2010.13561
Geiger, R., Cope, D., Ip, J., Lotosh, M., Shah, A., Weng, J., & Tang, R. (2021). “Garbage In, Garbage Out” Revisited: What Do Machine Learning Application Papers Report About Human-Labeled Training Data? ArXiv. https://doi.org/10.1162/qss_a_00144
Holstein, K., Vaughan, J. W., Daumé III, H., Dudík, M., & Wallach, H. (2019). Improving Fairness in Machine Learning Systems: What Do Industry Practitioners Need? Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, 1–16. https://doi.org/10.1145/3290605.3300830
Muller, M. J., Wolf, C. T., Andres, J., Desmond, M., Joshi, N. N., Ashktorab, Z., Sharma, A., Brimijoin, K., Pan, Q., Duesterwald, E., & Dugan, C. (2021). Designing Ground Truth and the Social Life of Labels. Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, 1-16. https://doi.org/10.1145/3411764.3445402
Polyzotis, N., Roy, S., Whang, S. E., & Zinkevich, M. (2018). Data Lifecycle Challenges in Production Machine Learning: A Survey. ACM SIGMOD Record, 47(2), 17–28. https://doi.org/10.1145/3299887.3299891
Roh, Y., Heo, G., & Whang, S. E. (2021). A Survey on Data Collection for Machine Learning: A Big Data - AI Integration Perspective. IEEE Transactions on Knowledge and Data Engineering, 33(4), 1328–1347. https://doi.org/10.1109/TKDE.2019.2946162
Tatman, R. (2018). Setting Up Your Public Data for Success. 2018 IEEE International Conference on Big Data (Big Data), 3261–3262. https://doi.org/10.1109/BigData.2018.8622190
Sachdeva, P. S., Barreto, R., von Vacano, C., & Kennedy, C. J. (2022). Assessing Annotator Identity Sensitivity via Item Response Theory: A Case Study in a Hate Speech Corpus. 2022 ACM Conference on Fairness, Accountability, and Transparency, 1585–1603. https://doi.org/10.1145/3531146.3533216
Sambasivan, N., & Veeraraghavan, R. (2022). The Deskilling of Domain Expertise in AI Development. Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems, 1-14. https://doi.org/10.1145/3491102.3517578
Shanmugam, D., Diaz, F., Shabanian, S., Funck, M., & Biega, A. (2022). Learning to Limit Data Collection via Scaling Laws: A Computational Interpretation for the Legal Principle of Data Minimization. 2022 ACM Conference on Fairness, Accountability, and Transparency, 839-849. https://doi.org/10.1145/3531146.3533148
Vaughan, J. W. (2018). Making Better Use of the Crowd: How Crowdsourcing Can Advance Machine Learning Research. Journal of Machine Learning Research, 18(193), 1–46. https://dl.acm.org/doi/10.5555/3122009.3242050
Wang, D., Prabhat, S., & Sambasivan, N. (2022). Whose AI Dream? In search of the aspiration in data annotation. Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems, 1-16. https://doi.org/10.1145/3491102.3502121

ANALYSES OF TRAINING DATASETS

This section highlights works that analyze training datasets from a variety of methodological and theoretical perspectives. While we understand that many of the titles that span across the major headings in this reading list involve some form of “dataset analysis,” we highlight in this particular section studies in which the analysis itself comprises the thrust of the article/chapter/work. The works in this section focus primarily on the details of the analysis as opposed to conducting an analysis as a preliminary step to introduce a more central argument or intervention.

a. Sociotechnical & Critical Studies

This subsection focuses on articles and chapters that approach their analyses of training datasets grounded in frameworks primarily taken from critical studies or science and technology studies.

Bao, M., Zhou, A., Zottola, S. A., Brubach, B., Desmarais, S., Horowitz, A., Lum, K., & Venkatasubramanian, S. (2021). It’s COMPASlicated: The Messy Relationship between RAI Datasets and Algorithmic Fairness Benchmarks. ArXiv. https://arxiv.org/abs/2106.05498
Busch, L. (2014). A Dozen Ways to Get Lost in Translation: Inherent Challenges in Large Scale Data Sets. International Journal of Communication, 8, 1727-1744. https://ijoc.org/index.php/ijoc/article/view/2160
Coleman, C. N. (2020). Managing Bias When Library Collections Become Data. International Journal of Librarianship, 5(1), 8–19. https://doi.org/10.23974/ijol.2020.vol5.1.162
Coveney, P. V., Dougherty, E. R., & Highfield, R. R. (2016). Big Data Need Big Theory Too. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 374, 1-11. https://doi.org/10.1098/rsta.2016.0153
Feinberg, M. (2017). A Design Perspective on Data. Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, 2952–2963. https://doi.org/10.1145/3025453.3025837
Jo, E. S., & Gebru, T. (2020). Lessons from Archives: Strategies for Collecting Sociocultural Data in Machine Learning. Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, 306–316. https://doi.org/10.1145/3351095.3372829
Prabhu, V. U., & Birhane, A. (2020). Large Image Datasets: A Pyrrhic Win for Computer Vision? ArXiv. http://arxiv.org/abs/2006.16923
Richardson, R., Schultz, J. M., & Crawford, K. (2019). Dirty Data, Bad Predictions: How Civil Rights Violations Impact Police Data, Predictive Policing Systems, and Justice. NYU Law Review, 94(15), 15–55. https://www.nyulawreview.org/online-features/dirty-data-bad-predictions-how-civil-rights-violations-impact-police-data-predictive-policing-systems-and-justice/
Sambasivan, N., Kapania, S., Highfill, H., Akrong, D., Paritosh, P. K., & Aroyo, L. (2021). “Everyone Wants to Do the Model Work, Not the Data Work”: Data Cascades in High-Stakes AI. Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, 1-15. https://doi.org/10.1145/3411764.3445518
Scheuerman, M. K., Denton, E., & Hanna, A. (2021). Do Datasets Have Politics? Disciplinary Values in Computer Vision Dataset Development. ArXiv. https://doi.org/10.1145/3476058
Scheuerman, M. K., Paul, J. M., & Brubaker, J. R. (2019). How Computers See Gender: An Evaluation of Gender Classification in Commercial Facial Analysis Services. Proceedings of the ACM on Human-Computer Interaction, 3(CSCW), 1-33. https://doi.org/10.1145/3359246
Scheuerman, M. K., Wade, K., Lustig, C., & Brubaker, J. R. (2020). How We’ve Taught Algorithms to See Identity: Constructing Race and Gender in Image Databases for Facial Analysis. Proceedings of the ACM on Human-Computer Interaction, 4(CSCW1), 1-35. https://doi.org/10.1145/3392866
Smits, T., & Wevers, M. (2021). The Agency of Computer Vision Models as Optical Instruments. Visual Communication, 1-21. https://doi.org/10.1177/1470357221992097
Stevens, N., & Keyes, O. (2021). Seeing infrastructure: Race, Facial Recognition and the Politics of Data. Cultural Studies, 35(4-5), 833-853. https://doi.org/10.1080/09502386.2021.1895252
Trewin, S. (2018). AI Fairness for People with Disabilities: Point of View. ArXiv. http://arxiv.org/abs/1811.10670

b. Technical Approaches to Studying Datasets

Here, we introduce works that detail “technical” methods for the study of datasets. While the titles housed under the following subsection c, “Technical Audits,” deal with the investigative technical analysis of particular datasets, the works in this subsection are more concerned with introducing technical methods to approach the study of datasets and their particular components. Many of these studies do contain audit-style analyses, but we differentiate them from subsection c because their focus is on introducing or using technical methods for dataset analysis in general, as opposed to dissecting various components of particular datasets.

Balayn, A., Kulynych, B., & Guerses, S. (2021). Exploring Data Pipelines through the Process Lens: A Reference Model for Computer Vision. ArXiv. https://arxiv.org/abs/2107.01824
Bender, E. M., Gebru, T., McMillan-Major, A., & Mitchell, M. (2021). On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜. FAccT. https://doi.org/10.1145/3442188.3445922
Blodgett, S. L., Lopez, G., Olteanu, A., Sim, R., & Wallach, H. (2021). Stereotyping Norwegian Salmon: An Inventory of Pitfalls in Fairness Benchmark Datasets. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics, 1, 1004-1015.
Cheng, V., Suriyakumar, V., Dullerud, N., Joshi, S., & Ghassemi, M. (2021). Can You Fake It Until You Make It?: Impacts of Differentially Private Synthetic Data on Downstream Classification Fairness. Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 149-160. https://doi.org/10.1145/3442188.3445879
Gardner, M., Merrill, W., Dodge, J., Peters, M. E., Ross, A., Singh, S., & Smith, N. A. (2021). Competency Problems: On Finding and Removing Artifacts in Language Data. ArXiv. https://arxiv.org/abs/2104.08646
Fabbrizzi, S., Papadopoulos, S., Ntoutsi, E., & Kompatsiaris, Y. (2021). A Survey on Bias in Visual Datasets. ArXiv. https://arxiv.org/abs/2107.07919
Hirota, Y., Nakashima, Y., & Garcia, N. (2022). Gender and Racial Bias in Visual Question Answering Datasets. 2022 ACM Conference on Fairness, Accountability, and Transparency, 1280–1292. https://doi.org/10.1145/3531146.3533184
Hutchinson, B., Rostamzadeh, N., Greer, C., Heller, K., & Prabhakaran, V. (2022). Evaluation Gaps in Machine Learning Practice. 2022 ACM Conference on Fairness, Accountability, and Transparency, 1859–1876. https://doi.org/10.1145/3531146.3533233
Jung, T., Kang, D., Mentch, L., & Hovy, E. (2019). Earlier Isn’t Always Better: Sub-aspect Analysis on Corpus and System Biases in Summarization. ArXiv. http://arxiv.org/abs/1908.11723
Kilgarriff, A., & Grefenstette, G. (2003). Introduction to the Special Issue on the Web as Corpus. Computational Linguistics, 29(3), 333–348. https://doi.org/10.1162/089120103322711569
Koesten, L., Vougiouklis, P., Simperl, E., & Groth, P. (2020). Dataset Reuse: Toward Translating Principles to Practice. Patterns, 1(8), 100136. https://doi.org/10.1016/j.patter.2020.100136
Laranjeira da Silva, C., Macedo, J., Avila, S., & dos Santos, J. (2022). Seeing without Looking: Analysis Pipeline for Child Sexual Abuse Datasets. 2022 ACM Conference on Fairness, Accountability, and Transparency, 2189–2205. https://doi.org/10.1145/3531146.3534636
Madras, D., Creager, E., Pitassi, T., & Zemel, R. (2019). Fairness through Causal Awareness: Learning Causal Latent-Variable Models for Biased Data. Proceedings of the Conference on Fairness, Accountability, and Transparency, 349–358. https://doi.org/10.1145/3287560.3287564
Moreno-Torres, J. G., Raeder, T., Alaiz-Rodríguez, R., Chawla, N. V., & Herrera, F. (2012). A Unifying View on Dataset Shift in Classification. Pattern Recognition, 45(1), 521–530. https://doi.org/10.1016/j.patcog.2011.06.019
Olson, R. S., La Cava, W., Orzechowski, P., Urbanowicz, R. J., & Moore, J. H. (2017). PMLB: A Large Benchmark Suite for Machine Learning Evaluation and Comparison. BioData Mining, 10(36). https://doi.org/10.1186/s13040-017-0154-4
Rabanser, S., Günnemann, S., & Lipton, Z. C. (2019). Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift. ArXiv. http://arxiv.org/abs/1810.11953
Rieke, A., Sutherland, V., Svirsky, D., & Hsu, M. (2022). Imperfect Inferences: A Practical Assessment. 2022 ACM Conference on Fairness, Accountability, and Transparency, 767-777. https://doi.org/10.1145/3531146.3533140
Straw, I., & Callison-Burch, C. (2020). Artificial Intelligence in Mental Health and the Biases of Language Based Models. PLOS ONE, 15(12), e0240376. https://doi.org/10.1371/journal.pone.0240376
Welty, C., Paritosh, P., & Aroyo, L. (2019). Metrology for AI: From Benchmarks to Instruments. ArXiv. https://arxiv.org/abs/1911.01875v1
Wesley, A. M., & Matisziw, T. C. (2021). Methods for Measuring Geodiversity in Large Overhead Imagery Datasets. IEEE Access, 9, 100279–100293. https://doi.org/10.1109/ACCESS.2021.3096034
Zanella-Béguelin, S., Wutschitz, L., Tople, S., Rühle, V., Paverd, A., Ohrimenko, O., Köpf, B., & Brockschmidt, M. (2020). Analyzing Information Leakage of Updates to Natural Language Models. Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security, 363–375. https://doi.org/10.1145/3372297.3417880
Zhong, R., Chen, Y., Patton, D., Selous, C., & McKeown, K. (2019). Detecting and Reducing Bias in a High Stakes Domain. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 4765–4775. https://doi.org/10.18653/v1/D19-1483

c. Technical Audits

This subsection includes works that employ technical audit-style investigations (e.g., Buolamwini & Gebru, 2018; Raji et al., 2020) of particular datasets.

Babaeianjelodar, M., Lorenz, S., Gordon, J., Matthews, J., & Freitag, E. (2020). Quantifying Gender Bias in Different Corpora. Companion Proceedings of the Web Conference 2020, 752–759. https://doi.org/10.1145/3366424.3383559
Bountouridis, D., Makhortykh, M., Sullivan, E., Harambam, J., Tintarev, N., & Hauff, C. (2019). Annotating Credibility: Identifying and Mitigating Bias in Credibility Datasets. ROME 2019 - Workshop on Reducing Online Misinformation Exposure. https://rome2019.github.io/papers/Bountouridis_etal_ROME2019.pdf
Buolamwini, J., & Gebru, T. (2018, January). Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification. Conference on Fairness, Accountability and Transparency, 77–91. https://www.media.mit.edu/publications/gender-shades-intersectional-accuracy-disparities-in-commercial-gender-classification/
Costanza-Chock, S., Raji, I. D., & Buolamwini, J. (2022). Who Audits the Auditors? Recommendations from a field scan of the algorithmic auditing ecosystem. 2022 ACM Conference on Fairness, Accountability, and Transparency, 1571–1583. https://doi.org/10.1145/3531146.3533213
Davidson, T., Bhattacharya, D., & Weber, I. (2019). Racial Bias in Hate Speech and Abusive Language Detection Datasets. ArXiv. http://arxiv.org/abs/1905.12516
Dulhanty, C., & Wong, A. (2019). Auditing ImageNet: Towards a Model-driven Framework for Annotating Demographic Attributes of Large-Scale Image Datasets. ArXiv. http://arxiv.org/abs/1905.01347
Dulhanty, C., & Wong, A. (2020). Investigating the Impact of Inclusion in Face Recognition Training Data on Individual Face Identification. Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, 244–250. https://doi.org/10.1145/3375627.3375875
Dulhanty, C. (2020). Issues in Computer Vision Data Collection: Bias, Consent, and Label Taxonomy [University of Waterloo]. https://uwspace.uwaterloo.ca/handle/10012/16414
Heinzerling, B. (2019, July 21). NLP’s Clever Hans Moment has Arrived. Benjamin Heinzerling. https://bheinzerling.github.io/post/clever-hans/
Hutchinson, B., Prabhakaran, V., Denton, E., Webster, K., Zhong, Y., & Denuyl, S. (2020). Social Biases in NLP Models as Barriers for Persons with Disabilities. ArXiv. http://arxiv.org/abs/2005.00813
Klockmann, V., von Schenk, A., & Villeval, M. C. (2021). Artificial Intelligence, Ethics, and Diffused Pivotality. Working Paper Series, GATE. https://ssrn.com/abstract=3853829
Luccioni, A., & Viviano, J. (2021). What’s in the Box? A Preliminary Analysis of Undesirable Content in the Common Crawl Corpus. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, 182-189. https://aclanthology.org/2021.acl-short.24.pdf
Mecati, M., Cannavò, F. E., Vetrò, A., & Torchiano, M. (2020). Identifying Risks in Datasets for Automated Decision–Making. In G. Viale Pereira, M. Janssen, H. Lee, I. Lindgren, M. P. Rodríguez Bolívar, H. J. Scholl, & A. Zuiderwijk (Eds.), Electronic Government (pp. 332–344). Springer International Publishing. https://doi.org/10.1007/978-3-030-57599-1_25
Raji, I. D., & Fried, G. (2021). About Face: A Survey of Facial Recognition Evaluation. ArXiv. http://arxiv.org/abs/2102.00813
Raji, I. D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., & Denton, E. (2020). Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing. ArXiv. http://arxiv.org/abs/2001.00964
Rambachan, A., & Roth, J. (2020). Bias In, Bias Out? Evaluating the Folk Wisdom. ArXiv. https://doi.org/10.4230/LIPIcs.FORC.2020.6
Shankar, S., Halpern, Y., Breck, E., Atwood, J., Wilson, J., & Sculley, D. (2017). No Classification without Representation: Assessing Geodiversity Issues in Open Data Sets for the Developing World. ArXiv. https://arxiv.org/abs/1711.08536
Vidgen, B., & Derczynski, L. (2020). Directions in Abusive Language Training Data: Garbage In, Garbage Out. ArXiv. https://arxiv.org/abs/2004.01670
Wang, T., Zhao, J., Yatskar, M., Chang, K.-W., & Ordonez, V. (2019). Balanced Datasets Are Not Enough: Estimating and Mitigating Gender Bias in Deep Image Representations. 2019 IEEE/CVF International Conference on Computer Vision (ICCV). https://doi.org/10.1109/ICCV.2019.00541

d. Visual & Artistic Approaches to Datasets

This final subsection assembles artistic and visual approaches/formats for the analysis of datasets.

Baker, D. (2022). Datasets Have Worldviews [Website]. PAIR Explorables. https://pair.withgoogle.com/explorables/dataset-worldviews/
Crawford, K., & Paglen, T. (2019). Training Humans [Large-scale exhibition]. Fondazione Prada, Milan, 2019–2020. https://www.fondazioneprada.org/project/training-humans/?lang=en (Publication: Training Humans Book)
Dewey-Hagborg, H. (2019). How Do You See Me? [Adversarial processes]. The Photographers’ Gallery, London, UK. https://thephotographersgallery.org.uk/whats-on/heather-dewey-hagborg-how-do-you-see-me
Malevé, N. (2019). 12 hours of ImageNet [Computer script]. The Photographers’ Gallery, London, UK. https://thephotographersgallery.org.uk/whats-on/exhibiting-imagenet
Paglen, T., & Crawford, K. (2019). ImageNet Roulette [Software program]. Launched at SXSW. https://www.youtube.com/watch?v=S0yEPZJnvgs
Pipkin, E. (2020). On Lacework: Watching an Entire Machine-Learning Dataset. Unthinking Photography. https://unthinking.photography/articles/on-lacework
Ridler, A. (2018). Myriad (Tulips) [C-type digital prints with handwritten annotations, magnetic paint, magnets]. Barbican Centre, London, UK. http://annaridler.com/myriad-tulips

RESPONSES TO DATASET PROBLEMS

Here we assemble literature that proposes responses to commonly identified sociotechnical problems with ML datasets. Most of the articles in this vein focus on technical responses to bias (broadly construed), while a few address other concerns such as privacy and security. We do not necessarily endorse these approaches; rather, this is a loose mapping of emerging areas of focus in response to problems. Note that there is some overlap with the readings suggested in Section 5, as many of these papers investigate particular datasets; however, the papers listed here emphasize approaches to addressing specific problems.

a. General Recommendations for Dataset Design

This subsection covers miscellaneous broad recommendations for the creation of fairer and more accountable datasets.

Andrus, M., & Villeneuve, S. (2022). Demographic-Reliant Algorithmic Fairness: Characterizing the Risks of Demographic Data Collection in the Pursuit of Fairness. 2022 ACM Conference on Fairness, Accountability, and Transparency, 1709–1721. https://doi.org/10.1145/3531146.3533226
Bilstrup, K.-E. K., Kaspersen, M. H., Assent, I., Enni, S., & Petersen, M. G. (2022). From Demo to Design in Teaching Machine Learning. 2022 ACM Conference on Fairness, Accountability, and Transparency, 2168–2178. https://doi.org/10.1145/3531146.3534634
Bowman, S. R., & Dahl, G. E. (2021). What Will it Take to Fix Benchmarking in Natural Language Understanding? NAACL. https://doi.org/10.18653/V1/2021.NAACL-MAIN.385
Boyd, K. (2022). Designing Up with Value-Sensitive Design: Building a Field Guide for Ethical ML Development. 2022 ACM Conference on Fairness, Accountability, and Transparency, 2069–2082. https://doi.org/10.1145/3531146.3534626
Kiela, D., Bartolo, M., Nie, Y., Kaushik, D., Geiger, A., Wu, Z., Vidgen, B., Prasad, G., Singh, A., Ringshia, P., Ma, Z., Thrush, T., Riedel, S., Waseem, Z., Stenetorp, P., Jia, R., Bansal, M., Potts, C., & Williams, A. (2021). Dynabench: Rethinking Benchmarking in NLP. NAACL. https://doi.org/10.18653/V1/2021.NAACL-MAIN.324
Panch, T., Pollard, T. J., Mattie, H., Lindemer, E., Keane, P. A., & Celi, L. A. (2020). “Yes, But Will It Work for My Patients?” Driving Clinically Relevant Research with Benchmark Datasets. Npj Digital Medicine, 3(1), 1–4. https://doi.org/10.1038/s41746-020-0295-6
Peng, K., Mathur, A., & Narayanan, A. (2021). Mitigating Dataset Harms Requires Stewardship: Lessons from 1000 Papers. ArXiv. http://arxiv.org/abs/2108.02922
Rogers, A. (2021). Changing the World by Changing the Data. ArXiv. https://arxiv.org/abs/2105.13947
Rolf, E., Worledge, T., Recht, B., & Jordan, M. I. (2021). Representation Matters: Assessing the Importance of Subgroup Allocations in Training Data. ArXiv. https://arxiv.org/abs/2103.03399
Selbst, A. D., Boyd, D., Friedler, S. A., Venkatasubramanian, S., & Vertesi, J. (2019). Fairness and Abstraction in Sociotechnical Systems. Proceedings of the Conference on Fairness, Accountability, and Transparency, 59–68. https://doi.org/10.1145/3287560.3287598
Stasaski, K., Yang, G. H., & Hearst, M. A. (2020). More Diverse Dialogue Datasets via Diversity-Informed Data Collection. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 4958–4968. https://doi.org/10.18653/v1/2020.acl-main.446
Suresh, H., Movva, R., Lee Dogan, A., Bhargava, D., Isadora, C., Martinez Cuba, A., Taurino, G., So, W., & D’Ignazio, C. (2022). Towards Intersectional Feminist and Participatory ML: A Case Study in Supporting Femicide Counterdata Collection. 2022 ACM Conference on Fairness, Accountability, and Transparency, 667–678. https://doi.org/10.1145/3531146.3533132

b. Creating New Datasets and/or Remediation of Existing Datasets

This subsection includes articles that either remediate specific existing datasets or detail the creation of alternative datasets to address identified privacy and bias issues.

Asano, Y., Rupprecht, C., Zisserman, A., & Vedaldi, A. (2021). PASS: An ImageNet Replacement for Self-Supervised Pretraining Without Humans. ArXiv. https://arxiv.org/abs/2109.13228
Brown, H., Lee, K., Mireshghallah, F., Shokri, R., & Tramèr, F. (2022). What Does it Mean for a Language Model to Preserve Privacy? 2022 ACM Conference on Fairness, Accountability, and Transparency, 2280–2292. https://doi.org/10.1145/3531146.3534642
Cai, W., Encarnacion, R., Chern, B., Corbett-Davies, S., Bogen, M., Bergman, S., & Goel, S. (2022). Adaptive Sampling Strategies to Construct Equitable Training Datasets. 2022 ACM Conference on Fairness, Accountability, and Transparency, 1467–1478. https://doi.org/10.1145/3531146.3533203
Jernite, Y., Nguyen, H., Biderman, S., Rogers, A., Masoud, M., Danchev, V., Tan, S., Luccioni, A. S., Subramani, N., Johnson, I., Dupont, G., Dodge, J., Lo, K., Talat, Z., Radev, D., Gokaslan, A., Nikpoor, S., Henderson, P., Bommasani, R., & Mitchell, M. (2022). Data Governance in the Age of Large-Scale Data-Driven Language Technology. 2022 ACM Conference on Fairness, Accountability, and Transparency, 2206–2222. https://doi.org/10.1145/3531146.3534637
Khashabi, D., Chaturvedi, S., Roth, M., Upadhyay, S., & Roth, D. (2018). Looking Beyond the Surface: A Challenge Set for Reading Comprehension over Multiple Sentences. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 1, 252–262. https://doi.org/10.18653/v1/N18-1023
Yang, K., Qinami, K., Fei-Fei, L., Deng, J., & Russakovsky, O. (2020). Towards Fairer Datasets: Filtering and Balancing the Distribution of the People Subtree in the ImageNet Hierarchy. FAT* '20: Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, 547-558. https://doi.org/10.1145/3351095.3375709
Yang, K., Yau, J., Fei-Fei, L., Deng, J., & Russakovsky, O. (2021). A Study of Face Obfuscation in ImageNet. ArXiv. https://arxiv.org/abs/2103.06191
Zellers, R., Bisk, Y., Schwartz, R., & Choi, Y. (2018). SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference. ArXiv. https://arxiv.org/abs/1808.05326v1

c. Data Annotation Workflows

Articles in this subsection address biased machine learning datasets by proposing changes to data annotation processes.

Barbosa, N. M., & Chen, M. (2019). Rehumanized Crowdsourcing: A Labeling Framework Addressing Bias and Ethics in Machine Learning. Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, 1–12. http://doi.org/10.1145/3290605.3300773
Beretta, E., Vetrò, A., Lepri, B., & De Martin, J. C. (2019). Ethical and Socially-Aware Data Labels. In J. A. Lossio-Ventura, D. Muñante, & H. Alatrista-Salas (Eds.), Information Management and Big Data, 320–327. Springer International Publishing. https://doi.org/10.1007/978-3-030-11680-4_30
Beretta, E., Vetrò, A., Lepri, B., & De Martin, J. C. (2021). Detecting Discriminatory Risk Through Data Annotation Based on Bayesian Inferences. FAccT. https://doi.org/10.1145/3442188.3445940
Rateike, M., Majumdar, A., Mineeva, O., Gummadi, K. P., & Valera, I. (2022). Don’t Throw it Away! The Utility of Unlabeled Data in Fair Decision Making. 2022 ACM Conference on Fairness, Accountability, and Transparency, 1421–1433. https://doi.org/10.1145/3531146.3533199

d. Data Augmentation

Articles in this subsection offer approaches to reducing bias in datasets by changing their composition via techniques such as oversampling or the use of synthetic/pseudo-data.

Iosifidis, V., & Ntoutsi, E. (2018). Dealing with Bias via Data Augmentation in Supervised Learning Scenarios. http://ceur-ws.org/Vol-2103/paper_5.pdf
Pastaltzidis, I., Dimitriou, N., Quezada-Tavarez, K., Aidinlis, S., Marquenie, T., Gurzawska, A., & Tzovaras, D. (2022). Data augmentation for fairness-aware machine learning: Preventing algorithmic bias in law enforcement systems. 2022 ACM Conference on Fairness, Accountability, and Transparency, 2302–2314. https://doi.org/10.1145/3531146.3534644
Sharma, S., Zhang, Y., Ríos Aliaga, J. M., Bouneffouf, D., Muthusamy, V., & Varshney, K. R. (2020). Data Augmentation for Discrimination Prevention and Bias Disambiguation. Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, 358–364. https://doi.org/10.1145/3375627.3375865
Tomalin, M., Byrne, B., Concannon, S., Saunders, D., & Ullmann, S. (2021). The Practical Ethics of Bias Reduction in Machine Translation: Why Domain Adaptation is Better than Data Debiasing. Ethics and Information Technology, 23, 419-433. https://doi.org/10.1007/s10676-021-09583-1

e. Bias Detection

This subsection gathers tools and approaches for detecting bias in datasets.

Chapman, A., Grylls, P., Ugwudike, P., Gammack, D., & Ayling, J. (2022). A Data-Driven Analysis of the Interplay Between Criminology Theory and Predictive Policing Algorithms. 2022 ACM Conference on Fairness, Accountability, and Transparency, 36-45. https://doi.org/10.1145/3531146.3533071
Goyal, P., Romero Soriano, A., Hazirbas, C., Levent, S., & Usunier, N. (2022). Fairness Indicators for Systematic Assessments of Visual Feature Extractors. 2022 ACM Conference on Fairness, Accountability, and Transparency, 70-88. https://doi.org/10.1145/3531146.3533074
Harris, C., Halevy, M., Howard, A., Bruckman, A., & Yang, D. (2022). Exploring the Role of Grammar and Word Choice in Bias Toward African American English (AAE) in Hate Speech Classification. 2022 ACM Conference on Fairness, Accountability, and Transparency, 789-798. https://doi.org/10.1145/3531146.3533144
Hu, X., Wang, H., Vegesana, A., Dube, S., Yu, K., Kao, G., Chen, S.-H., Lu, Y.-H., Thiruvathukal, G. K., & Yin, M. (2020). Crowdsourcing Detection of Sampling Biases in Image Datasets. Proceedings of The Web Conference 2020, 2955–2961. https://doi.org/10.1145/3366423.3380063
Leavy, S., Meaney, G., Wade, K., & Greene, D. (2020). Mitigating Gender Bias in Machine Learning Data Sets. In L. Boratto, S. Faralli, M. Marras, & G. Stilo (Eds.), Bias and Social Aspects in Search and Recommendation, 12–26. Springer International Publishing. https://doi.org/10.1007/978-3-030-52485-2_2
Pahl, J., Rieger, I., Möller, A., Wittenberg, T., & Schmid, U. (2022). Female, White, 27? Bias Evaluation on Data and Algorithms for Affect Recognition in Faces. 2022 ACM Conference on Fairness, Accountability, and Transparency, 973–987. https://doi.org/10.1145/3531146.3533159
Srinivasan, R., & Chander, A. (n.d.). Understanding Bias in Datasets using Topological Data Analysis. http://ceur-ws.org/Vol-2419/paper_9.pdf
Verma, S., Ernst, M., & Just, R. (2021). Removing Biased Data to Improve Fairness and Accuracy. ArXiv. https://arxiv.org/abs/2102.03054
Wang, A., Barocas, S., Laird, K., & Wallach, H. (2022). Measuring Representational Harms in Image Captioning. 2022 ACM Conference on Fairness, Accountability, and Transparency, 324-335. https://doi.org/10.1145/3531146.3533099
Wang, A., Narayanan, A., & Russakovsky, O. (2020). REVISE: A Tool for Measuring and Mitigating Bias in Visual Datasets. ECCV, 733-751. https://doi.org/10.1007/978-3-030-58580-8_43
Wang, A., Ramaswamy, V. V., & Russakovsky, O. (2022). Towards Intersectionality in Machine Learning: Including More Identities, Handling Underrepresentation, and Performing Evaluation. 2022 ACM Conference on Fairness, Accountability, and Transparency, 336–349. https://doi.org/10.1145/3531146.3533101
Zamfirescu-Pereira, J. D., Chen, J., Wen, E., Koenecke, A., Garg, N., & Pierson, E. (2022). Trucks Don’t Mean Trump: Diagnosing Human Error in Image Analysis. 2022 ACM Conference on Fairness, Accountability, and Transparency, 799–813. https://doi.org/10.1145/3531146.3533145

f. Algorithms to Debias Datasets or Mitigate Bias

Research in this subsection deploys algorithmic techniques to either debias datasets before training ML models on them or intervene to mitigate bias after training.

Abbasi-Sureshjani, S., Raumanns, R., Michels, B. E. J., Schouten, G., & Cheplygina, V. (2020). Risk of Training Diagnostic Algorithms on Data with Demographic Bias. In J. Cardoso et al. (Eds.), Interpretable and Annotation-Efficient Learning for Medical Image Computing, 183–192. Springer. https://doi.org/10.1007/978-3-030-61166-8_20
Almuzaini, A. A., Bhatt, C. A., Pennock, D. M., & Singh, V. K. (2022). ABCinML: Anticipatory Bias Correction in Machine Learning Applications. 2022 ACM Conference on Fairness, Accountability, and Transparency, 1552–1560. https://doi.org/10.1145/3531146.3533211
Anahideh, H., Asudeh, A., & Thirumuruganathan, S. (2021). Fair Active Learning. ArXiv. http://arxiv.org/abs/2001.0179
Bolukbasi, T., Chang, K.-W., Zou, J., Saligrama, V., & Kalai, A. (2016). Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings. ArXiv. http://arxiv.org/abs/1607.06520
Hendricks, L. A., Burns, K., Saenko, K., Darrell, T., & Rohrbach, A. (2018). Women Also Snowboard: Overcoming Bias in Captioning Models. ECCV, 771–787. https://openaccess.thecvf.com/content_ECCV_2018/html/Lisa_Anne_Hendricks_Women_also_Snowboard_ECCV_2018_paper.html
Lum, K., Zhang, Y., & Bower, A. (2022). De-Biasing “Bias” Measurement. 2022 ACM Conference on Fairness, Accountability, and Transparency, 379-389. https://doi.org/10.1145/3531146.3533105
Reimers, C., Bodesheim, P., Runge, J., & Denzler, J. (2021). Towards Learning an Unbiased Classifier from Biased Data via Conditional Adversarial Debiasing. ArXiv. https://arxiv.org/abs/2103.06179
Ryu, H. J., Mitchell, M., & Adam, H. (2017). InclusiveFaceNet: Improving Face Attribute Detection with Race and Gender Diversity. ArXiv. https://arxiv.org/abs/1712.00193
Schick, T., Udupa, S., & Schütze, H. (2021). Self-Diagnosis and Self-Debiasing: A Proposal for Reducing Corpus-Based Bias in NLP. ArXiv. https://arxiv.org/abs/2103.00453
Sikdar, S., Lemmerich, F., & Strohmaier, M. (2022). GetFair: Generalized Fairness Tuning of Classification Models. 2022 ACM Conference on Fairness, Accountability, and Transparency, 289-299. https://doi.org/10.1145/3531146.3533094
Zhao, J., Wang, T., Yatskar, M., Ordonez, V., & Chang, K.-W. (2017). Men Also Like Shopping: Reducing Gender Bias Amplification Using Corpus-level Constraints. ArXiv. http://arxiv.org/abs/1707.09457

DATASET DOCUMENTATION PRACTICES

In recent years, there have been calls to increase transparency and standardization for ML datasets so that researchers can better study their composition and effects, as well as identify problems. This section collects these various approaches to dataset documentation.

Bandy, J., & Vincent, N. (2021). Addressing “Documentation Debt” in Machine Learning Research: A Retrospective Datasheet for BookCorpus. ArXiv. https://arxiv.org/abs/2105.05241
Barclay, I., Preece, A., Taylor, I., Radha, S. K., & Nabrzyski, J. (2021). Providing Assurance and Scrutability on Shared Data and Machine Learning Models with Verifiable Credentials. ArXiv. http://arxiv.org/abs/2105.06370
Barclay, I., Preece, A., Taylor, I., & Verma, D. (2019). Towards Traceability in Data Ecosystems Using a Bill of Materials Model. ArXiv. http://arxiv.org/abs/1904.04253
Bender, E. M., & Friedman, B. (2018). Data Statements for Natural Language Processing: Toward Mitigating System Bias and Enabling Better Science. Transactions of the Association for Computational Linguistics, 6, 587–604. https://doi.org/10.1162/tacl_a_00041
Benjamin, M., Gagnon, P., Rostamzadeh, N., Pal, C., Bengio, Y., & Shee, A. (2019). Towards Standardization of Data Licenses: The Montreal Data License. ArXiv. https://doi.org/10.48550/arXiv.1903.12262
Boyd, K. (2020). Understanding and Intervening in Machine Learning Ethics: Supporting Ethical Sensitivity in Training Data Curation. ProQuest [University of Maryland, College Park]. https://www.proquest.com/openview/046800aae7b57cc51efdc1caa7a84cba/1?pq-origsite=gscholar&cbl=18750&diss=y
Crisan, A., Drouhard, M., Vig, J., & Rajani, N. (2022). Interactive Model Cards: A Human-Centered Approach to Model Documentation. 2022 ACM Conference on Fairness, Accountability, and Transparency, 427-439. https://doi.org/10.1145/3531146.3533108
Díaz, M., Kivlichan, I., Rosen, R., Baker, D., Amironesei, R., Prabhakaran, V., & Denton, E. (2022). CrowdWorkSheets: Accounting for Individual and Collective Identities Underlying Crowdsourced Dataset Annotation. 2022 ACM Conference on Fairness, Accountability, and Transparency, 2342–2351. https://doi.org/10.1145/3531146.3534647
Fabris, A., Messina, S., Silvello, G., & Susto, G. A. (2022). Algorithmic Fairness Datasets: The Story so Far. ArXiv. https://doi.org/10.48550/arXiv.2202.01711
Gansky, B., & McDonald, S. (2022). CounterFAccTual: How FAccT Undermines Its Organizing Principles. 2022 ACM Conference on Fairness, Accountability, and Transparency, 1982–1992. https://doi.org/10.1145/3531146.3533241
Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J. W., Wallach, H., Daumé III, H., & Crawford, K. (2020). Datasheets for Datasets. ArXiv. http://arxiv.org/abs/1803.09010
Holland, S., Hosny, A., Newman, S., Joseph, J., & Chmielinski, K. (2018). The Dataset Nutrition Label: A Framework To Drive Higher Data Quality Standards. ArXiv. http://arxiv.org/abs/1805.03677
Luccioni, A. S., Corry, F., Sridharan, H., Ananny, M., Schultz, J., & Crawford, K. (2022). A Framework for Deprecating Datasets: Standardizing Documentation, Identification, and Communication. 2022 ACM Conference on Fairness, Accountability, and Transparency, 199–212. https://doi.org/10.1145/3531146.3533086
McMillan-Major, A., Osei, S., Rodriguez, J. D., Ammanamanchi, P. S., Gehrmann, S., & Jernite, Y. (2021). Reusable Templates and Guides For Documenting Datasets and Models for Natural Language Processing and Generation: A Case Study of the HuggingFace and GEM Data and Model Cards. ArXiv. https://arxiv.org/abs/2108.07374
Mitchell, M., Wu, S., Zaldivar, A., Barnes, P., Vasserman, L., Hutchinson, B., Spitzer, E., Raji, I. D., & Gebru, T. (2019). Model Cards for Model Reporting. Proceedings of the Conference on Fairness, Accountability, and Transparency, 220–229. https://doi.org/10.1145/3287560.3287596
Pushkarna, M., Zaldivar, A., & Kjartansson, O. (2022). Data Cards: Purposeful and Transparent Dataset Documentation for Responsible AI. 2022 ACM Conference on Fairness, Accountability, and Transparency, 1776–1826. https://doi.org/10.1145/3531146.3533231
Rostamzadeh, N., Mincu, D., Roy, S., Smart, A., Wilcox, L., Pushkarna, M., Schrouff, J., Amironesei, R., Moorosi, N., & Heller, K. (2022). Healthsheet: Development of a Transparency Artifact for Health Datasets. 2022 ACM Conference on Fairness, Accountability, and Transparency, 1943–1961. https://doi.org/10.1145/3531146.3533239
Schramowski, P., Tauchmann, C., & Kersting, K. (2022). Can Machines Help Us Answering Question 16 in Datasheets, and In Turn Reflecting on Inappropriate Content? 2022 ACM Conference on Fairness, Accountability, and Transparency, 1350–1361. https://doi.org/10.1145/3531146.3533192
Seck, I., Dahmane, K., Duthon, P., & Loosli, G. (2018). Baselines and a Datasheet for the Cerema AWP dataset. ArXiv. http://arxiv.org/abs/1806.04016
Srinivasan, R., Denton, E., Famularo, J., Rostamzadeh, N., Diaz, F., & Coleman, B. (2021). Artsheets for Art Datasets. Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2). https://openreview.net/forum?id=K7ke_GZ_6N
Zhang, W., Ohrimenko, O., & Cummings, R. (2022). Attribute Privacy: Framework and Mechanisms. 2022 ACM Conference on Fairness, Accountability, and Transparency, 757-766. https://doi.org/10.1145/3531146.3533139

CONFERENCES FOCUSED ON DATASETS

The scholarship summarized in this list spans academic fields, from science and technology studies (STS) to computer science, and from human-computer interaction (HCI) to library science. During the construction of this list, it became clear that certain conference venues and their proceedings are often associated with emerging work on training data. Other conference venues have dedicated workshops or particular tracks to the study of datasets. While this broader list represents training data scholarship at a particular moment in time, these locales provide sites where work on training data has been concentrated or is likely to be found.

ACM CHI Conference on Human Factors in Computing Systems https://chi2021.acm.org/
ACM Conference on Computer Supported Cooperative Work (CSCW) https://dl.acm.org/conference/cscw
ACM Conference on Fairness, Accountability, and Transparency (ACM FAccT) https://facctconference.org/
NeurIPS Datasets and Benchmarks Track https://neurips.cc/Conferences/2021/CallForDatasetsBenchmarks
NeurIPS Data-Centric AI Workshop (2021) https://nips.cc/Conferences/2021/Schedule?showEvent=21860

PRESS TREATMENT OF DATASETS

Popular press treatments of training data have provided a foundation for broader public conversations about these artifacts. The press gathered here represents just a small sample of both the important investigative work into training data and cogent introductions to the subject. Articles on these issues are published frequently, so this is just a selection of starting points.

Argoub, S. (2021, June 9). The NLP Divide: English is Not the Only Natural Language. Polis. https://blogs.lse.ac.uk/polis/2021/06/09/the-nlp-divide-english-is-not-the-only-natural-language/
Buranyi, S. (2017, August 8). Rise of the Racist Robots – How AI is Learning All Our Worst Impulses. The Guardian. http://www.theguardian.com/inequality/2017/aug/08/rise-of-the-racist-robots-how-ai-is-learning-all-our-worst-impulses
Elliott, V. (2021, August 3). Training Self-Driving Cars for $1 an Hour. Rest of World. https://restofworld.org/2021/self-driving-cars-outsourcing/
Feathers, T. (2020, September 17). Fake Data Could Help Solve Machine Learning’s Bias Problem—If We Let It. Slate Magazine. https://slate.com/technology/2020/09/synthetic-data-artificial-intelligence-bias.html
Gershgorn, D. (2018, September 6). If AI is Going to Be the World’s Doctor, It Needs Better Textbooks. Quartz. https://qz.com/1367177/if-ai-is-going-to-be-the-worlds-doctor-it-needs-better-textbooks/
Johnson, K. (2021, June 17). The Efforts to Make Text-Based AI Less Racist and Terrible. Wired. https://www.wired.com/story/efforts-make-text-ai-less-racist-terrible/
Johnson, K. (2021, August 4). This New Way to Train AI Could Curb Online Harassment. Wired. https://www.wired.com/story/new-way-train-ai-curb-online-harassment/
McQuaid, J. (2021, October 18). Can AI’s Voracious Appetite Be Tamed? Undark Magazine. https://undark.org/2021/10/18/computer-scientists-try-to-sidestep-ai-data-dilemma/
Metz, C. (2019, September 20). ‘Nerd,’ ‘Nonsmoker,’ ‘Wrongdoer’: How Might A.I. Label You? The New York Times. https://www.nytimes.com/2019/09/20/arts/design/imagenet-trevor-paglen-ai-facial-recognition.html
Murgia, M., & Harlow, M. (2019, April 19). Who’s Using Your Face? The Ugly Truth About Facial Recognition. Financial Times. https://www.ft.com/content/cf19b956-60a2-11e9-b285-3acd5d43599e
Register, Y. L. (2021, July 22). It’s All Training Data: Using Lessons from Machine Learning to Retrain Your Mind. The Gradient. https://thegradient.pub/its-all-training-data/
Smith, C. S. (2019, November 19). Dealing With Bias in Artificial Intelligence. The New York Times. https://www.nytimes.com/2019/11/19/technology/artificial-intelligence-bias.html
Solon, O. (2019, March 12). Facial Recognition’s “Dirty Little Secret”: Social Media Photos Used Without Consent. NBC News. https://www.nbcnews.com/tech/internet/facial-recognition-s-dirty-little-secret-millions-online-photos-scraped-n981921
Solon, O., & Farivar, C. (2019, May 9). Millions of People Uploaded Photos to the Ever App. Then the Company Used Them to Develop Facial Recognition Tools. NBC News. https://www.nbcnews.com/tech/security/millions-people-uploaded-photos-ever-app-then-company-used-them-n1003371