How should we study datasets in machine learning? As machine learning (ML) increasingly becomes a site of sociotechnical inquiry, invoking numerous social, political, legal, and ethical issues, datasets are a crucial component as they are core material used to train models. Inspired by Tarleton Gillespie and Nick Seaver’s Critical Algorithm Studies reading list, this collection is meant to serve as an entry point to the growing literature on ML datasets across the fields of computer science, human-computer interaction, science and technology studies, media studies, and histories of technology, among others. We compiled this list primarily as a resource for researchers seeking to understand—from a variety of perspectives—how ML datasets work, do work, and are worked upon. We hope it will also be of use to technology practitioners and students seeking to build ML systems.
We limit our scope to works that focus on datasets deployed in the training and testing of ML systems, and despite some overlap, this list is not a primer for the field of critical technology studies more generally. Entries are sorted into various sections with the intention of providing readers a preliminary structure that will help them follow their specific interests. We acknowledge that classificatory practice is always subjective and that many of these titles can fit appropriately under multiple sections or be named in different ways. The current iteration is a reflection of our own ideas and what we find helpful as a way to organize the emerging literature that we are working with. There are certainly other ways to structure this reading list, and we are open to suggestions that expand its range and improve usability. Our focus is primarily on academic publications, but for those who are more interested in understanding how datasets have been discussed in the press as of July 2022, we offer a selection of examples at the end of the reading list.
This list is also not meant to be exhaustive. We see the list as a living resource and invite readers to make suggestions and contributions via this form if there are key titles that they think should be included. Please note that while all links are functional as of July 2022, we are unable to continuously monitor for updated versions of papers or fix broken links.
Despite these limitations, we hope this reading list might serve as a useful resource for scholars and practitioners investigating ML datasets as sociotechnical assemblages that shape and are shaped by social worlds.
1. STARTING POINTS
This section contains a broad set of introductory texts and locales to ground the study of training data. Resources included in this section cover the politics, possibilities, and pitfalls of ML training data and offer early provocations for thinking about particular aspects of training data, such as privacy or bias.
- Barocas, S., & Selbst, A. D. (2016). Big Data’s Disparate Impact. California Law Review, 104(3), 671–732. https://www.californialawreview.org/wp-content/uploads/2016/06/2Barocas-Selbst.pdf
- Crawford, K. (2021). Atlas of AI: Power, Politics and the Planetary Costs of Artificial Intelligence, see ‘Data’ chapter (pp. 89-122). New Haven, CT: Yale University Press.
- Crawford, K., & Paglen, T. (2019). Excavating AI: The Politics of Images in Machine Learning Training Sets. https://excavating.ai
- Denton, E., Hanna, A., Amironesei, R., Smart, A., Nicole, H., & Scheuerman, M. K. (2020). Bringing the People Back In: Contesting Benchmark Machine Learning Datasets. ArXiv. https://arxiv.org/abs/2007.07399
- Harvey, A. (2021). Exposing.ai: Face and Biometric Image Datasets. https://exposing.ai/datasets/
- MacKenzie, A., & Munster, A. (2019). Platform Seeing: Image Ensembles and Their Invisualities. Theory, Culture & Society, 36(5), 3–22. https://doi.org/10.1177/0263276419847508
- Miceli, M., Posada, J., & Yang, T. (2022). Studying Up Machine Learning Data: Why Talk About Bias When We Mean Power? Proceedings of the ACM on Human-Computer Interaction, 6(GROUP), 1–14. https://doi.org/10.1145/3492853
- Paullada, A., Raji, I. D., Bender, E. M., Denton, E., & Hanna, A. (2020). Data and Its (Dis)Contents: A Survey of Dataset Development and Use in Machine Learning Research. ArXiv. https://arxiv.org/abs/2012.05345v1
- Roberge, J., & Castelle, M. (Eds.). (2020). The Cultural Life of Machine Learning: An Incursion into Critical AI Studies (1st ed. 2021 edition). Palgrave Macmillan.
- Srinivasan, R., & Chander, A. (2021). Biases in AI Systems: A Survey for Practitioners. Queue, 19(2), 45-64. https://doi.org/10.1145/3466132.3466134
- Suresh, H., & Guttag, J. V. (2021). A Framework for Understanding Sources of Harm throughout the Machine Learning Life Cycle. ArXiv. http://arxiv.org/abs/1901.10002
- Thylstrup, N. B. (2022). The Ethics and Politics of Data Sets in the Age of Machine Learning: Deleting Traces and Encountering Remains. Media, Culture & Society. https://doi.org/10.1177/01634437211060226
2. CONTEXTUALIZING THE STUDY OF DATASETS
This section consists of broader foundational readings that don’t all necessarily deal specifically with machine learning datasets, but which the authors of this list have found useful to contextualize their study. We acknowledge that the titles below do not form an exhaustive index of all foundational readings, but point to them as particularly helpful ones for thinking about the ontological and epistemological complexities of the “dataset” as an object/genre of analysis.
a. Politics of Classification
This subsection focuses on classification as a practice of not only world-ordering, but also world-making, and how its logics underlie the ways in which datasets are conceived and built.
- Boutyline, A., & Soter, L. K. (2021). Cultural Schemas: What They Are, How to Find Them, and What to Do Once You’ve Caught One. American Sociological Review, 86(4), 728–758. https://doi.org/10.1177/00031224211024525
- Bechmann, A., & Bowker, G. C. (2019). Unsupervised by any other name: Hidden layers of knowledge production in artificial intelligence on social media. Big Data & Society, 6(1). https://doi.org/10.1177/2053951718819569
- Bowker, G. C., & Star, S. L. (2000). Sorting Things Out: Classification and Its Consequences. Cambridge, MA: MIT Press.
- Crawford, K. (2021). Atlas of AI: Power, Politics and the Planetary Costs of Artificial Intelligence, see ‘Classification’ chapter (pp. 123-150). New Haven, CT: Yale University Press.
- Fourcade, M., & Healy, K. (2013). Classification Situations: Life-Chances in the Neoliberal Era. Accounting, Organizations and Society, 38(8), 559-572. https://doi.org/10.1016/j.aos.2013.11.002
- Goodwin, C. (2000). Practices of Color Classification. Mind, Culture, and Activity, 7(1&2), 19-36. https://doi.org/10.1080/10749039.2000.9677646
- Rieder, B. (2017). Scrutinizing an Algorithmic Technique: The Bayes Classifier as Interested Reading of Reality. Information, Communication & Society, 20(1), 100-117. https://doi.org/10.1080/1369118X.2016.1181195
- Sadre-Orafai, S. (2020). Typologies, Typifications, and Types. Annual Review of Anthropology, 49(1), 193-208. https://doi.org/10.1146/annurev-anthro-102218-011235
b. Critical Data Studies
Here, we introduce a few titles from the emerging field of Critical Data Studies which we believe are especially useful for the purposes of acquiring a nuanced and interdisciplinary understanding of datasets.
- Andrejevic, M. (2019). Automated Media (1st edition). Routledge.
- Beer, D. (2018). The Data Gaze. London, UK: SAGE.
- Cheney-Lippold, J. (2017). We Are Data: Algorithms and the Making of our Digital Selves. New York, NY: NYU Press.
- Chun, W. (2021). Discriminating Data. Cambridge, MA: MIT Press.
- Cifor, M., Garcia, P., Cowan, T. L., Rault, J., Sutherland, T., Chan, A., . . . Nakamura, L. (2019). Feminist Data Manifest-No. Retrieved from https://www.manifestno.com/
- Couldry, N., & Mejias, U. A. (2019). The Costs of Connection: How Data Is Colonizing Human Life and Appropriating It for Capitalism. Stanford, CA: Stanford University Press.
- D’Ignazio, C., & Klein, L. F. (2020). Data Feminism. MIT Press.
- Gitelman, L. (Ed.). (2013). “Raw Data” Is an Oxymoron. MIT Press.
- Hansson, K., & Dahlgren, A. (2022). Open research data repositories: Practices, norms, and metadata for sharing images. Journal of the Association for Information Science and Technology, 73(2), 303-316. https://doi.org/10.1002/asi.24571
- Iliadis, A., & Russo, F. (2016). Critical data studies: An introduction. Big Data & Society, 3(2), 1-7. https://doi.org/10.1177/2053951716674238
- Jaton, F. (2021). The Constitution of Algorithms: Ground-Truthing, Programming, Formulating. Cambridge, MA: MIT Press.
- Kitchin, R. (2021). Data Lives. Bristol, UK: Bristol University Press.
- Koopman, C. (2019). How We Became Our Data: A Genealogy of the Informational Person. Chicago, IL: University of Chicago Press.
- Thorp, J. (2021). Living in Data: A Citizen's Guide to a Better Information Future. New York, NY: MCD.
c. Methodologies for Reading Data
This final subsection includes texts that deal more specifically with the different conceptualizations and methodologies through which datasets can be studied/read/analyzed.
- boyd, d., & Crawford, K. (2012). Critical Questions for Big Data: Provocations for a Cultural, Technological, and Scholarly Phenomenon. Information, Communication & Society, 15(5), 662-679. https://doi.org/10.1080/1369118X.2012.678878
- Brock, A. (2015). Deeper Data: A Response to boyd and Crawford. Media, Culture & Society, 37(7), 1084-1088. https://doi.org/10.1177/0163443715594105
- Driscoll, K., & Walker, S. (2014). Working Within a Black Box: Transparency in the Collection and Production of Big Twitter Data. International Journal of Communication, 8, 1745–1764. https://ijoc.org/index.php/ijoc/article/view/2171/1159
- Kitchin, R. (2014). The Data Revolution: Big Data, Open Data, Data Infrastructures and Their Consequences. London, UK: SAGE.
- Leonelli, S., & Tempini, N. (Eds.). (2020). Data Journeys in the Sciences. Springer International Publishing.
- Malevé, N. (2020). On the Data Set’s Ruins. AI & Society, 36, 1117–1131. https://doi.org/10.1007/s00146-020-01093-w
- Metcalf, J., & Crawford, K. (2016). Where Are Human Subjects in Big Data Research? The Emerging Ethics Divide. Big Data & Society, 3(1), 1-14. https://doi.org/10.1177/2053951716650211
- Munk, A. K., Olesen, A. G., & Jacomy, M. (2022). The Thick Machine: Anthropological AI Between Explanation and Explication. Big Data & Society, 9(1), 1-14. https://doi.org/10.1177/20539517211069891
- Pasquale, F. (2021). Licensure as Data Governance. Knight First Amendment Institute. https://knightcolumbia.org/content/licensure-as-data-governance
- Poirier, L. (2021). Reading Datasets: Strategies for Interpreting the Politics of Data Signification. Big Data & Society, 8(2), 1-19. https://doi.org/10.1177/20539517211029322
- Suchman, L., & Trigg, R. H. (1993). Artificial Intelligence as Craftwork. In S. Chaiklin & J. Lave (Eds.), Understanding Practice (pp. 144-178). New York, NY: Cambridge University Press.
- Zook, M., Barocas, S., boyd, d., Crawford, K., Keller, E., Gangadharan, S. P., Goodman, A., Hollander, R., Koenig, B. A., Metcalf, J., Narayanan, A., Nelson, A., & Pasquale, F. (2017). Ten Simple Rules for Responsible Big Data Research. PLOS Computational Biology, 13(3), e1005399. https://doi.org/10.1371/journal.pcbi.1005399
3. PUBLIC SOURCES OF DATASETS
While some datasets lie behind proprietary company walls, numerous datasets are available for public download. This section lists technical papers that accompany major public dataset releases, as well as popular repositories where disparate datasets are organized and made available to the broader public.
a. Source Papers for Noteworthy Datasets
New training datasets are typically accompanied by technical papers explaining the composition of the dataset and its potential applications. These papers often also include analyses of models using the new dataset and comparisons to similar existing datasets. There are far more dataset source papers than can be included on this list; below is a sampling of the most highly cited and broadly influential releases.
- Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., & Schiele, B. (2016). The Cityscapes Dataset for Semantic Urban Scene Understanding. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3213–3223. https://openaccess.thecvf.com/content_cvpr_2016/html/Cordts_The_Cityscapes_Dataset_CVPR_2016_paper.html
- Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). ImageNet: A Large-Scale Hierarchical Image Database. 2009 IEEE Conference on Computer Vision and Pattern Recognition, 248–255. https://doi.org/10.1109/CVPR.2009.5206848
- Geiger, A., Lenz, P., & Urtasun, R. (2012). Are We Ready for Autonomous Driving? The KITTI Vision Benchmark Suite. 2012 IEEE Conference on Computer Vision and Pattern Recognition, 3354–3361. https://doi.org/10.1109/CVPR.2012.6248074
- Huang, G. B., Mattar, M., Berg, T., & Learned-Miller, E. (2008). Labeled Faces in the Wild: A Database for Studying Face Recognition in Unconstrained Environments. Workshop on Faces in “Real-Life” Images: Detection, Alignment, and Recognition. https://hal.inria.fr/inria-00321923
- Krizhevsky, A. (2009). Learning Multiple Layers of Features from Tiny Images. https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf
- LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-Based Learning Applied to Document Recognition. Proceedings of the IEEE, 86(11), 2278–2324. http://yann.lecun.com/exdb/publis/index.html#lecun-98
- Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C. L. (2014). Microsoft COCO: Common Objects in Context. In D. Fleet, T. Pajdla, B. Schiele, & T. Tuytelaars (Eds.), Computer Vision – ECCV 2014 (pp. 740–755). Springer International Publishing. https://doi.org/10.1007/978-3-319-10602-1_48
- Marcus, M., Santorini, B., & Marcinkiewicz, M. A. (1993). Building a Large Annotated Corpus of English: The Penn Treebank. Technical Reports (CIS). https://repository.upenn.edu/cis_reports/237
- Martin, D., Fowlkes, C., Tal, D., & Malik, J. (2001). A Database of Human Segmented Natural Images and Its Application to Evaluating Segmentation Algorithms and Measuring Ecological Statistics. Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001, 2, 416–423. https://doi.org/10.1109/ICCV.2001.937655
- Miller, G. A. (1995). WordNet: A Lexical Database for English. Communications of the ACM, 38(11), 39–41. https://doi.org/10.1145/219717.219748
- Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C. D., Ng, A. Y., & Potts, C. (2013). Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, 1631–1642. https://nlp.stanford.edu/~socherr/EMNLP2013_RNTN.pdf
- Thomee, B., Shamma, D. A., Friedland, G., Elizalde, B., Ni, K., Poland, D., Borth, D., & Li, L.-J. (2016). YFCC100M: The New Data in Multimedia Research. Communications of the ACM, 59(2), 64–73. https://doi.org/10.1145/2812802
b. Dataset Repositories
These sites provide infrastructure for organizing, finding, and downloading a wide range of datasets.
- Papers with Code: https://paperswithcode.com/datasets
- UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets.php
- Kaggle: https://www.kaggle.com/datasets
- Hugging Face: https://huggingface.co/datasets
- Google Dataset Search: https://datasetsearch.research.google.com/
4. STUDYING DATASET PRODUCTION
Training data requires significant human and computational effort to create. It is through this process of production that many of the effects of training data come to be shaped, from the processes of collection to labeling, deployment to deprecation. Texts in this section provide glimpses into the work behind datasets from varying angles, whether examining these production processes from a critical lens or describing the overall workflow of training data production from a technical standpoint.
a. Sociotechnical / Critical Approaches to Labor of Training Data
These texts draw on approaches and frameworks from science and technology studies, political economy, and labor studies to examine the production of training data through a critical lens, attending to how power relations operate in this process.
- Famularo, J., Hensellek, B., & Walsh, P. (2021). Data Stewardship: A Letter to Computer Vision from Cultural Heritage Studies. CVPR 2021. https://www.academia.edu/49423941/Data_Stewardship_A_Letter_to_Computer_Vision_from_Cultural_Heritage_Studies?auto=citations&from=cover_page
- Gray, M. L., & Suri, S. (2019). Ghost Work: How to Stop Silicon Valley from Building a New Global Underclass, see ‘Introduction: Ghosts in the Machine’ (pp. ix-xxxi) and ‘1. Humans in the Loop’ (pp. 1-38). Houghton Mifflin Harcourt.
- Goetze, T. S., & Abramson, D. (2021). Bigger Isn’t Better: The Ethical and Scientific Vices of Extra-Large Datasets in Language Models. WebSci, pp. 69-75. https://doi.org/10.1145/3462741.3466809
- Iliadis, A. (2019). The Tower of Babel problem: Making data make sense with Basic Formal Ontology. Online Information Review, 43(6), 1021–1045. https://doi.org/10.1108/OIR-07-2018-0210
- Jones, P. (2021, September 22). Refugees Help Power Machine Learning Advances at Microsoft, Facebook, and Amazon. Rest of World. https://restofworld.org/2021/refugees-machine-learning-big-tech/
- Miceli, M., Schuessler, M., & Yang, T. (2020). Between Subjectivity and Imposition: Power Dynamics in Data Annotation for Computer Vision. Proceedings of the ACM on Human-Computer Interaction, 4(CSCW2), 1-25. https://doi.org/10.1145/3415186
- Newlands, G. (2021). Lifting the Curtain: Strategic Visibility of Human Labour in AI-as-a-Service. Big Data & Society, 8(1), 1-14. https://doi.org/10.1177/20539517211016026
- Sachs, S. E. (2020). The Algorithm At Work? Explanation and Repair in the Enactment of Similarity in Art Data. Information, Communication & Society, 23(11), 1689–1705. https://doi.org/10.1080/1369118X.2019.1612933
- Sambasivan, N. (2021). Seeing Like a Dataset from the Global South. Interactions, 28(4), 76–78. https://doi.org/10.1145/3466160
- Sap, M., Card, D., Gabriel, S., Choi, Y., & Smith, N. A. (2019). The Risk of Racial Bias in Hate Speech Detection. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 1668–1678. https://doi.org/10.18653/v1/P19-1163
b. Organizational Workflows in Dataset Production
Texts included here look to training data production from a practitioner-oriented lens. They survey either the entire workflow of training data production or specific stages within this process to identify challenges and suggest best practices.
- Amershi, S., Begel, A., Bird, C., DeLine, R., Gall, H., Kamar, E., Nagappan, N., Nushi, B., & Zimmermann, T. (2019). Software Engineering for Machine Learning: A Case Study. 2019 IEEE/ACM 41st International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP), 291–300. https://doi.org/10.1109/ICSE-SEIP.2019.00042
- Ashmore, R., Calinescu, R., & Paterson, C. (2019). Assuring the Machine Learning Lifecycle: Desiderata, Methods, and Challenges. ArXiv. http://arxiv.org/abs/1905.04223
- Barclay, I., Taylor, H., Preece, A., Taylor, I., Verma, D., & de Mel, G. (2020). A Framework for Fostering Transparency in Shared Artificial Intelligence Models by Increasing Visibility of Contributions. Concurrency and Computation: Practice and Experience, 33(19), e6129. https://doi.org/10.1002/cpe.6129
- Bhardwaj, A., Bhattacherjee, S., Chavan, A., Deshpande, A., Elmore, A. J., Madden, S., & Parameswaran, A. G. (2014). DataHub: Collaborative Data Science & Dataset Version Management at Scale. ArXiv. http://arxiv.org/abs/1409.0798
- Chandrabose, A., & Chakravarthi, B. R. (2021). An Overview of Fairness in Data – Illuminating the Bias in Data Pipeline. LTEDI. https://aclanthology.org/2021.ltedi-1.5
- Dong, W., & Fu, W.-T. (2010). Cultural Difference in Image Tagging. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 981–984. https://doi.org/10.1145/1753326.1753472
- Hanley, M., Khandelwal, A., Averbuch-Elor, H., Snavely, N., & Nissenbaum, H. (2020). An Ethical Highlighter for People-Centric Dataset Creation. ArXiv. http://arxiv.org/abs/2011.13583
- Hutchinson, B., Smart, A., Hanna, A., Denton, E., Greer, C., Kjartansson, O., Barnes, P., & Mitchell, M. (2021). Towards Accountability for Machine Learning Datasets: Practices from Software Engineering and Infrastructure. ArXiv. http://arxiv.org/abs/2010.13561
- Geiger, R., Cope, D., Ip, J., Lotosh, M., Shah, A., Weng, J., & Tang, R. (2021). “Garbage In, Garbage Out” Revisited: What Do Machine Learning Application Papers Report About Human-Labeled Training Data? ArXiv. https://doi.org/10.1162/qss_a_00144
- Holstein, K., Vaughan, J. W., Daumé III, H., Dudík, M., & Wallach, H. (2019). Improving Fairness in Machine Learning Systems: What Do Industry Practitioners Need? Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, 1–16. https://doi.org/10.1145/3290605.3300830
- Muller, M. J., Wolf, C. T., Andres, J., Desmond, M., Joshi, N. N., Ashktorab, Z., Sharma, A., Brimijoin, K., Pan, Q., Duesterwald, E., & Dugan, C. (2021). Designing Ground Truth and the Social Life of Labels. Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, 1-16. https://doi.org/10.1145/3411764.3445402
- Polyzotis, N., Roy, S., Whang, S. E., & Zinkevich, M. (2018). Data Lifecycle Challenges in Production Machine Learning: A Survey. ACM SIGMOD Record, 47(2), 17–28. https://doi.org/10.1145/3299887.3299891
- Roh, Y., Heo, G., & Whang, S. E. (2021). A Survey on Data Collection for Machine Learning: A Big Data - AI Integration Perspective. IEEE Transactions on Knowledge and Data Engineering, 33(4), 1328–1347. https://doi.org/10.1109/TKDE.2019.2946162
- Tatman, R. (2018). Setting Up Your Public Data for Success. 2018 IEEE International Conference on Big Data (Big Data), 3261–3262. https://doi.org/10.1109/BigData.2018.8622190
- Sachdeva, P. S., Barreto, R., von Vacano, C., & Kennedy, C. J. (2022). Assessing Annotator Identity Sensitivity via Item Response Theory: A Case Study in a Hate Speech Corpus. 2022 ACM Conference on Fairness, Accountability, and Transparency, 1585–1603. https://doi.org/10.1145/3531146.3533216
- Sambasivan, N., & Veeraraghavan, R. (2022). The Deskilling of Domain Expertise in AI Development. Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems, 1-14. https://doi.org/10.1145/3491102.3517578
- Shanmugam, D., Diaz, F., Shabanian, S., Funck, M., & Biega, A. (2022). Learning to Limit Data Collection via Scaling Laws: A Computational Interpretation for the Legal Principle of Data Minimization. 2022 ACM Conference on Fairness, Accountability, and Transparency, 839-849. https://doi.org/10.1145/3531146.3533148
- Vaughan, J. W. (2018). Making Better Use of the Crowd: How Crowdsourcing Can Advance Machine Learning Research. Journal of Machine Learning Research, 18(193), 1–46. https://dl.acm.org/doi/10.5555/3122009.3242050
- Wang, D., Prabhat, S., & Sambasivan, N. (2022). Whose AI Dream? In search of the aspiration in data annotation. Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems, 1-16. https://doi.org/10.1145/3491102.3502121
5. ANALYSES OF TRAINING DATASETS
This section highlights works that analyze training datasets from a variety of methodological and theoretical perspectives. While we understand that many of the titles that span across the major headings in this reading list involve some form of “dataset analysis,” we highlight in this particular section studies in which the analysis itself comprises the thrust of the article/chapter/work. The works in this section focus primarily on the details of the analysis as opposed to conducting an analysis as a preliminary step to introduce a more central argument or intervention.
a. Sociotechnical & Critical Studies
This subsection focuses on articles and chapters that approach their analyses of training datasets grounded in frameworks primarily taken from critical studies or science and technology studies.
- Bao, M., Zhou, A., Zottola, S. A., Brubach, B., Desmarais, S., Horowitz, A., Lum, K., & Venkatasubramanian, S. (2021). It’s COMPASlicated: The Messy Relationship between RAI Datasets and Algorithmic Fairness Benchmarks. ArXiv. https://arxiv.org/abs/2106.05498
- Busch, L. (2014). A Dozen Ways to Get Lost in Translation: Inherent Challenges in Large Scale Data Sets. International Journal of Communication, 8, 1727-1744. https://ijoc.org/index.php/ijoc/article/view/2160
- Coleman, C. N. (2020). Managing Bias When Library Collections Become Data. International Journal of Librarianship, 5(1), 8–19. https://doi.org/10.23974/ijol.2020.vol5.1.162
- Coveney, P. V., Dougherty, E. R., & Highfield, R. R. (2016). Big Data Need Big Theory Too. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 374, 1-11. https://doi.org/10.1098/rsta.2016.0153
- Feinberg, M. (2017). A Design Perspective on Data. Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, 2952–2963. https://doi.org/10.1145/3025453.3025837
- Jo, E. S., & Gebru, T. (2020). Lessons from Archives: Strategies for Collecting Sociocultural Data in Machine Learning. Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, 306–316. https://doi.org/10.1145/3351095.3372829
- Prabhu, V. U., & Birhane, A. (2020). Large Image Datasets: A Pyrrhic Win for Computer Vision? ArXiv. http://arxiv.org/abs/2006.16923
- Richardson, R., Schultz, J. M., & Crawford, K. (2019). Dirty Data, Bad Predictions: How Civil Rights Violations Impact Police Data, Predictive Policing Systems, and Justice. NYU Law Review, 94(15), 15–55. https://www.nyulawreview.org/online-features/dirty-data-bad-predictions-how-civil-rights-violations-impact-police-data-predictive-policing-systems-and-justice/
- Sambasivan, N., Kapania, S., Highfill, H., Akrong, D., Paritosh, P. K., & Aroyo, L. (2021). “Everyone Wants to Do the Model Work, Not the Data Work”: Data Cascades in High-Stakes AI. Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, 1-15. https://doi.org/10.1145/3411764.3445518
- Scheuerman, M. K., Denton, E., & Hanna, A. (2021). Do Datasets Have Politics? Disciplinary Values in Computer Vision Dataset Development. ArXiv. https://doi.org/10.1145/3476058
- Scheuerman, M. K., Paul, J. M., & Brubaker, J. R. (2019). How Computers See Gender: An Evaluation of Gender Classification in Commercial Facial Analysis Services. Proceedings of the ACM on Human-Computer Interaction, 3(CSCW), 1-33. https://doi.org/10.1145/3359246
- Scheuerman, M. K., Wade, K., Lustig, C., & Brubaker, J. R. (2020). How We’ve Taught Algorithms to See Identity: Constructing Race and Gender in Image Databases for Facial Analysis. Proceedings of the ACM on Human-Computer Interaction, 4(CSCW1), 1-35. https://doi.org/10.1145/3392866
- Smits, T., & Wevers, M. (2021). The Agency of Computer Vision Models as Optical Instruments. Visual Communication, 1-21. https://doi.org/10.1177/1470357221992097
- Stevens, N., & Keyes, O. (2021). Seeing infrastructure: Race, Facial Recognition and the Politics of Data. Cultural Studies, 35(4-5), 833-853. https://doi.org/10.1080/09502386.2021.1895252
- Trewin, S. (2018). AI Fairness for People with Disabilities: Point of View. ArXiv. http://arxiv.org/abs/1811.10670
b. Technical Approaches to Studying Datasets
Here, we introduce works that detail “technical” methods for the study of datasets. While the titles housed under the following subsection 5c, “Technical Audits,” deal with the investigative technical analysis of particular datasets, the works in this subsection are more concerned with introducing technical methods to approach the study of datasets and their particular components. Many of these studies do contain audit-style analyses, but we differentiate them from subsection 5c because their focus is on introducing or using technical methods for dataset analysis in general, as opposed to dissecting various components of particular datasets.
- Balayn, A., Kulynych, B., & Guerses, S. (2021). Exploring Data Pipelines through the Process Lens: A Reference Model for Computer Vision. ArXiv. https://arxiv.org/abs/2107.01824
- Bender, E. M., Gebru, T., McMillan-Major, A., & Mitchell, M. (2021). On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜. FAccT. https://doi.org/10.1145/3442188.3445922
- Blodgett, S. L., Lopez, G., Olteanu, A., Sim, R., & Wallach, H. (2021). Stereotyping Norwegian Salmon: An Inventory of Pitfalls in Fairness Benchmark Datasets. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics, 1, 1004-1015.
- Cheng, V., Suriyakumar, V., Dullerud, N., Joshi, S., & Ghassemi, M. (2021). Can You Fake It Until You Make It?: Impacts of Differentially Private Synthetic Data on Downstream Classification Fairness. Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 149-160. https://doi.org/10.1145/3442188.3445879
- Gardner, M., Merrill, W., Dodge, J., Peters, M. E., Ross, A., Singh, S., & Smith, N. A. (2021). Competency Problems: On Finding and Removing Artifacts in Language Data. ArXiv. https://arxiv.org/abs/2104.08646
- Fabbrizzi, S., Papadopoulos, S., Ntoutsi, E., & Kompatsiaris, Y. (2021). A Survey on Bias in Visual Datasets. ArXiv. https://arxiv.org/abs/2107.07919
- Hirota, Y., Nakashima, Y., & Garcia, N. (2022). Gender and Racial Bias in Visual Question Answering Datasets. 2022 ACM Conference on Fairness, Accountability, and Transparency, 1280–1292. https://doi.org/10.1145/3531146.3533184
- Hutchinson, B., Rostamzadeh, N., Greer, C., Heller, K., & Prabhakaran, V. (2022). Evaluation Gaps in Machine Learning Practice. 2022 ACM Conference on Fairness, Accountability, and Transparency, 1859–1876. https://doi.org/10.1145/3531146.3533233
- Jung, T., Kang, D., Mentch, L., & Hovy, E. (2019). Earlier Isn’t Always Better: Sub-aspect Analysis on Corpus and System Biases in Summarization. ArXiv. http://arxiv.org/abs/1908.11723
- Kilgarriff, A., & Grefenstette, G. (2003). Introduction to the Special Issue on the Web as Corpus. Computational Linguistics, 29(3), 333–348. https://doi.org/10.1162/089120103322711569
- Koesten, L., Vougiouklis, P., Simperl, E., & Groth, P. (2020). Dataset Reuse: Toward Translating Principles to Practice. Patterns, 1(8), 100136. https://doi.org/10.1016/j.patter.2020.100136
- Laranjeira da Silva, C., Macedo, J., Avila, S., & dos Santos, J. (2022). Seeing without Looking: Analysis Pipeline for Child Sexual Abuse Datasets. 2022 ACM Conference on Fairness, Accountability, and Transparency, 2189–2205. https://doi.org/10.1145/3531146.3534636
- Madras, D., Creager, E., Pitassi, T., & Zemel, R. (2019). Fairness through Causal Awareness: Learning Causal Latent-Variable Models for Biased Data. Proceedings of the Conference on Fairness, Accountability, and Transparency, 349–358. https://doi.org/10.1145/3287560.3287564
- Moreno-Torres, J. G., Raeder, T., Alaiz-Rodríguez, R., Chawla, N. V., & Herrera, F. (2012). A Unifying View on Dataset Shift in Classification. Pattern Recognition, 45(1), 521–530. https://doi.org/10.1016/j.patcog.2011.06.019
- Olson, R. S., La Cava, W., Orzechowski, P., Urbanowicz, R. J., & Moore, J. H. (2017). PMLB: A Large Benchmark Suite for Machine Learning Evaluation and Comparison. BioData Mining, 10(36). https://doi.org/10.1186/s13040-017-0154-4
- Rabanser, S., Günnemann, S., & Lipton, Z. C. (2019). Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift. ArXiv. http://arxiv.org/abs/1810.11953
- Rieke, A., Sutherland, V., Svirsky, D., & Hsu, M. (2022). Imperfect Inferences: A Practical Assessment. 2022 ACM Conference on Fairness, Accountability, and Transparency, 767-777. https://doi.org/10.1145/3531146.3533140
- Straw, I., & Callison-Burch, C. (2020). Artificial Intelligence in Mental Health and the Biases of Language Based Models. PLOS ONE, 15(12), e0240376. https://doi.org/10.1371/journal.pone.0240376
- Welty, C., Paritosh, P., & Aroyo, L. (2019). Metrology for AI: From Benchmarks to Instruments. ArXiv. https://arxiv.org/abs/1911.01875v1
- Wesley, A. M., & Matisziw, T. C. (2021). Methods for Measuring Geodiversity in Large Overhead Imagery Datasets. IEEE Access, 9, 100279–100293. https://doi.org/10.1109/ACCESS.2021.3096034
- Zanella-Béguelin, S., Wutschitz, L., Tople, S., Rühle, V., Paverd, A., Ohrimenko, O., Köpf, B., & Brockschmidt, M. (2020). Analyzing Information Leakage of Updates to Natural Language Models. Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security, 363–375. https://doi.org/10.1145/3372297.3417880
- Zhong, R., Chen, Y., Patton, D., Selous, C., & McKeown, K. (2019). Detecting and Reducing Bias in a High Stakes Domain. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 4765–4775. https://doi.org/10.18653/v1/D19-1483
c. Technical Audits
This subsection includes works that employ technical audit-style investigations (e.g., Buolamwini & Gebru, 2018; Raji et al., 2020) of particular datasets.
- Babaeianjelodar, M., Lorenz, S., Gordon, J., Matthews, J., & Freitag, E. (2020). Quantifying Gender Bias in Different Corpora. Companion Proceedings of the Web Conference 2020, 752–759. https://doi.org/10.1145/3366424.3383559
- Bountouridis, D., Makhortykh, M., Sullivan, E., Harambam, J., Tintarev, N., & Hauff, C. (2019). Annotating Credibility: Identifying and Mitigating Bias in Credibility Datasets. ROME 2019 - Workshop on Reducing Online Misinformation Exposure. https://rome2019.github.io/papers/Bountouridis_etal_ROME2019.pdf
- Buolamwini, J., & Gebru, T. (2018, January). Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification. Conference on Fairness, Accountability and Transparency, 77–91. https://www.media.mit.edu/publications/gender-shades-intersectional-accuracy-disparities-in-commercial-gender-classification/
- Costanza-Chock, S., Raji, I. D., & Buolamwini, J. (2022). Who Audits the Auditors? Recommendations from a field scan of the algorithmic auditing ecosystem. 2022 ACM Conference on Fairness, Accountability, and Transparency, 1571–1583. https://doi.org/10.1145/3531146.3533213
- Davidson, T., Bhattacharya, D., & Weber, I. (2019). Racial Bias in Hate Speech and Abusive Language Detection Datasets. ArXiv. http://arxiv.org/abs/1905.12516
- Dulhanty, C., & Wong, A. (2019). Auditing ImageNet: Towards a Model-driven Framework for Annotating Demographic Attributes of Large-Scale Image Datasets. ArXiv. http://arxiv.org/abs/1905.01347
- Dulhanty, C., & Wong, A. (2020). Investigating the Impact of Inclusion in Face Recognition Training Data on Individual Face Identification. Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, 244–250. https://doi.org/10.1145/3375627.3375875
- Dulhanty, C. (2020). Issues in Computer Vision Data Collection: Bias, Consent, and Label Taxonomy [University of Waterloo]. https://uwspace.uwaterloo.ca/handle/10012/16414
- Heinzerling, B. (2019, July 21). NLP’s Clever Hans Moment has Arrived. Benjamin Heinzerling. https://bheinzerling.github.io/post/clever-hans/
- Hutchinson, B., Prabhakaran, V., Denton, E., Webster, K., Zhong, Y., & Denuyl, S. (2020). Social Biases in NLP Models as Barriers for Persons with Disabilities. ArXiv. http://arxiv.org/abs/2005.00813
- Klockmann, V., von Schenk, A., & Villeval, M. C. (2021). Artificial Intelligence, Ethics, and Diffused Pivotality. Working Paper Series, GATE. https://ssrn.com/abstract=3853829
- Luccioni, A., & Viviano, J. (2021). What’s in the Box? A Preliminary Analysis of Undesirable Content in the Common Crawl Corpus. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, 182-189. https://aclanthology.org/2021.acl-short.24.pdf
- Mecati, M., Cannavò, F. E., Vetrò, A., & Torchiano, M. (2020). Identifying Risks in Datasets for Automated Decision–Making. In G. Viale Pereira, M. Janssen, H. Lee, I. Lindgren, M. P. Rodríguez Bolívar, H. J. Scholl, & A. Zuiderwijk (Eds.), Electronic Government (pp. 332–344). Springer International Publishing. https://doi.org/10.1007/978-3-030-57599-1_25
- Raji, I. D., & Fried, G. (2021). About Face: A Survey of Facial Recognition Evaluation. ArXiv. http://arxiv.org/abs/2102.00813
- Raji, I. D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., & Denton, E. (2020). Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing. ArXiv. http://arxiv.org/abs/2001.00964
- Rambachan, A., & Roth, J. (2020). Bias In, Bias Out? Evaluating the Folk Wisdom. ArXiv. https://doi.org/10.4230/LIPIcs.FORC.2020.6
- Shankar, S., Halpern, Y., Breck, E., Atwood, J., Wilson, J., & Sculley, D. (2017). No Classification without Representation: Assessing Geodiversity Issues in Open Data Sets for the Developing World. ArXiv. https://arxiv.org/abs/1711.08536
- Vidgen, B., & Derczynski, L. (2020). Directions in Abusive Language Training Data: Garbage In, Garbage Out. ArXiv. https://arxiv.org/abs/2004.01670
- Wang, T., Zhao, J., Yatskar, M., Chang, K.-W., & Ordonez, V. (2019). Balanced Datasets Are Not Enough: Estimating and Mitigating Gender Bias in Deep Image Representations. 2019 IEEE/CVF International Conference on Computer Vision (ICCV). https://doi.org/10.1109/ICCV.2019.00541
d. Visual & Artistic Approaches to Datasets
This final subsection assembles artistic and visual approaches/formats for the analysis of datasets.
- Baker, D. (2022). Datasets Have Worldviews [Website]. PAIR Explorables. https://pair.withgoogle.com/explorables/dataset-worldviews/
- Crawford, K., & Paglen, T. (2019). Training Humans [Large-scale exhibition]. Fondazione Prada, Milan, 2019–2020. https://www.fondazioneprada.org/project/training-humans/?lang=en (Publication: Training Humans Book)
- Dewey-Hagborg, H. (2019). How Do You See Me? [Adversarial processes]. The Photographers’ Gallery, London, UK. https://thephotographersgallery.org.uk/whats-on/heather-dewey-hagborg-how-do-you-see-me
- Malevé, N. (2019). 12 hours of ImageNet [Computer script]. The Photographers’ Gallery, London, UK. https://thephotographersgallery.org.uk/whats-on/exhibiting-imagenet
- Paglen, T., & Crawford, K. (2019). ImageNet Roulette [Software program]. Launched at SXSW. https://www.youtube.com/watch?v=S0yEPZJnvgs
- Pipkin, E. (2020). On Lacework: Watching an Entire Machine-Learning Dataset. Unthinking Photography. https://unthinking.photography/articles/on-lacework
- Ridler, A. (2018). Myriad (Tulips) [C-type digital prints with handwritten annotations, magnetic paint, magnets]. Barbican Centre, London, UK. http://annaridler.com/myriad-tulips
6. RESPONSES TO DATASET PROBLEMS
Here we assemble literature that proposes responses to commonly identified sociotechnical problems with ML datasets. Most of the articles in this vein focus on technical responses to bias (writ broadly), while a few address other concerns such as privacy and security. We do not necessarily endorse these approaches; rather, this is a loose mapping of emerging areas of focus in response to problems. Note that there is some overlap with the readings suggested in Section 5, as many of these papers investigate particular datasets; however, the papers listed here emphasize approaches to addressing specific problems.
a. General Recommendations for Dataset Design
This subsection covers miscellaneous broad recommendations for the creation of fairer and more accountable datasets.
- Andrus, M., & Villeneuve, S. (2022). Demographic-Reliant Algorithmic Fairness: Characterizing the Risks of Demographic Data Collection in the Pursuit of Fairness. 2022 ACM Conference on Fairness, Accountability, and Transparency, 1709–1721. https://doi.org/10.1145/3531146.3533226
- Bilstrup, K.-E. K., Kaspersen, M. H., Assent, I., Enni, S., & Petersen, M. G. (2022). From Demo to Design in Teaching Machine Learning. 2022 ACM Conference on Fairness, Accountability, and Transparency, 2168–2178. https://doi.org/10.1145/3531146.3534634
- Bowman, S. R., & Dahl, G. E. (2021). What Will it Take to Fix Benchmarking in Natural Language Understanding? NAACL. https://doi.org/10.18653/V1/2021.NAACL-MAIN.385
- Boyd, K. (2022). Designing Up with Value-Sensitive Design: Building a Field Guide for Ethical ML Development. 2022 ACM Conference on Fairness, Accountability, and Transparency, 2069–2082. https://doi.org/10.1145/3531146.3534626
- Kiela, D., Bartolo, M., Nie, Y., Kaushik, D., Geiger, A., Wu, Z., Vidgen, B., Prasad, G., Singh, A., Ringshia, P., Ma, Z., Thrush, T., Riedel, S., Waseem, Z., Stenetorp, P., Jia, R., Bansal, M., Potts, C., & Williams, A. (2021). Dynabench: Rethinking Benchmarking in NLP. NAACL. https://doi.org/10.18653/V1/2021.NAACL-MAIN.324
- Panch, T., Pollard, T. J., Mattie, H., Lindemer, E., Keane, P. A., & Celi, L. A. (2020). “Yes, But Will It Work for My Patients?” Driving Clinically Relevant Research with Benchmark Datasets. Npj Digital Medicine, 3(1), 1–4. https://doi.org/10.1038/s41746-020-0295-6
- Peng, K., Mathur, A., & Narayanan, A. (2021). Mitigating Dataset Harms Requires Stewardship: Lessons from 1000 Papers. ArXiv. http://arxiv.org/abs/2108.02922
- Rogers, A. (2021). Changing the World by Changing the Data. ArXiv. https://arxiv.org/abs/2105.13947
- Rolf, E., Worledge, T., Recht, B., & Jordan, M. I. (2021). Representation Matters: Assessing the Importance of Subgroup Allocations in Training Data. ArXiv. https://arxiv.org/abs/2103.03399
- Selbst, A. D., Boyd, D., Friedler, S. A., Venkatasubramanian, S., & Vertesi, J. (2019). Fairness and Abstraction in Sociotechnical Systems. Proceedings of the Conference on Fairness, Accountability, and Transparency, 59–68. https://doi.org/10.1145/3287560.3287598
- Suresh, H., Movva, R., Lee Dogan, A., Bhargava, D., Isadora, C., Martinez Cuba, A., Taurino, G., So, W., & D’Ignazio, C. (2022). Towards Intersectional Feminist and Participatory ML: A Case Study in Supporting Femicide Counterdata Collection. 2022 ACM Conference on Fairness, Accountability, and Transparency, 667-678. https://doi.org/10.1145/3531146.3533132
- Stasaski, K., Yang, G. H., & Hearst, M. A. (2020). More Diverse Dialogue Datasets via Diversity-Informed Data Collection. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 4958–4968. https://doi.org/10.18653/v1/2020.acl-main.446
b. Creating New Datasets and/or Remediation of Existing Datasets
This subsection includes articles that either remediate specific existing datasets or detail the creation of alternative datasets to address identified privacy and bias issues.
- Asano, Y., Rupprecht, C., Zisserman, A., & Vedaldi, A. (2021). PASS: An ImageNet Replacement for Self-Supervised Pretraining Without Humans. ArXiv. https://arxiv.org/abs/2109.13228
- Brown, H., Lee, K., Mireshghallah, F., Shokri, R., & Tramèr, F. (2022). What Does it Mean for a Language Model to Preserve Privacy? 2022 ACM Conference on Fairness, Accountability, and Transparency, 2280–2292. https://doi.org/10.1145/3531146.3534642
- Cai, W., Encarnacion, R., Chern, B., Corbett-Davies, S., Bogen, M., Bergman, S., & Goel, S. (2022). Adaptive Sampling Strategies to Construct Equitable Training Datasets. 2022 ACM Conference on Fairness, Accountability, and Transparency, 1467–1478. https://doi.org/10.1145/3531146.3533203
- Jernite, Y., Nguyen, H., Biderman, S., Rogers, A., Masoud, M., Danchev, V., Tan, S., Luccioni, A. S., Subramani, N., Johnson, I., Dupont, G., Dodge, J., Lo, K., Talat, Z., Radev, D., Gokaslan, A., Nikpoor, S., Henderson, P., Bommasani, R., & Mitchell, M. (2022). Data Governance in the Age of Large-Scale Data-Driven Language Technology. 2022 ACM Conference on Fairness, Accountability, and Transparency, 2206–2222. https://doi.org/10.1145/3531146.3534637
- Khashabi, D., Chaturvedi, S., Roth, M., Upadhyay, S., & Roth, D. (2018). Looking Beyond the Surface: A Challenge Set for Reading Comprehension over Multiple Sentences. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 1, 252–262. https://doi.org/10.18653/v1/N18-1023
- Yang, K., Qinami, K., Fei-Fei, L., Deng, J., & Russakovsky, O. (2020). Towards Fairer Datasets: Filtering and Balancing the Distribution of the People Subtree in the ImageNet Hierarchy. FAT* '20: Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, 547-558. https://doi.org/10.1145/3351095.3375709
- Yang, K., Yau, J., Fei-Fei, L., Deng, J., & Russakovsky, O. (2021). A Study of Face Obfuscation in ImageNet. ArXiv. https://arxiv.org/abs/2103.06191
- Zellers, R., Bisk, Y., Schwartz, R., & Choi, Y. (2018). SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference. ArXiv. https://arxiv.org/abs/1808.05326v1
c. Data Annotation Workflows
Articles in this subsection address biased machine learning datasets by proposing changes to data annotation processes.
- Barbosa, N. M., & Chen, M. (2019). Rehumanized Crowdsourcing: A Labeling Framework Addressing Bias and Ethics in Machine Learning. Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, 1–12. http://doi.org/10.1145/3290605.3300773
- Beretta, E., Vetrò, A., Lepri, B., & De Martin, J. C. (2021). Detecting Discriminatory Risk Through Data Annotation Based on Bayesian Inferences. FAccT. https://doi.org/10.1145/3442188.3445940
- Beretta, E., Vetrò, A., Lepri, B., & De Martin, J. C. (2019). Ethical and Socially-Aware Data Labels. In J. A. Lossio-Ventura, D. Muñante, & H. Alatrista-Salas (Eds.), Information Management and Big Data, 320–327. Springer International Publishing. https://doi.org/10.1007/978-3-030-11680-4_30
- Rateike, M., Majumdar, A., Mineeva, O., Gummadi, K. P., & Valera, I. (2022). Don’t Throw it Away! The Utility of Unlabeled Data in Fair Decision Making. 2022 ACM Conference on Fairness, Accountability, and Transparency, 1421–1433. https://doi.org/10.1145/3531146.3533199
d. Data Augmentation
Articles in this subsection offer approaches to reducing bias in datasets by changing their composition via techniques such as oversampling or the use of synthetic/pseudo-data.
- Iosifidis, V., & Ntoutsi, E. (2018). Dealing with Bias via Data Augmentation in Supervised Learning Scenarios. http://ceur-ws.org/Vol-2103/paper_5.pdf
- Pastaltzidis, I., Dimitriou, N., Quezada-Tavarez, K., Aidinlis, S., Marquenie, T., Gurzawska, A., & Tzovaras, D. (2022). Data augmentation for fairness-aware machine learning: Preventing algorithmic bias in law enforcement systems. 2022 ACM Conference on Fairness, Accountability, and Transparency, 2302–2314. https://doi.org/10.1145/3531146.3534644
- Sharma, S., Zhang, Y., Ríos Aliaga, J. M., Bouneffouf, D., Muthusamy, V., & Varshney, K. R. (2020). Data Augmentation for Discrimination Prevention and Bias Disambiguation. Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, 358–364. https://doi.org/10.1145/3375627.3375865
- Tomalin, M., Byrne, B., Concannon, S., Saunders, D., & Ullmann, S. (2021). The Practical Ethics of Bias Reduction in Machine Translation: Why Domain Adaptation is Better than Data Debiasing. Ethics and Information Technology, 23, 419-433. https://doi.org/10.1007/s10676-021-09583-1
e. Bias Detection
This subsection gathers tools and approaches for detecting bias in datasets.
- Chapman, A., Grylls, P., Ugwudike, P., Gammack, D., & Ayling, J. (2022). A Data-Driven Analysis of the Interplay Between Criminology Theory and Predictive Policing Algorithms. 2022 ACM Conference on Fairness, Accountability, and Transparency, 36-45. https://doi.org/10.1145/3531146.3533071
- Goyal, P., Romero Soriano, A., Hazirbas, C., Levent, S., & Usunier, N. (2022). Fairness Indicators for Systematic Assessments of Visual Feature Extractors. 2022 ACM Conference on Fairness, Accountability, and Transparency, 70-88. https://doi.org/10.1145/3531146.3533074
- Harris, C., Halevy, M., Howard, A., Bruckman, A., & Yang, D. (2022). Exploring the Role of Grammar and Word Choice in Bias Toward African American English (AAE) in Hate Speech Classification. 2022 ACM Conference on Fairness, Accountability, and Transparency, 789-798. https://doi.org/10.1145/3531146.3533144
- Hu, X., Wang, H., Vegesana, A., Dube, S., Yu, K., Kao, G., Chen, S.-H., Lu, Y.-H., Thiruvathukal, G. K., & Yin, M. (2020). Crowdsourcing Detection of Sampling Biases in Image Datasets. Proceedings of The Web Conference 2020, 2955–2961. https://doi.org/10.1145/3366423.3380063
- Leavy, S., Meaney, G., Wade, K., & Greene, D. (2020). Mitigating Gender Bias in Machine Learning Data Sets. In L. Boratto, S. Faralli, M. Marras, & G. Stilo (Eds.), Bias and Social Aspects in Search and Recommendation, 12–26. Springer International Publishing. https://doi.org/10.1007/978-3-030-52485-2_2
- Pahl, J., Rieger, I., Möller, A., Wittenberg, T., & Schmid, U. (2022). Female, White, 27? Bias Evaluation on Data and Algorithms for Affect Recognition in Faces. 2022 ACM Conference on Fairness, Accountability, and Transparency, 973-987. https://doi.org/10.1145/3531146.3533159
- Srinivasan, R., & Chander, A. (n.d.). Understanding Bias in Datasets using Topological Data Analysis. http://ceur-ws.org/Vol-2419/paper_9.pdf
- Verma, S., Ernst, M., & Just, R. (2021). Removing Biased Data to Improve Fairness and Accuracy. ArXiv. https://arxiv.org/abs/2102.03054
- Wang, A., Barocas, S., Laird, K., & Wallach, H. (2022). Measuring Representational Harms in Image Captioning. 2022 ACM Conference on Fairness, Accountability, and Transparency, 324-335. https://doi.org/10.1145/3531146.3533099
- Wang, A., Narayanan, A., & Russakovsky, O. (2020). REVISE: A Tool for Measuring and Mitigating Bias in Visual Datasets. ECCV, 733-751. https://doi.org/10.1007/978-3-030-58580-8_43
- Wang, A., Ramaswamy, V. V., & Russakovsky, O. (2022). Towards Intersectionality in Machine Learning: Including More Identities, Handling Underrepresentation, and Performing Evaluation. 2022 ACM Conference on Fairness, Accountability, and Transparency, 336-349. https://doi.org/10.1145/3531146.3533101
- Zamfirescu-Pereira, J. D., Chen, J., Wen, E., Koenecke, A., Garg, N., & Pierson, E. (2022). Trucks Don’t Mean Trump: Diagnosing Human Error in Image Analysis. 2022 ACM Conference on Fairness, Accountability, and Transparency, 799-813. https://doi.org/10.1145/3531146.3533145
f. Algorithms to Debias Datasets or Mitigate Bias
Research in this subsection deploys algorithmic techniques to either debias datasets before training ML models on them or intervene to mitigate bias after training.
- Abbasi-Sureshjani, S., Raumanns, R., Michels, B. E. J., Schouten, G., & Cheplygina, V. (2020). Risk of Training Diagnostic Algorithms on Data with Demographic Bias. In J. Cardoso et al. (Eds.), Interpretable and Annotation-Efficient Learning for Medical Image Computing, 183–192. Springer. https://doi.org/10.1007/978-3-030-61166-8_20
- Almuzaini, A. A., Bhatt, C. A., Pennock, D. M., & Singh, V. K. (2022). ABCinML: Anticipatory Bias Correction in Machine Learning Applications. 2022 ACM Conference on Fairness, Accountability, and Transparency, 1552–1560. https://doi.org/10.1145/3531146.3533211
- Anahideh, H., Asudeh, A., & Thirumuruganathan, S. (2021). Fair Active Learning. ArXiv. http://arxiv.org/abs/2001.0179
- Bolukbasi, T., Chang, K.-W., Zou, J., Saligrama, V., & Kalai, A. (2016). Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings. ArXiv. http://arxiv.org/abs/1607.06520
- Hendricks, L. A., Burns, K., Saenko, K., Darrell, T., & Rohrbach, A. (2018). Women Also Snowboard: Overcoming Bias in Captioning Models. ECCV, 771–787. https://openaccess.thecvf.com/content_ECCV_2018/html/Lisa_Anne_Hendricks_Women_also_Snowboard_ECCV_2018_paper.html
- Lum, K., Zhang, Y., & Bower, A. (2022). De-Biasing “Bias” Measurement. 2022 ACM Conference on Fairness, Accountability, and Transparency, 379-389. https://doi.org/10.1145/3531146.3533105
- Reimers, C., Bodesheim, P., Runge, J., & Denzler, J. (2021). Towards Learning an Unbiased Classifier from Biased Data via Conditional Adversarial Debiasing. ArXiv. https://arxiv.org/abs/2103.06179
- Ryu, H. J., Mitchell, M., & Adam, H. (2017). InclusiveFaceNet: Improving Face Attribute Detection with Race and Gender Diversity. ArXiv. https://arxiv.org/abs/1712.00193
- Schick, T., Udupa, S., & Schütze, H. (2021). Self-Diagnosis and Self-Debiasing: A Proposal for Reducing Corpus-Based Bias in NLP. ArXiv. https://arxiv.org/abs/2103.00453
- Sikdar, S., Lemmerich, F., & Strohmaier, M. (2022). GetFair: Generalized Fairness Tuning of Classification Models. 2022 ACM Conference on Fairness, Accountability, and Transparency, 289-299. https://doi.org/10.1145/3531146.3533094
- Zhao, J., Wang, T., Yatskar, M., Ordonez, V., & Chang, K.-W. (2017). Men Also Like Shopping: Reducing Gender Bias Amplification Using Corpus-level Constraints. ArXiv. http://arxiv.org/abs/1707.09457
7. DATASET DOCUMENTATION PRACTICES
In recent years, there have been calls to increase transparency and standardization for ML datasets so that researchers can better study their composition and effects, as well as identify problems. This section collects these various approaches to dataset documentation.
- Bandy, J., & Vincent, N. (2021). Addressing “Documentation Debt” in Machine Learning Research: A Retrospective Datasheet for BookCorpus. ArXiv. https://arxiv.org/abs/2105.05241
- Barclay, I., Preece, A., Taylor, I., Radha, S. K., & Nabrzyski, J. (2021). Providing Assurance and Scrutability on Shared Data and Machine Learning Models with Verifiable Credentials. ArXiv. http://arxiv.org/abs/2105.06370
- Barclay, I., Preece, A., Taylor, I., & Verma, D. (2019). Towards Traceability in Data Ecosystems Using a Bill of Materials Model. ArXiv. http://arxiv.org/abs/1904.04253
- Bender, E. M., & Friedman, B. (2018). Data Statements for Natural Language Processing: Toward Mitigating System Bias and Enabling Better Science. Transactions of the Association for Computational Linguistics, 6, 587–604. https://doi.org/10.1162/tacl_a_00041
- Benjamin, M., Gagnon, P., Rostamzadeh, N., Pal, C., Bengio, Y., & Shee, A. (2019). Towards Standardization of Data Licenses: The Montreal Data License. ArXiv. https://doi.org/10.48550/arXiv.1903.12262
- Boyd, K. (2020). Understanding and Intervening in Machine Learning Ethics: Supporting Ethical Sensitivity in Training Data Curation. ProQuest [University of Maryland, College Park]. https://www.proquest.com/openview/046800aae7b57cc51efdc1caa7a84cba/1?pq-origsite=gscholar&cbl=18750&diss=y
- Crisan, A., Drouhard, M., Vig, J., & Rajani, N. (2022). Interactive Model Cards: A Human-Centered Approach to Model Documentation. 2022 ACM Conference on Fairness, Accountability, and Transparency, 427-439. https://doi.org/10.1145/3531146.3533108
- Díaz, M., Kivlichan, I., Rosen, R., Baker, D., Amironesei, R., Prabhakaran, V., & Denton, E. (2022). CrowdWorkSheets: Accounting for Individual and Collective Identities Underlying Crowdsourced Dataset Annotation. 2022 ACM Conference on Fairness, Accountability, and Transparency, 2342–2351. https://doi.org/10.1145/3531146.3534647
- Fabris, A., Messina, S., Silvello, G., & Susto, G. A. (2022). Algorithmic Fairness Datasets: The Story so Far. ArXiv. https://doi.org/10.48550/arXiv.2202.01711
- Gansky, B., & McDonald, S. (2022). CounterFAccTual: How FAccT Undermines Its Organizing Principles. 2022 ACM Conference on Fairness, Accountability, and Transparency, 1982–1992. https://doi.org/10.1145/3531146.3533241
- Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J. W., Wallach, H., Daumé III, H., & Crawford, K. (2020). Datasheets for Datasets. ArXiv. http://arxiv.org/abs/1803.09010
- Holland, S., Hosny, A., Newman, S., Joseph, J., & Chmielinski, K. (2018). The Dataset Nutrition Label: A Framework To Drive Higher Data Quality Standards. ArXiv. http://arxiv.org/abs/1805.03677
- Luccioni, A. S., Corry, F., Sridharan, H., Ananny, M., Schultz, J., & Crawford, K. (2022). A Framework for Deprecating Datasets: Standardizing Documentation, Identification, and Communication. 2022 ACM Conference on Fairness, Accountability, and Transparency, 199–212. https://doi.org/10.1145/3531146.3533086
- McMillan-Major, A., Osei, S., Rodriguez, J. D., Ammanamanchi, P. S., Gehrmann, S., & Jernite, Y. (2021). Reusable Templates and Guides For Documenting Datasets and Models for Natural Language Processing and Generation: A Case Study of the HuggingFace and GEM Data and Model Cards. ArXiv. https://arxiv.org/abs/2108.07374
- Mitchell, M., Wu, S., Zaldivar, A., Barnes, P., Vasserman, L., Hutchinson, B., Spitzer, E., Raji, I. D., & Gebru, T. (2019). Model Cards for Model Reporting. Proceedings of the Conference on Fairness, Accountability, and Transparency, 220–229. https://doi.org/10.1145/3287560.3287596
- Pushkarna, M., Zaldivar, A., & Kjartansson, O. (2022). Data Cards: Purposeful and Transparent Dataset Documentation for Responsible AI. 2022 ACM Conference on Fairness, Accountability, and Transparency, 1776–1826. https://doi.org/10.1145/3531146.3533231
- Rostamzadeh, N., Mincu, D., Roy, S., Smart, A., Wilcox, L., Pushkarna, M., Schrouff, J., Amironesei, R., Moorosi, N., & Heller, K. (2022). Healthsheet: Development of a Transparency Artifact for Health Datasets. 2022 ACM Conference on Fairness, Accountability, and Transparency, 1943–1961. https://doi.org/10.1145/3531146.3533239
- Seck, I., Dahmane, K., Duthon, P., & Loosli, G. (2018). Baselines and a Datasheet for the Cerema AWP dataset. ArXiv. http://arxiv.org/abs/1806.04016
- Schramowski, P., Tauchmann, C., & Kersting, K. (2022). Can Machines Help Us Answering Question 16 in Datasheets, and In Turn Reflecting on Inappropriate Content? 2022 ACM Conference on Fairness, Accountability, and Transparency, 1350–1361. https://doi.org/10.1145/3531146.3533192
- Srinivasan, R., Denton, E., Famularo, J., Rostamzadeh, N., Diaz, F., & Coleman, B. (2021). Artsheets for Art Datasets. Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2). https://openreview.net/forum?id=K7ke_GZ_6N
- Zhang, W., Ohrimenko, O., & Cummings, R. (2022). Attribute Privacy: Framework and Mechanisms. 2022 ACM Conference on Fairness, Accountability, and Transparency, 757-766. https://doi.org/10.1145/3531146.3533139
8. CONFERENCES FOCUSED ON DATASETS
The scholarship summarized in this list spans academic fields, from science and technology studies (STS) to computer science, and human-computer interaction (HCI) to library science. During the construction of this list, it became clear that certain conference venues and their proceedings are often associated with emerging work on training data. Other conference venues have dedicated workshops or particular tracks to the study of datasets. While this broader list represents training data scholarship at a particular moment in time, these locales provide sites where work on training data has been concentrated or is likely to be found.
- ACM CHI Conference on Human Factors in Computing Systems https://chi2021.acm.org/
- ACM Conference on Computer Supported Cooperative Work (CSCW) https://dl.acm.org/conference/cscw
- ACM Conference on Fairness, Accountability, and Transparency (ACM FAccT) https://facctconference.org/
- NeurIPS Datasets and Benchmarks Track https://neurips.cc/Conferences/2021/CallForDatasetsBenchmarks
- NeurIPS Data-Centric AI Workshop (2021) https://nips.cc/Conferences/2021/Schedule?showEvent=21860
9. PRESS TREATMENT OF DATASETS
Popular press treatments of training data have provided a foundation for broader public conversations about these artifacts. The press gathered here represents just a small sample of both the important investigative work into training data and cogent introductions to the subject. Articles are frequently published on these issues, so this is just a selection of starting points.
- Argoub, S. (2021, June 9). The NLP Divide: English is Not the Only Natural Language. Polis. https://blogs.lse.ac.uk/polis/2021/06/09/the-nlp-divide-english-is-not-the-only-natural-language/
- Buranyi, S. (2017, August 8). Rise of the Racist Robots – How AI is Learning All Our Worst Impulses. The Guardian. http://www.theguardian.com/inequality/2017/aug/08/rise-of-the-racist-robots-how-ai-is-learning-all-our-worst-impulses
- Elliott, V. (2021, August 3). Training Self-Driving Cars for $1 an Hour. Rest of World. https://restofworld.org/2021/self-driving-cars-outsourcing/
- McQuaid, J. (2021, October 18). Can AI’s Voracious Appetite Be Tamed? Undark Magazine. https://undark.org/2021/10/18/computer-scientists-try-to-sidestep-ai-data-dilemma/
- Smith, C. S. (2019, November 19). Dealing With Bias in Artificial Intelligence. The New York Times. https://www.nytimes.com/2019/11/19/technology/artificial-intelligence-bias.html
- Feathers, T. (2020, September 17). Fake Data Could Help Solve Machine Learning’s Bias Problem—If We Let It. Slate Magazine. https://slate.com/technology/2020/09/synthetic-data-artificial-intelligence-bias.html
- Gershgorn, D. (2018, September 6). If AI is Going to Be the World’s Doctor, It Needs Better Textbooks. Quartz. https://qz.com/1367177/if-ai-is-going-to-be-the-worlds-doctor-it-needs-better-textbooks/
- Johnson, K. (2021, June 17). The Efforts to Make Text-Based AI Less Racist and Terrible. Wired. https://www.wired.com/story/efforts-make-text-ai-less-racist-terrible/
- Johnson, K. (2021, August 4). This New Way to Train AI Could Curb Online Harassment. Wired. https://www.wired.com/story/new-way-train-ai-curb-online-harassment/
- Metz, C. (2019, September 20). ‘Nerd,’ ‘Nonsmoker,’ ‘Wrongdoer’: How Might A.I. Label You? The New York Times. https://www.nytimes.com/2019/09/20/arts/design/imagenet-trevor-paglen-ai-facial-recognition.html
- Murgia, M., & Harlow, M. (2019, April 19). Who’s Using Your Face? The Ugly Truth About Facial Recognition. Financial Times. https://www.ft.com/content/cf19b956-60a2-11e9-b285-3acd5d43599e
- Register, Y. L. (2021, July 22). It’s All Training Data: Using Lessons from Machine Learning to Retrain Your Mind. The Gradient. https://thegradient.pub/its-all-training-data/
- Solon, O. (2019, March 12). Facial Recognition’s “Dirty Little Secret”: Social Media Photos Used Without Consent. NBC News. https://www.nbcnews.com/tech/internet/facial-recognition-s-dirty-little-secret-millions-online-photos-scraped-n981921
- Solon, O., & Farivar, C. (2019, May 9). Millions of People Uploaded Photos to the Ever App. Then the Company Used Them to Develop Facial Recognition Tools. NBC News. https://www.nbcnews.com/tech/security/millions-people-uploaded-photos-ever-app-then-company-used-them-n1003371