How should we study datasets in machine learning? As machine learning (ML) increasingly becomes a site of sociotechnical inquiry, raising numerous social, political, legal, and ethical issues, datasets are a crucial component: they are the core material used to train models. Inspired by Tarleton Gillespie and Nick Seaver’s Critical Algorithm Studies reading list, this collection is meant to serve as an entry point to the growing literature on ML datasets across the fields of computer science, human-computer interaction, science and technology studies, media studies, and histories of technology, among others. We compiled this list primarily as a resource for researchers seeking to understand—from a variety of perspectives—how ML datasets work, do work, and are worked upon. We hope it will also be of use to technology practitioners and students seeking to build ML systems.
We limit our scope to works that focus on datasets deployed in the training and testing of ML systems; despite some overlap, this list is not a primer for the field of critical technology studies more generally. Entries are sorted into sections intended to give readers a preliminary structure for following their specific interests. We acknowledge that classificatory practice is always subjective and that many of these titles could fit appropriately under multiple sections or be named in different ways. The current iteration reflects our own ideas and what we have found helpful for organizing the emerging literature we work with. There are certainly other ways to structure this reading list, and we are open to suggestions that expand its range and improve its usability. Our focus is primarily on academic publications, but for those more interested in how datasets have been discussed in the press as of July 2022, we offer a selection of examples at the end of the reading list.
This list is also not meant to be exhaustive. We see the list as a living resource and invite readers to make suggestions and contributions via this form if there are key titles that they think should be included. Please note that while all links are functional as of July 2022, we are unable to continuously monitor for updated versions of papers or fix broken links.
Despite these limitations, we hope this reading list might serve as a useful resource for scholars and practitioners investigating ML datasets as sociotechnical assemblages that shape and are shaped by social worlds.
CONTEXTUALIZING THE STUDY OF DATASETS
This section consists of broader foundational readings that do not all deal specifically with machine learning datasets, but which we have found useful for contextualizing the study of datasets. The titles below do not form an exhaustive index of foundational readings; we point to them as particularly helpful ones for thinking about the ontological and epistemological complexities of the “dataset” as an object/genre of analysis.
a. Politics of Classification
This subsection focuses on classification as a practice of not only world-ordering, but also world-making, and how its logics underlie the ways in which datasets are conceived and built.
- Boutyline, A., & Soter, L. K. (2021). Cultural Schemas: What They Are, How to Find Them, and What to Do Once You’ve Caught One. American Sociological Review, 86(4), 728–758. https://doi.org/10.1177/00031224211024525
- Bechmann, A., & Bowker, G. C. (2019). Unsupervised by any other name: Hidden layers of knowledge production in artificial intelligence on social media. Big Data & Society, 6(1). https://doi.org/10.1177/2053951718819569
- Bowker, G. C., & Star, S. L. (2000). Sorting Things Out: Classification and Its Consequences. Cambridge, MA: MIT Press.
- Crawford, K. (2021). Atlas of AI: Power, Politics and the Planetary Costs of Artificial Intelligence, see ‘Classification’ chapter (pp. 123-150). New Haven, CT: Yale University Press.
- Fourcade, M., & Healy, K. (2013). Classification Situations: Life-Chances in the Neoliberal Era. Accounting, Organizations and Society, 38(8), 559-572. https://doi.org/10.1016/j.aos.2013.11.002.
- Goodwin, C. (2000). Practices of Color Classification. Mind, Culture, and Activity, 7(1&2), 19-36. https://doi.org/10.1080/10749039.2000.9677646
- Rieder, B. (2017). Scrutinizing an Algorithmic Technique: The Bayes Classifier as Interested Reading of Reality. Information, Communication & Society, 20(1), 100-117. https://doi.org/10.1080/1369118X.2016.1181195
- Sadre-Orafai, S. (2020). Typologies, Typifications, and Types. Annual Review of Anthropology, 49(1), 193-208. https://doi.org/10.1146/annurev-anthro-102218-011235
b. Critical Data Studies
Here, we introduce a few titles from the emerging field of Critical Data Studies that we believe are especially useful for acquiring a nuanced, interdisciplinary understanding of datasets.
- Andrejevic, M. (2019). Automated Media (1st edition). Routledge.
- Beer, D. (2018). The Data Gaze. London, UK: SAGE.
- Cheney-Lippold, J. (2017). We Are Data: Algorithms and the Making of our Digital Selves. New York, NY: NYU Press.
- Chun, W. (2021). Discriminating Data. Cambridge, MA: MIT Press.
- Cifor, M., Garcia, P., Cowan, T. L., Rault, J., Sutherland, T., Chan, A., . . . Nakamura, L. (2019). Feminist Data Manifest-No. Retrieved from https://www.manifestno.com/
- Couldry, N., & Mejias, U. A. (2019). The Costs of Connection: How Data Is Colonizing Human Life and Appropriating It for Capitalism. Stanford, CA: Stanford University Press.
- D’Ignazio, C., & Klein, L. F. (2020). Data Feminism. MIT Press.
- Gitelman, L. (Ed.). (2013). “Raw Data” Is an Oxymoron. Cambridge, MA: MIT Press.
- Hansson, K., & Dahlgren, A. (2022). Open research data repositories: Practices, norms, and metadata for sharing images. Journal of the Association for Information Science and Technology, 73(2), 303-316. https://doi.org/10.1002/asi.24571
- Iliadis, A., & Russo, F. (2016). Critical data studies: An introduction. Big Data & Society, 3(2), 1-7. https://doi.org/10.1177/2053951716674238
- Jaton, F. (2021). The Constitution of Algorithms: Ground-Truthing, Programming, Formulating. Cambridge, MA: MIT Press.
- Kitchin, R. (2021). Data Lives. Bristol, UK: Bristol University Press.
- Koopman, C. (2019). How We Became Our Data: A Genealogy of the Informational Person. Chicago, IL: University of Chicago Press.
- Thorp, J. (2021). Living in Data: A Citizen's Guide to a Better Information Future. New York, NY: MCD.
c. Methodologies for Reading Data
This final subsection includes texts that deal more specifically with the different conceptualizations and methodologies through which datasets can be studied/read/analyzed.
- boyd, d., & Crawford, K. (2012). Critical Questions for Big Data: Provocations for a Cultural, Technological, and Scholarly Phenomenon. Information, Communication & Society, 15(5), 662-679. https://doi.org/10.1080/1369118X.2012.678878
- Brock, A. (2015). Deeper Data: A Response to boyd and Crawford. Media, Culture & Society, 37(7), 1084-1088. https://doi.org/10.1177/0163443715594105
- Driscoll, K., & Walker, S. (2014). Working Within a Black Box: Transparency in the Collection and Production of Big Twitter Data. International Journal of Communication, 8, 1745–1764. https://ijoc.org/index.php/ijoc/article/view/2171/1159
- Kitchin, R. (2014). The Data Revolution: Big Data, Open Data, Data Infrastructures and Their Consequences. London, UK: SAGE.
- Leonelli, S., & Tempini, N. (Eds.). (2020). Data Journeys in the Sciences. Springer International Publishing.
- Malevé, N. (2020). On the Data Set’s Ruins. AI & Society, 36, 1117–1131. https://doi.org/10.1007/s00146-020-01093-w
- Metcalf, J., & Crawford, K. (2016). Where Are Human Subjects in Big Data Research? The Emerging Ethics Divide. Big Data & Society, 3(1), 1-14. https://doi.org/10.1177/2053951716650211
- Munk, A. K., Olesen, A. G., & Jacomy, M. (2022). The Thick Machine: Anthropological AI Between Explanation and Explication. Big Data & Society, 9(1), 1-14. https://doi.org/10.1177/20539517211069891
- Pasquale, F. (2021). Licensure as Data Governance. Knight First Amendment Institute. https://knightcolumbia.org/content/licensure-as-data-governance
- Poirier, L. (2021). Reading Datasets: Strategies for Interpreting the Politics of Data Signification. Big Data & Society, 8(2), 1-19. https://doi.org/10.1177/20539517211029322
- Suchman, L., & Trigg, R. H. (1993). Artificial Intelligence as Craftwork. In S. Chaiklin & J. Lave (Eds.), Understanding Practice (pp. 144-178). New York, NY: Cambridge University Press.
- Zook, M., Barocas, S., boyd, d., Crawford, K., Keller, E., Gangadharan, S. P., Goodman, A., Hollander, R., Koenig, B. A., Metcalf, J., Narayanan, A., Nelson, A., & Pasquale, F. (2017). Ten Simple Rules for Responsible Big Data Research. PLOS Computational Biology, 13(3), e1005399. https://doi.org/10.1371/journal.pcbi.1005399
ANALYSES OF TRAINING DATASETS
This section highlights works that analyze training datasets from a variety of methodological and theoretical perspectives. While many of the titles across the major headings of this reading list involve some form of “dataset analysis,” here we highlight studies in which the analysis itself is the thrust of the article, chapter, or work: the focus is on the details of the analysis, rather than on an analysis conducted as a preliminary step toward a more central argument or intervention.
a. Sociotechnical & Critical Studies
This subsection focuses on articles and chapters that ground their analyses of training datasets in frameworks drawn primarily from critical studies or science and technology studies.
- Bao, M., Zhou, A., Zottola, S. A., Brubach, B., Desmarais, S., Horowitz, A., Lum, K., & Venkatasubramanian, S. (2021). It’s COMPASlicated: The Messy Relationship between RAI Datasets and Algorithmic Fairness Benchmarks. ArXiv. https://arxiv.org/abs/2106.05498
- Busch, L. (2014). A Dozen Ways to Get Lost in Translation: Inherent Challenges in Large Scale Data Sets. International Journal of Communication, 8, 1727-1744. https://ijoc.org/index.php/ijoc/article/view/2160
- Coleman, C. N. (2020). Managing Bias When Library Collections Become Data. International Journal of Librarianship, 5(1), 8–19. https://doi.org/10.23974/ijol.2020.vol5.1.162
- Coveney, P. V., Dougherty, E. R., & Highfield, R. R. (2016). Big Data Need Big Theory Too. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 374, 1-11. https://doi.org/10.1098/rsta.2016.0153
- Feinberg, M. (2017). A Design Perspective on Data. Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, 2952–2963. https://doi.org/10.1145/3025453.3025837
- Jo, E. S., & Gebru, T. (2020). Lessons from Archives: Strategies for Collecting Sociocultural Data in Machine Learning. Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, 306–316. https://doi.org/10.1145/3351095.3372829
- Prabhu, V. U., & Birhane, A. (2020). Large Image Datasets: A Pyrrhic Win for Computer Vision? ArXiv. http://arxiv.org/abs/2006.16923
- Richardson, R., Schultz, J. M., & Crawford, K. (2019). Dirty Data, Bad Predictions: How Civil Rights Violations Impact Police Data, Predictive Policing Systems, and Justice. NYU Law Review, 94(15), 15–55. https://www.nyulawreview.org/online-features/dirty-data-bad-predictions-how-civil-rights-violations-impact-police-data-predictive-policing-systems-and-justice/
- Sambasivan, N., Kapania, S., Highfill, H., Akrong, D., Paritosh, P. K., & Aroyo, L. (2021). “Everyone Wants to Do the Model Work, Not the Data Work”: Data Cascades in High-Stakes AI. Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, 1-15. https://doi.org/10.1145/3411764.3445518
- Scheuerman, M. K., Denton, E., & Hanna, A. (2021). Do Datasets Have Politics? Disciplinary Values in Computer Vision Dataset Development. Proceedings of the ACM on Human-Computer Interaction, 5(CSCW2), 1-37. https://doi.org/10.1145/3476058
- Scheuerman, M. K., Paul, J. M., & Brubaker, J. R. (2019). How Computers See Gender: An Evaluation of Gender Classification in Commercial Facial Analysis Services. Proceedings of the ACM on Human-Computer Interaction, 3(CSCW), 1-33. https://doi.org/10.1145/3359246
- Scheuerman, M. K., Wade, K., Lustig, C., & Brubaker, J. R. (2020). How We’ve Taught Algorithms to See Identity: Constructing Race and Gender in Image Databases for Facial Analysis. Proceedings of the ACM on Human-Computer Interaction, 4(CSCW1), 1-35. https://doi.org/10.1145/3392866
- Smits, T., & Wevers, M. (2021). The Agency of Computer Vision Models as Optical Instruments. Visual Communication, 1-21. https://doi.org/10.1177/1470357221992097
- Stevens, N., & Keyes, O. (2021). Seeing infrastructure: Race, Facial Recognition and the Politics of Data. Cultural Studies, 35(4-5), 833-853. https://doi.org/10.1080/09502386.2021.1895252
- Trewin, S. (2018). AI Fairness for People with Disabilities: Point of View. ArXiv. http://arxiv.org/abs/1811.10670
b. Technical Approaches to Studying Datasets
Here, we introduce works that detail “technical” methods for the study of datasets. Whereas the titles in the following subsection 5c, “Technical Audits,” dissect particular datasets, the works in this subsection introduce or apply technical methods for dataset analysis in general. Many of these studies do contain audit-style analyses, but their emphasis is on the methods themselves rather than on the components of any one dataset.
- Balayn, A., Kulynych, B., & Guerses, S. (2021). Exploring Data Pipelines through the Process Lens: A Reference Model for Computer Vision. ArXiv. https://arxiv.org/abs/2107.01824
- Bender, E. M., Gebru, T., McMillan-Major, A., & Mitchell, M. (2021). On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜. Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 610–623. https://doi.org/10.1145/3442188.3445922
- Blodgett, S. L., Lopez, G., Olteanu, A., Sim, R., & Wallach, H. (2021). Stereotyping Norwegian Salmon: An Inventory of Pitfalls in Fairness Benchmark Datasets. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics, 1, 1004-1015.
- Cheng, V., Suriyakumar, V., Dullerud, N., Joshi, S., & Ghassemi, M. (2021). Can You Fake It Until You Make It?: Impacts of Differentially Private Synthetic Data on Downstream Classification Fairness. Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 149-160. https://doi.org/10.1145/3442188.3445879
- Fabbrizzi, S., Papadopoulos, S., Ntoutsi, E., & Kompatsiaris, Y. (2021). A Survey on Bias in Visual Datasets. ArXiv. https://arxiv.org/abs/2107.07919
- Gardner, M., Merrill, W., Dodge, J., Peters, M. E., Ross, A., Singh, S., & Smith, N. A. (2021). Competency Problems: On Finding and Removing Artifacts in Language Data. ArXiv. https://arxiv.org/abs/2104.08646
- Hirota, Y., Nakashima, Y., & Garcia, N. (2022). Gender and Racial Bias in Visual Question Answering Datasets. 2022 ACM Conference on Fairness, Accountability, and Transparency, 1280–1292. https://doi.org/10.1145/3531146.3533184
- Hutchinson, B., Rostamzadeh, N., Greer, C., Heller, K., & Prabhakaran, V. (2022). Evaluation Gaps in Machine Learning Practice. 2022 ACM Conference on Fairness, Accountability, and Transparency, 1859–1876. https://doi.org/10.1145/3531146.3533233
- Jung, T., Kang, D., Mentch, L., & Hovy, E. (2019). Earlier Isn’t Always Better: Sub-aspect Analysis on Corpus and System Biases in Summarization. ArXiv. http://arxiv.org/abs/1908.11723
- Kilgarriff, A., & Grefenstette, G. (2003). Introduction to the Special Issue on the Web as Corpus. Computational Linguistics, 29(3), 333–348. https://doi.org/10.1162/089120103322711569
- Koesten, L., Vougiouklis, P., Simperl, E., & Groth, P. (2020). Dataset Reuse: Toward Translating Principles to Practice. Patterns, 1(8), 100136. https://doi.org/10.1016/j.patter.2020.100136
- Laranjeira da Silva, C., Macedo, J., Avila, S., & dos Santos, J. (2022). Seeing without Looking: Analysis Pipeline for Child Sexual Abuse Datasets. 2022 ACM Conference on Fairness, Accountability, and Transparency, 2189–2205. https://doi.org/10.1145/3531146.3534636
- Madras, D., Creager, E., Pitassi, T., & Zemel, R. (2019). Fairness through Causal Awareness: Learning Causal Latent-Variable Models for Biased Data. Proceedings of the Conference on Fairness, Accountability, and Transparency, 349–358. https://doi.org/10.1145/3287560.3287564
- Moreno-Torres, J. G., Raeder, T., Alaiz-Rodríguez, R., Chawla, N. V., & Herrera, F. (2012). A Unifying View on Dataset Shift in Classification. Pattern Recognition, 45(1), 521–530. https://doi.org/10.1016/j.patcog.2011.06.019
- Olson, R. S., La Cava, W., Orzechowski, P., Urbanowicz, R. J., & Moore, J. H. (2017). PMLB: A Large Benchmark Suite for Machine Learning Evaluation and Comparison. BioData Mining, 10(36). https://doi.org/10.1186/s13040-017-0154-4
- Rabanser, S., Günnemann, S., & Lipton, Z. C. (2019). Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift. ArXiv. http://arxiv.org/abs/1810.11953
- Rieke, A., Sutherland, V., Svirsky, D., & Hsu, M. (2022). Imperfect Inferences: A Practical Assessment. 2022 ACM Conference on Fairness, Accountability, and Transparency, 767-777. https://doi.org/10.1145/3531146.3533140
- Straw, I., & Callison-Burch, C. (2020). Artificial Intelligence in Mental Health and the Biases of Language Based Models. PLOS ONE, 15(12), e0240376. https://doi.org/10.1371/journal.pone.0240376
- Welty, C., Paritosh, P., & Aroyo, L. (2019). Metrology for AI: From Benchmarks to Instruments. ArXiv. https://arxiv.org/abs/1911.01875v1
- Wesley, A. M., & Matisziw, T. C. (2021). Methods for Measuring Geodiversity in Large Overhead Imagery Datasets. IEEE Access, 9, 100279–100293. https://doi.org/10.1109/ACCESS.2021.3096034
- Zanella-Béguelin, S., Wutschitz, L., Tople, S., Rühle, V., Paverd, A., Ohrimenko, O., Köpf, B., & Brockschmidt, M. (2020). Analyzing Information Leakage of Updates to Natural Language Models. Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security, 363–375. https://doi.org/10.1145/3372297.3417880
- Zhong, R., Chen, Y., Patton, D., Selous, C., & McKeown, K. (2019). Detecting and Reducing Bias in a High Stakes Domain. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 4765–4775. https://doi.org/10.18653/v1/D19-1483
c. Technical Audits
This subsection includes works that employ technical audit-style investigations (e.g., Buolamwini & Gebru, 2018; Raji et al., 2020) of particular datasets.
- Babaeianjelodar, M., Lorenz, S., Gordon, J., Matthews, J., & Freitag, E. (2020). Quantifying Gender Bias in Different Corpora. Companion Proceedings of the Web Conference 2020, 752–759. https://doi.org/10.1145/3366424.3383559
- Bountouridis, D., Makhortykh, M., Sullivan, E., Harambam, J., Tintarev, N., & Hauff, C. (2019). Annotating Credibility: Identifying and Mitigating Bias in Credibility Datasets. ROME 2019 - Workshop on Reducing Online Misinformation Exposure. https://rome2019.github.io/papers/Bountouridis_etal_ROME2019.pdf
- Buolamwini, J., & Gebru, T. (2018). Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification. Proceedings of the 1st Conference on Fairness, Accountability and Transparency, PMLR 81, 77-91. https://www.media.mit.edu/publications/gender-shades-intersectional-accuracy-disparities-in-commercial-gender-classification/
- Costanza-Chock, S., Raji, I. D., & Buolamwini, J. (2022). Who Audits the Auditors? Recommendations from a field scan of the algorithmic auditing ecosystem. 2022 ACM Conference on Fairness, Accountability, and Transparency, 1571–1583. https://doi.org/10.1145/3531146.3533213
- Davidson, T., Bhattacharya, D., & Weber, I. (2019). Racial Bias in Hate Speech and Abusive Language Detection Datasets. ArXiv. http://arxiv.org/abs/1905.12516
- Dulhanty, C., & Wong, A. (2019). Auditing ImageNet: Towards a Model-driven Framework for Annotating Demographic Attributes of Large-Scale Image Datasets. ArXiv. http://arxiv.org/abs/1905.01347
- Dulhanty, C., & Wong, A. (2020). Investigating the Impact of Inclusion in Face Recognition Training Data on Individual Face Identification. Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, 244–250. https://doi.org/10.1145/3375627.3375875
- Dulhanty, C. (2020). Issues in Computer Vision Data Collection: Bias, Consent, and Label Taxonomy [University of Waterloo]. https://uwspace.uwaterloo.ca/handle/10012/16414
- Heinzerling, B. (2019, July 21). NLP’s Clever Hans Moment has Arrived. Benjamin Heinzerling. https://bheinzerling.github.io/post/clever-hans/
- Hutchinson, B., Prabhakaran, V., Denton, E., Webster, K., Zhong, Y., & Denuyl, S. (2020). Social Biases in NLP Models as Barriers for Persons with Disabilities. ArXiv. http://arxiv.org/abs/2005.00813
- Klockmann, V., von Schenk, A., & Villeval, M. C. (2021). Artificial Intelligence, Ethics, and Diffused Pivotality. Working Paper Series, GATE. https://ssrn.com/abstract=3853829
- Luccioni, A., & Viviano, J. (2021). What’s in the Box? A Preliminary Analysis of Undesirable Content in the Common Crawl Corpus. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, 182-189. https://aclanthology.org/2021.acl-short.24.pdf
- Mecati, M., Cannavò, F. E., Vetrò, A., & Torchiano, M. (2020). Identifying Risks in Datasets for Automated Decision-Making. In G. Viale Pereira, M. Janssen, H. Lee, I. Lindgren, M. P. Rodríguez Bolívar, H. J. Scholl, & A. Zuiderwijk (Eds.), Electronic Government (pp. 332–344). Springer International Publishing. https://doi.org/10.1007/978-3-030-57599-1_25
- Raji, I. D., & Fried, G. (2021). About Face: A Survey of Facial Recognition Evaluation. ArXiv. http://arxiv.org/abs/2102.00813
- Raji, I. D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., & Denton, E. (2020). Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing. ArXiv. http://arxiv.org/abs/2001.00964
- Rambachan, A., & Roth, J. (2020). Bias In, Bias Out? Evaluating the Folk Wisdom. 1st Symposium on Foundations of Responsible Computing (FORC 2020). https://doi.org/10.4230/LIPIcs.FORC.2020.6
- Shankar, S., Halpern, Y., Breck, E., Atwood, J., Wilson, J., & Sculley, D. (2017). No Classification without Representation: Assessing Geodiversity Issues in Open Data Sets for the Developing World. ArXiv. https://arxiv.org/abs/1711.08536
- Vidgen, B., & Derczynski, L. (2020). Directions in Abusive Language Training Data: Garbage In, Garbage Out. ArXiv. https://arxiv.org/abs/2004.01670
- Wang, T., Zhao, J., Yatskar, M., Chang, K.-W., & Ordonez, V. (2019). Balanced Datasets Are Not Enough: Estimating and Mitigating Gender Bias in Deep Image Representations. 2019 IEEE/CVF International Conference on Computer Vision (ICCV). https://doi.org/10.1109/ICCV.2019.00541
d. Visual & Artistic Approaches to Datasets
This final subsection assembles artistic and visual approaches/formats for the analysis of datasets.
- Baker, D. (2022). Datasets Have Worldviews [Website]. PAIR Explorables. https://pair.withgoogle.com/explorables/dataset-worldviews/
- Crawford, K., & Paglen, T. (2019). Training Humans [Large-scale exhibition]. Fondazione Prada, Milan, 2019-2020. https://www.fondazioneprada.org/project/training-humans/?lang=en; publication: Training Humans (book).
- Dewey-Hagborg, H. (2019). How Do You See Me? [Adversarial processes]. The Photographer’s Gallery, London, UK. https://thephotographersgallery.org.uk/whats-on/heather-dewey-hagborg-how-do-you-see-me
- Malevé, N. (2019). 12 hours of ImageNet [Computer script]. The Photographer’s Gallery, London, UK. https://thephotographersgallery.org.uk/whats-on/exhibiting-imagenet
- Paglen, T., & Crawford, K. (2019). ImageNet Roulette [Software program]. Launched at SXSW. https://www.youtube.com/watch?v=S0yEPZJnvgs
- Pipkin, E. (2020). On Lacework: Watching an Entire Machine-Learning Dataset. Unthinking Photography. https://unthinking.photography/articles/on-lacework
- Ridler, A. (2018). Myriad (Tulips) [C-type digital prints with handwritten annotations, magnetic paint, magnets]. Barbican Centre, London, UK. http://annaridler.com/myriad-tulips
RESPONSES TO DATASET PROBLEMS
Here we assemble literature that proposes responses to commonly identified sociotechnical problems with ML datasets. Most of the articles in this vein focus on technical responses to bias (writ broadly), while a few address other concerns such as privacy and security. We do not necessarily endorse these approaches; rather, this is a loose mapping of emerging areas of focus. Note that there is some overlap with the readings suggested in Section 5, as many of these papers also investigate particular datasets; the papers listed here, however, emphasize approaches to addressing specific problems.
a. General Recommendations for Dataset Design
This subsection covers miscellaneous broad recommendations for the creation of fairer and more accountable datasets.
- Andrus, M., & Villeneuve, S. (2022). Demographic-Reliant Algorithmic Fairness: Characterizing the Risks of Demographic Data Collection in the Pursuit of Fairness. 2022 ACM Conference on Fairness, Accountability, and Transparency, 1709–1721. https://doi.org/10.1145/3531146.3533226
- Bilstrup, K.-E. K., Kaspersen, M. H., Assent, I., Enni, S., & Petersen, M. G. (2022). From Demo to Design in Teaching Machine Learning. 2022 ACM Conference on Fairness, Accountability, and Transparency, 2168–2178. https://doi.org/10.1145/3531146.3534634
- Bowman, S. R., & Dahl, G. E. (2021). What Will it Take to Fix Benchmarking in Natural Language Understanding? Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. https://doi.org/10.18653/V1/2021.NAACL-MAIN.385
- Boyd, K. (2022). Designing Up with Value-Sensitive Design: Building a Field Guide for Ethical ML Development. 2022 ACM Conference on Fairness, Accountability, and Transparency, 2069–2082. https://doi.org/10.1145/3531146.3534626
- Kiela, D., Bartolo, M., Nie, Y., Kaushik, D., Geiger, A., Wu, Z., Vidgen, B., Prasad, G., Singh, A., Ringshia, P., Ma, Z., Thrush, T., Riedel, S., Waseem, Z., Stenetorp, P., Jia, R., Bansal, M., Potts, C., & Williams, A. (2021). Dynabench: Rethinking Benchmarking in NLP. Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. https://doi.org/10.18653/V1/2021.NAACL-MAIN.324
- Panch, T., Pollard, T. J., Mattie, H., Lindemer, E., Keane, P. A., & Celi, L. A. (2020). “Yes, But Will It Work for My Patients?” Driving Clinically Relevant Research with Benchmark Datasets. Npj Digital Medicine, 3(1), 1–4. https://doi.org/10.1038/s41746-020-0295-6
- Peng, K., Mathur, A., & Narayanan, A. (2021). Mitigating Dataset Harms Requires Stewardship: Lessons from 1000 Papers. ArXiv. http://arxiv.org/abs/2108.02922
- Rogers, A. (2021). Changing the World by Changing the Data. ArXiv. https://arxiv.org/abs/2105.13947
- Rolf, E., Worledge, T., Recht, B., & Jordan, M. I. (2021). Representation Matters: Assessing the Importance of Subgroup Allocations in Training Data. ArXiv. https://arxiv.org/abs/2103.03399
- Selbst, A. D., Boyd, D., Friedler, S. A., Venkatasubramanian, S., & Vertesi, J. (2019). Fairness and Abstraction in Sociotechnical Systems. Proceedings of the Conference on Fairness, Accountability, and Transparency, 59–68. https://doi.org/10.1145/3287560.3287598
- Suresh, H., Movva, R., Dogan, A. L., Bhargava, R., Cruxên, I., Martinez Cuba, A., Taurino, G., So, W., & D’Ignazio, C. (2022). Towards Intersectional Feminist and Participatory ML: A Case Study in Supporting Femicide Counterdata Collection. 2022 ACM Conference on Fairness, Accountability, and Transparency, 667-678. https://doi.org/10.1145/3531146.3533132
- Stasaski, K., Yang, G. H., & Hearst, M. A. (2020). More Diverse Dialogue Datasets via Diversity-Informed Data Collection. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 4958–4968. https://doi.org/10.18653/v1/2020.acl-main.446
b. Creating New Datasets and/or Remediation of Existing Datasets
This subsection includes articles that either remediate specific existing datasets or detail the creation of alternative datasets to address identified privacy and bias issues.
- Asano, Y., Rupprecht, C., Zisserman, A., & Vedaldi, A. (2021). PASS: An ImageNet Replacement for Self-Supervised Pretraining Without Humans. ArXiv. https://arxiv.org/abs/2109.13228
- Brown, H., Lee, K., Mireshghallah, F., Shokri, R., & Tramèr, F. (2022). What Does it Mean for a Language Model to Preserve Privacy? 2022 ACM Conference on Fairness, Accountability, and Transparency, 2280–2292. https://doi.org/10.1145/3531146.3534642
- Cai, W., Encarnacion, R., Chern, B., Corbett-Davies, S., Bogen, M., Bergman, S., & Goel, S. (2022). Adaptive Sampling Strategies to Construct Equitable Training Datasets. 2022 ACM Conference on Fairness, Accountability, and Transparency, 1467–1478. https://doi.org/10.1145/3531146.3533203
- Jernite, Y., Nguyen, H., Biderman, S., Rogers, A., Masoud, M., Danchev, V., Tan, S., Luccioni, A. S., Subramani, N., Johnson, I., Dupont, G., Dodge, J., Lo, K., Talat, Z., Radev, D., Gokaslan, A., Nikpoor, S., Henderson, P., Bommasani, R., & Mitchell, M. (2022). Data Governance in the Age of Large-Scale Data-Driven Language Technology. 2022 ACM Conference on Fairness, Accountability, and Transparency, 2206–2222. https://doi.org/10.1145/3531146.3534637
- Khashabi, D., Chaturvedi, S., Roth, M., Upadhyay, S., & Roth, D. (2018). Looking Beyond the Surface: A Challenge Set for Reading Comprehension over Multiple Sentences. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 1, 252–262. https://doi.org/10.18653/v1/N18-1023
- Yang, K., Qinami, K., Fei-Fei, L., Deng, J., & Russakovsky, O. (2020). Towards Fairer Datasets: Filtering and Balancing the Distribution of the People Subtree in the ImageNet Hierarchy. FAT* '20: Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, 547-558. https://doi.org/10.1145/3351095.3375709
- Yang, K., Yau, J., Fei-Fei, L., Deng, J., & Russakovsky, O. (2021). A Study of Face Obfuscation in ImageNet. ArXiv. https://arxiv.org/abs/2103.06191
- Zellers, R., Bisk, Y., Schwartz, R., & Choi, Y. (2018). SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference. ArXiv. https://arxiv.org/abs/1808.05326v1
c. Data Annotation Workflows
Articles in this subsection address biased machine learning datasets by proposing changes to data annotation processes.
- Barbosa, N. M., & Chen, M. (2019). Rehumanized Crowdsourcing: A Labeling Framework Addressing Bias and Ethics in Machine Learning. Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, 1–12. http://doi.org/10.1145/3290605.3300773
- Beretta, E., Vetrò, A., Lepri, B., & De Martin, J. C. (2021). Detecting Discriminatory Risk Through Data Annotation Based on Bayesian Inferences. Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency. https://doi.org/10.1145/3442188.3445940
- Beretta, E., Vetrò, A., Lepri, B., & De Martin, J. C. (2019). Ethical and Socially-Aware Data Labels. In J. A. Lossio-Ventura, D. Muñante, & H. Alatrista-Salas (Eds.), Information Management and Big Data, 320–327. Springer International Publishing. https://doi.org/10.1007/978-3-030-11680-4_30
- Rateike, M., Majumdar, A., Mineeva, O., Gummadi, K. P., & Valera, I. (2022). Don’t Throw it Away! The Utility of Unlabeled Data in Fair Decision Making. 2022 ACM Conference on Fairness, Accountability, and Transparency, 1421–1433. https://doi.org/10.1145/3531146.3533199
d. Data Augmentation
Articles in this subsection offer approaches to reducing bias in datasets by changing their composition via techniques such as oversampling or the use of synthetic/pseudo-data.
- Iosifidis, V., & Ntoutsi, E. (2018). Dealing with Bias via Data Augmentation in Supervised Learning Scenarios. http://ceur-ws.org/Vol-2103/paper_5.pdf
- Pastaltzidis, I., Dimitriou, N., Quezada-Tavarez, K., Aidinlis, S., Marquenie, T., Gurzawska, A., & Tzovaras, D. (2022). Data augmentation for fairness-aware machine learning: Preventing algorithmic bias in law enforcement systems. 2022 ACM Conference on Fairness, Accountability, and Transparency, 2302–2314. https://doi.org/10.1145/3531146.3534644
- Sharma, S., Zhang, Y., Ríos Aliaga, J. M., Bouneffouf, D., Muthusamy, V., & Varshney, K. R. (2020). Data Augmentation for Discrimination Prevention and Bias Disambiguation. Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, 358–364. https://doi.org/10.1145/3375627.3375865
- Tomalin, M., Byrne, B., Concannon, S., Saunders, D., & Ullmann, S. (2021). The Practical Ethics of Bias Reduction in Machine Translation: Why Domain Adaptation is Better than Data Debiasing. Ethics and Information Technology, 23, 419–433. https://doi.org/10.1007/s10676-021-09583-1
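As a minimal illustration of the oversampling idea this subsection describes (a generic sketch, not the method of any paper listed above), underrepresented groups can be resampled with replacement until every group matches the size of the largest one. The function name and toy data are, of course, purely illustrative:

```python
import random

def oversample_by_group(examples, group_key):
    """Balance a dataset by duplicating examples from smaller groups
    until every group matches the size of the largest group."""
    groups = {}
    for ex in examples:
        groups.setdefault(ex[group_key], []).append(ex)
    target = max(len(members) for members in groups.values())
    balanced = []
    for members in groups.values():
        balanced.extend(members)
        # Resample with replacement to close the gap to the largest group.
        balanced.extend(random.choices(members, k=target - len(members)))
    return balanced

# Toy example: group "a" has 6 examples, group "b" only 2.
data = [{"group": "a", "x": 1}] * 6 + [{"group": "b", "x": 2}] * 2
balanced = oversample_by_group(data, "group")
```

Simple duplication like this equalizes group counts but adds no new information; that limitation is what motivates the synthetic- and pseudo-data approaches in the entries above.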
e. Bias Detection
This subsection gathers tools and approaches for detecting bias in datasets.
- Chapman, A., Grylls, P., Ugwudike, P., Gammack, D., & Ayling, J. (2022). A Data-Driven Analysis of the Interplay Between Criminology Theory and Predictive Policing Algorithms. 2022 ACM Conference on Fairness, Accountability, and Transparency, 36–45. https://doi.org/10.1145/3531146.3533071
- Goyal, P., Romero Soriano, A., Hazirbas, C., Sagun, L., & Usunier, N. (2022). Fairness Indicators for Systematic Assessments of Visual Feature Extractors. 2022 ACM Conference on Fairness, Accountability, and Transparency, 70–88. https://doi.org/10.1145/3531146.3533074
- Harris, C., Halevy, M., Howard, A., Bruckman, A., & Yang, D. (2022). Exploring the Role of Grammar and Word Choice in Bias Toward African American English (AAE) in Hate Speech Classification. 2022 ACM Conference on Fairness, Accountability, and Transparency, 789–798. https://doi.org/10.1145/3531146.3533144
- Hu, X., Wang, H., Vegesana, A., Dube, S., Yu, K., Kao, G., Chen, S.-H., Lu, Y.-H., Thiruvathukal, G. K., & Yin, M. (2020). Crowdsourcing Detection of Sampling Biases in Image Datasets. Proceedings of The Web Conference 2020, 2955–2961. https://doi.org/10.1145/3366423.3380063
- Leavy, S., Meaney, G., Wade, K., & Greene, D. (2020). Mitigating Gender Bias in Machine Learning Data Sets. In L. Boratto, S. Faralli, M. Marras, & G. Stilo (Eds.), Bias and Social Aspects in Search and Recommendation, 12–26. Springer International Publishing. https://doi.org/10.1007/978-3-030-52485-2_2
- Pahl, J., Rieger, I., Möller, A., Wittenberg, T., & Schmid, U. (2022). Female, White, 27? Bias Evaluation on Data and Algorithms for Affect Recognition in Faces. 2022 ACM Conference on Fairness, Accountability, and Transparency, 973–987. https://doi.org/10.1145/3531146.3533159
- Srinivasan, R., & Chander, A. (2019). Understanding Bias in Datasets using Topological Data Analysis. http://ceur-ws.org/Vol-2419/paper_9.pdf
- Verma, S., Ernst, M., & Just, R. (2021). Removing Biased Data to Improve Fairness and Accuracy. ArXiv. https://arxiv.org/abs/2102.03054
- Wang, A., Barocas, S., Laird, K., & Wallach, H. (2022). Measuring Representational Harms in Image Captioning. 2022 ACM Conference on Fairness, Accountability, and Transparency, 324–335. https://doi.org/10.1145/3531146.3533099
- Wang, A., Narayanan, A., & Russakovsky, O. (2020). REVISE: A Tool for Measuring and Mitigating Bias in Visual Datasets. ECCV, 733–751. https://doi.org/10.1007/978-3-030-58580-8_43
- Wang, A., Ramaswamy, V. V., & Russakovsky, O. (2022). Towards Intersectionality in Machine Learning: Including More Identities, Handling Underrepresentation, and Performing Evaluation. 2022 ACM Conference on Fairness, Accountability, and Transparency, 336–349. https://doi.org/10.1145/3531146.3533101
- Zamfirescu-Pereira, J. D., Chen, J., Wen, E., Koenecke, A., Garg, N., & Pierson, E. (2022). Trucks Don’t Mean Trump: Diagnosing Human Error in Image Analysis. 2022 ACM Conference on Fairness, Accountability, and Transparency, 799–813. https://doi.org/10.1145/3531146.3533145
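One of the simplest checks the tools in this subsection build on is measuring how far each group's share of a dataset deviates from parity. A minimal sketch of that idea (the `representation_skew` function and its uniform-parity baseline are illustrative assumptions, not drawn from any entry above):

```python
from collections import Counter

def representation_skew(group_labels):
    """Report each group's share of the dataset and its deviation
    from a uniform (equal-share) baseline."""
    counts = Counter(group_labels)
    n = len(group_labels)
    parity = 1 / len(counts)  # equal share for every observed group
    return {
        group: {"share": count / n, "skew": count / n - parity}
        for group, count in counts.items()
    }

# Toy example: a dataset annotated with two groups, 3:1 imbalanced.
report = representation_skew(["f", "f", "f", "m"])
```

Real audits go well beyond representation counts (to label quality, co-occurrence patterns, and downstream model behavior, as the entries above show), but a skew report like this is often the first diagnostic.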
f. Algorithms to Debias Datasets or Mitigate Bias
Research in this subsection deploys algorithmic techniques to either debias datasets before training ML models on them or intervene to mitigate bias after training.
- Abbasi-Sureshjani, S., Raumanns, R., Michels, B. E. J., Schouten, G., & Cheplygina, V. (2020). Risk of Training Diagnostic Algorithms on Data with Demographic Bias. In J. Cardoso et al. (Eds.), Interpretable and Annotation-Efficient Learning for Medical Image Computing, 183–192. Springer. https://doi.org/10.1007/978-3-030-61166-8_20
- Almuzaini, A. A., Bhatt, C. A., Pennock, D. M., & Singh, V. K. (2022). ABCinML: Anticipatory Bias Correction in Machine Learning Applications. 2022 ACM Conference on Fairness, Accountability, and Transparency, 1552–1560. https://doi.org/10.1145/3531146.3533211
- Anahideh, H., Asudeh, A., & Thirumuruganathan, S. (2021). Fair Active Learning. ArXiv. https://arxiv.org/abs/2001.01796
- Bolukbasi, T., Chang, K.-W., Zou, J., Saligrama, V., & Kalai, A. (2016). Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings. ArXiv. https://arxiv.org/abs/1607.06520
- Hendricks, L. A., Burns, K., Saenko, K., Darrell, T., & Rohrbach, A. (2018). Women Also Snowboard: Overcoming Bias in Captioning Models. ECCV, 771–787. https://openaccess.thecvf.com/content_ECCV_2018/html/Lisa_Anne_Hendricks_Women_also_Snowboard_ECCV_2018_paper.html
- Lum, K., Zhang, Y., & Bower, A. (2022). De-Biasing “Bias” Measurement. 2022 ACM Conference on Fairness, Accountability, and Transparency, 379–389. https://doi.org/10.1145/3531146.3533105
- Reimers, C., Bodesheim, P., Runge, J., & Denzler, J. (2021). Towards Learning an Unbiased Classifier from Biased Data via Conditional Adversarial Debiasing. ArXiv. https://arxiv.org/abs/2103.06179
- Ryu, H. J., Mitchell, M., & Adam, H. (2017). InclusiveFaceNet: Improving Face Attribute Detection with Race and Gender Diversity. ArXiv. https://arxiv.org/abs/1712.00193
- Schick, T., Udupa, S., & Schütze, H. (2021). Self-Diagnosis and Self-Debiasing: A Proposal for Reducing Corpus-Based Bias in NLP. ArXiv. https://arxiv.org/abs/2103.00453
- Sikdar, S., Lemmerich, F., & Strohmaier, M. (2022). GetFair: Generalized Fairness Tuning of Classification Models. 2022 ACM Conference on Fairness, Accountability, and Transparency, 289–299. https://doi.org/10.1145/3531146.3533094
- Zhao, J., Wang, T., Yatskar, M., Ordonez, V., & Chang, K.-W. (2017). Men Also Like Shopping: Reducing Gender Bias Amplification Using Corpus-level Constraints. ArXiv. https://arxiv.org/abs/1707.09457
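The word-embedding debiasing of Bolukbasi et al. (2016), cited above, rests on a simple geometric step: projecting out a vector's component along an identified bias direction. The sketch below shows only that projection in a toy 2-D setting; the function name is ours, and identifying the bias direction (which the paper does via PCA over definitional word pairs) is assumed rather than shown:

```python
import numpy as np

def neutralize(vector, bias_direction):
    """Remove a vector's component along a bias direction
    (the 'neutralize' projection step of hard debiasing)."""
    b = bias_direction / np.linalg.norm(bias_direction)  # unit-normalize
    return vector - np.dot(vector, b) * b

# Toy example: with the x-axis as the bias direction, debiasing
# zeroes out the x-component and leaves the rest untouched.
v = np.array([3.0, 4.0])
debiased = neutralize(v, np.array([1.0, 0.0]))
```

After this step the debiased vector is orthogonal to the bias direction; several entries in this subsection (e.g., Zhao et al. 2017) argue that such geometric fixes can leave bias recoverable elsewhere in the representation, which is part of why this literature keeps growing.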