On August 30, 2023, the United States Copyright Office (USCO) requested public input on copyright law and artificial intelligence (AI), especially recent generative AI systems.
In this Comment, the Knowing Machines Project (Knowing Machines) urges USCO to rely on research-based, empirical findings to inform its regulatory agenda and any recommendations to Congress on the open issue of using copyright-protected works to train AI models. USCO should advocate for support and funding to develop data investigation tools that can inform its assessment of training datasets for generative AI systems (GenAI) and their potential impact on the copyright system as a whole. We briefly discuss Knowing Machines' experience building See:Set, an investigatory tool for training datasets, to demonstrate some of the ways in which data investigations can provide empirical findings to support evidence-based policymaking. We also recommend that USCO study and fund the development of best practices for dataset creation, curation, recordkeeping, and maintenance.
Our main point: "We understand the difficulties of gaining a deep understanding of these training datasets firsthand. We need new investigatory methods to uncover the hidden problems inscribed in machine learning processes. Because dataset creators and AI developers lack standardized ex ante dataset transparency and recordkeeping requirements, we now rely almost exclusively on ex post data investigations for research, often unable to identify all the necessary information we need to understand datasets, especially in a copyright context. Although it is challenging, we urge the USCO to support evidence-based research concerning the nature of training datasets and their role in GenAI outputs, minimizing the influence of conjecture in the policymaking process."