The shape of AI accountability and its contours in copyright

Johnny Tian-Zheng Wei

University of Southern California

The NLP Reading Group is excited to host Johnny Tian-Zheng Wei from the University of Southern California, who will give a talk on memorization, test-set contamination, and the broader contours of AI accountability in copyright.

Logistics

Date: Friday, March 20
Time: 2 PM
Location: on Google Meet, to be screencast at Mila in room A14

Abstract

How do we establish accountability for AI? While the shape of AI accountability at large remains amorphous, its contours are revealed in the ongoing copyright challenge to AI. In this talk, I’ll outline a legal theory of change and situate two works within it. The first work focuses on the legal setup, theorizing how the judiciary can establish copyright accountability by examining how LLM training decisions affect a model’s memorization. Further progress then depends on deriving best practices for measuring and mitigating undesirable memorization. The second work focuses on the scientific follow-up: our release of Hubble, a model suite to advance the study of LLM memorization. Hubble models are trained on English text, but also with controlled insertions of text designed to emulate key memorization risks. I’ll summarize the main findings and highlight the potential of controlled insertions for safety-critical concerns beyond copyright. Finally, I will present preliminary results on using these insertions to statistically adjust evaluation results under test-set contamination.

Speaker Bio

Johnny Tian-Zheng Wei is a final-year PhD student at USC, and his interdisciplinary research spans machine learning, statistics, and law. He has published in a range of conferences, including AIES, FAccT, ICLR, and ACL, and recently led the open-source release of Hubble, which was supported by NVIDIA through the NAIRR pilot program. He also co-organized the First Workshop on LLM Memorization at ACL 2025.