"For there to be a market, there has to be something valuable to exchange"
No authors, no books, no training data
It seems like lately everything I do ends up being about generative AI—even when it’s also about something else. This weekend, I published a piece in The Boston Globe about my mother. It didn’t start out as a Mother’s Day piece. It started after I typed my mother’s name into the LibGen search tool The Atlantic created back in March. When one of her two books popped up, I started thinking about the experience of watching her work on that book—and more broadly of watching so many people I know write, rewrite, recast, and rethink their projects. When I wrote this, I wanted to capture something other than the argument about whether training data is subject to copyright laws, although I did write about that. When we read books, we’re somewhat aware that they’ve been written. But how much does that matter? What is the value of a life’s work? What changes when we read “output”?
The Globe has a paywall, so I’m sharing the full piece below.
When my mother died of cancer 20 years ago at age 62, she left behind a family that was not ready to lose her — and two academic history books that had been published by NYU Press. Academic books tend to be read by a fairly specialized (and small) audience, so I was surprised this spring — although I probably shouldn’t have been — to find one of her books in LibGen, the pirated document collection that Meta used to train its generative AI tool, Llama, and that OpenAI has used to train ChatGPT.
Although we don’t know which books in the database ultimately made it into Meta’s AI training data, we do know from documents in a lawsuit filed by a group of authors, including Sarah Silverman and Ta-Nehisi Coates, that Meta explored ways to acquire books by buying the rights to them before instead using LibGen’s collection of copyrighted works. They were in a hurry to get access to books; as one Meta employee wrote, the goal was to “get as much long form writing as possible in the next 4-6 weeks.”
It takes a lot longer than four to six weeks to write your own book. My mother, Linda Rosenzweig, worked on her first book for several years, spent much longer acquiring the expertise to write it, and didn’t publish it until she was 51 years old. And the story of how she came to write that book, and why it took so long, began well before she put a word on the page.
She was born in the 1940s at a time when most women were expected to get married, not write books. She loved history, and as a teen she dreamed of leaving Pittsburgh to attend a school like Vassar or Radcliffe. But her parents could only afford to send one child away to college — and they thought it was more important to send her brother.
Like many women of that generation, my mother did get married — right after she graduated from a local college — and she also became a teacher. After my sister was born, my mother finally took a step toward the bigger career she had always wanted and began a PhD program, which eventually led to a job as a professor at the same small women’s college she had attended. But it wasn’t until I was in college that she was able to begin the project she had by then long wanted to take on — to write a book about the history of the mother-daughter relationship. To conduct research for that book, she applied for grants and spent weeks at the Schlesinger Library at Radcliffe. It wasn’t the same as going to college there, but she did stay on campus, walking the same streets of Cambridge that she would have walked if she had been a student there years earlier.
I know what this experience was like for my mother because I visited her on one of those research trips, and over the next few years I read drafts of both of her books, printing out the pages of each chapter and writing my comments by hand. When I visited her in Cambridge, we sat in restaurants in Harvard Square talking about what she’d found that day in the archives. Writing was hard for her, as it is for most writers. She also had to overcome gaps in her training and gaps in her confidence before she got to the point of holding that published book in her hands.
The authors’ lawsuit against Meta, like a similar lawsuit against OpenAI, turns on questions of copyright and fair use — on whether writers should be compensated for the work that companies training large language models have used — and whether those companies had the right to use their work in the first place. In a statement on its website, OpenAI has argued that using copyrighted works to train large language models is fair use because “just as humans obtain a broad education to learn how to solve new problems, we want our AI models to observe the range of the world’s information, including from every language, culture, and industry.”
But when a book is used to train an LLM, the AI model doesn’t “read” the book the way we would read the book — and it doesn’t obtain the kind of education or life experience that led my mother to write her book before it generates output. It’s not observing “the range of the world’s information” in the way humans observe the world. Instead, when books are used to train an AI model, the words are turned into numerical representations that capture their relationships with other words, creating a system that predicts the likely next word in a string of words. To train these systems, companies like Meta use millions of pages of text authored by humans.
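To make that process concrete, here is a toy sketch of the idea. This is an illustration of the principle only, not any company’s actual training pipeline: real models use neural networks trained on billions of words, and the corpus and word counts below are invented for the example. But the basic move is the same: words become numbers, and the numbers are used to predict what word comes next.

```python
# Toy illustration: words become numerical IDs, and the "model" is just
# a table of which ID tends to follow which. Real LLMs replace the table
# with a neural network, but the word-to-number-to-prediction shape holds.
from collections import Counter, defaultdict

corpus = (
    "the mother wrote the book and the daughter read the book "
    "the mother revised the book"
).split()

# Step 1: assign each word a numerical ID (a stand-in for tokenization).
vocab = {word: i for i, word in enumerate(dict.fromkeys(corpus))}
id_to_word = {i: w for w, i in vocab.items()}

# Step 2: count which ID follows which (a stand-in for learned weights).
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[vocab[prev]][vocab[nxt]] += 1

def predict_next(word: str) -> str:
    """Return the word that most often followed `word` in the corpus."""
    next_id = follows[vocab[word]].most_common(1)[0][0]
    return id_to_word[next_id]

print(predict_next("the"))  # "book" follows "the" most often here
```

The point of the sketch is what it leaves out: at no stage does the program understand the sentences, know why they were written, or have any experience behind them. It only tracks statistical relationships among the numbers.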
Lawyers for Meta have argued that individual authors should not be compensated for the use of their books to train large language models because “for there to be a market, there must be something of value to exchange, but none of Plaintiffs’ works has economic value, individually, as training data.” And yet, of course, without all of those authors, there are no books. And without all of those books, there is no training data.
I found my mother’s book in LibGen by using a search tool created by The Atlantic. You have likely never read my mother’s book, but if you search the collection, you will find many books you have read. The story of how my mother came to write her book is just one story about one author. But it’s also a piece of a much bigger story. When AI companies train their models by using books without permission from the authors, they’re not just stealing the words from the page; they’re devaluing the hours that writers spend thinking, drafting, and revising and the perseverance it takes to bring those books into existence.
Whether using my mother’s book — or any other book — as training data constitutes fair use under copyright law will be decided by the courts. But what the courts won’t rule on is what it will mean for reading and writing if we agree to see books as no more than strings of words to be chopped up and transformed into data for AI models. That is something we’ll have to decide for ourselves.
Thanks for reading this one.
Jane, love this article and thank you so much for taking the time to write it and not string a bunch of words together. It's one thing to use an AI agent and upload your material versus the material being taken from you unaware.
This is a beautiful reflection - thanks for writing and sharing it, Jane.
Your final idea is thought-provoking ("But what the courts won’t rule on is what it will mean for reading and writing if we agree to see books as no more than strings of words to be chopped up and transformed into data for AI models.") I wonder: What stops us from seeing books as *both* the precious fruits of individual human cognitive effort and expression *and* as strings of words to be fed to models to make them stronger and more amazing in their abilities? The nightmare scenario is that we give up on reading books, preferring instead to read the outputs of generative AI. But I think humans have the hunger and the curiosity to know what other humans think, a hunger and a curiosity that won't be sated by gen AI. At least I hope.