OpenAI Models Memorized Copyrighted Content During Training, New Research Suggests
A new study suggests that some of OpenAI's AI models memorized copyrighted material during training.
The company is currently entangled in several lawsuits brought by authors and programmers, most of whom accuse OpenAI of using their material, including books and code, without consent to build new models.
OpenAI has faced such allegations for a long time and maintains that it has done nothing wrong, arguing that its models were developed under fair use. The plaintiffs disagree, countering that U.S. copyright law contains no carve-out for training data.
The research was co-authored by researchers at the University of Washington, Stanford, and the University of Copenhagen. It proposes a method for identifying training data that models have memorized, even when those models are only accessible behind an API.
In simple terms, models are trained on vast amounts of data and learn patterns from it, which is how they can generate essays, images, and more. Most of what they produce is not a verbatim copy of that data, but some of it is: image models have been caught regurgitating frames from films in their training sets, and large language models have been caught reproducing news articles.
The method in this research relies on words the co-authors call "high-surprisal": words that are statistically unusual in the context of a larger body of text. For instance, "radar" counts as high-surprisal in a sentence about something humming, because it is far less likely than words like "radio" or "engine" to appear before "humming."
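To make the idea concrete, here is a minimal sketch of how per-token surprisal can be computed with an open model (GPT-2 via Hugging Face transformers). The example sentence is purely illustrative, and the paper's own scoring setup may differ; this is not the authors' released code.

```python
# Sketch: per-token surprisal = -log2 P(token | preceding tokens), scored with GPT-2.
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def token_surprisals(text: str):
    """Return (token, surprisal-in-bits) pairs for every token after the first."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits                          # (1, seq_len, vocab_size)
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)   # predictions for tokens 1..n-1
    targets = ids[0, 1:]
    nats = -log_probs[torch.arange(len(targets)), targets]
    bits = nats / math.log(2)
    return list(zip(tokenizer.convert_ids_to_tokens(targets.tolist()), bits.tolist()))

# An unusual word in context ("radar" before "humming") should score higher
# than expected words such as "engine" or "radio".
for tok, s in token_surprisals("We sat perfectly still with the radar humming."):
    print(f"{tok!r:>12}  {s:6.2f} bits")
```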
The co-authors probed several OpenAI models, including GPT-4 and GPT-3.5, for signs of memorization. They removed high-surprisal words from snippets of fiction books and New York Times articles, then had the models try to guess which words had been masked. If a model guessed correctly, it very likely memorized that snippet during training, the co-authors concluded.
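The probe itself is simple to picture. Below is a rough sketch of that masked-word guessing step against the OpenAI API; the model name, prompt wording, and sample passage are all illustrative assumptions, not the authors' actual code or data.

```python
# Sketch of the masked-word probe: mask a high-surprisal word, ask the model
# to fill it in, and count exact recoveries as evidence of memorization.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

PROMPT = (
    "The following passage has one word replaced by [MASK]. "
    "Reply with only the single missing word.\n\nPassage: {passage}"
)

def guess_masked_word(passage_with_mask: str, model: str = "gpt-4") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(passage=passage_with_mask)}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

# Hypothetical probe items: (passage with the high-surprisal word masked, original word)
probe_items = [
    ("We sat perfectly still with the [MASK] humming.", "radar"),
]

hits = sum(
    guess_masked_word(masked).lower() == answer.lower()
    for masked, answer in probe_items
)
print(f"exact recoveries: {hits}/{len(probe_items)}")
```

A high recovery rate on passages the model should not be able to guess from context alone is what the co-authors treat as a sign of memorization.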
In the tests, GPT-4 showed signs of having memorized portions of popular fiction books, including books in BookMIA, a dataset containing samples of copyrighted ebooks. The results also suggested the model had memorized portions of New York Times articles, though at a much lower rate.
According to the study's authors, these findings indicate that contested data may well have been used to train AI models. To judge whether such systems are trustworthy, they argue, we need models that can be probed and audited scientifically.
The work provides a useful tool for probing LLMs, but the larger point is that greater transparency is needed now more than ever. OpenAI, for its part, has long advocated for fewer restrictions on training models with copyrighted data. While the company has signed a number of content licensing deals, it continues to lobby governments over the rules for AI training.
