Secretive Chatbot Developers Are Making a Big Mistake

The clash between content creators and tech giants is likely to intensify.

Bloomberg News

July 27, 2023

(Bloomberg Opinion/Dave Lee) -- Tired of seeing their hard work pilfered by the tech sector’s artificial intelligence giants, the creative industry is starting to fight back. While on the surface its argument is about the principle of copyright, what the clash reveals is just how little we know about the data behind breakthrough tech like ChatGPT. The lack of transparency is getting worse, and it stands in the way of creatives being fairly paid and, ultimately, of AI being safe.

A trickle of legal challenges against AI companies could soon become a downpour. Media conglomerate IAC is reported to be teaming up with large publishers including the New York Times in a lawsuit alleging the improper use of their content to build AI-powered chatbots.

One reading of this is that publishers are running scared. The threat AI poses to their businesses is obvious: People who might have once read a newspaper’s restaurant reviews may now choose to ask an AI chatbot where to go to dinner, and so on.

But the bigger factor is that publishers are beginning to understand their value in the age of AI, albeit somewhat after the horse has bolted. AI models are only as good as the data put in them. Text and images produced by leading media organizations in theory should be of high quality and help AI tools like ChatGPT generate better results. If AI companies want to use great articles and photography, created by real people, they should be paying for the privilege. So far, for the most part, they haven’t been.

Forcing them to change is going to prove difficult, thanks to some willful acts of obfuscation. As AI has grown more sophisticated, transparency has taken a back seat. In a distinct departure from the early days of machine-learning research, when teams of computer scientists, such as the Transformer 8, went into intricate detail over the training data, leading AI developers are now using vague language about their sources.

OpenAI’s GPT-4 is trained “using publicly available data [such as internet data] as well as data we’ve licensed,” the company explained in its release notes for the model, revealing little else. Meta’s equivalent, the newly released Llama 2, was similarly vague. The company said it had been trained on a “new mix of data from publicly available sources.”

Contrast that with what Meta said in February when it unveiled the first version of Llama. Then, it broke down in a spreadsheet the various sources that had been used: 4.5% of the dataset, for example, consisted of 83 gigabytes-worth of Wikipedia articles, in 20 languages, scraped between June and August 2022.

Those old disclosures were enough to provoke two recent class action lawsuits fronted by comedian Sarah Silverman and two other authors. They argue that even those vague early descriptions from OpenAI and Meta about sources raised the likelihood the companies used the writers’ books without permission.

But it isn’t an exact science: Getting to the bottom of where training data for AI comes from is like unstacking a Russian nesting doll. By the time data is picked up by a company like OpenAI, it may have been gathered and processed by any number of smaller groups. Accountability becomes a lot more difficult. 

In the search for common-sense regulation on AI, insisting on transparency seems like a straightforward place to start. Only by understanding what is in datasets can we begin to tackle the next step of limiting the potential harm of the technology. Knowing more about the data reveals not only the owners of that content but also any inherent flaws within, allowing outsiders to examine it for bias or blind spots.

Plus, only by supporting the economy that creates content can more of it be sustainably made. The risk of “inbreeding” — where AI-generated text ends up training future models — could exacerbate quality control issues within large language models. “If they bankrupt the creative industry, they’ll end up bankrupting themselves,” said Matthew Butterick, one of the attorneys behind the Silverman effort. 

At a White House meeting last week, seven of the largest AI companies agreed to voluntary measures around safety, security and trust. Included were smart suggestions on pre-release testing, cybersecurity and disclosures to the end user on when something has been made by AI.

All good ideas. But what’s urgently needed are laws requiring standardized disclosures on what data sources have been used to train large language models. Otherwise, the pledges to avoid the same mistakes made with social media, when “black box” algorithms caused great societal damage, ring hollow. Senate Majority Leader Chuck Schumer is preparing sweeping regulations with a promise to take into consideration how to protect copyright and intellectual property. The European Union’s proposed AI Act could set a standard by forcing disclosure when copyrighted material is used. The US Federal Trade Commission, in a letter to OpenAI this month, demanded more information on “all sources of data” for GPT. We’ll see what that turns up.

In the meantime, content licensing agreements, such as the one recently entered into by the Associated Press and OpenAI, seem like a step in the right direction, though with the terms undisclosed it’s hard to know who benefits the most.

Unlike the all-smiles agreement by the AI companies on the White House voluntary measures — which should be reason enough to be suspicious of them — tougher data disclosure requirements won’t come without heavy resistance from Silicon Valley. Content creators and the tech titans are headed for a cultural collision. OpenAI Chief Executive Officer Sam Altman gave a recent taste, writing on Twitter: “Everything ‘creative’ is a remix of things that happened in the past.”

Expect this to become both the moral justification for scraping content at will and the legal foundation for it. Tech companies argue that such use of data can be covered under “fair use,” the legal doctrine that has long allowed for building on copyrighted works as an inspiration, subject to some stipulations over its intended use.

It’s becoming clear that protections designed to help creatives are at risk of being weaponized as a justification for not paying them; for not even telling them their work has been taken at all. We’re just starting to see this defense tested in court. It can only be a fair trial if AI companies are forced to be honest about how their technology really works.
