
AI Chatbots are scraping news reporting and copyrighted content, News Media Alliance says

Editor’s Note: A version of this article first appeared in the “Reliable Sources” newsletter. Sign up for the daily digest chronicling the evolving media landscape here.

A top news media trade group is calling out A.I. technology companies for scraping news material to train their chatbots.

The News Media Alliance, which represents nearly 2,000 outlets in the U.S., published research Tuesday that found developers of generative artificial intelligence systems, such as OpenAI and Google, “have copied and used news, magazine, and digital media content to train” their bots. Importantly, the research indicated that A.I. companies have trained their bots to give far more credence to information published by those credible publishers than to material elsewhere on the web.

“The research and analysis we’ve conducted shows that AI companies and developers are not only engaging in unauthorized copying of our members’ content to train their products, but they are using it pervasively and to a greater extent than other sources,” said Danielle Coffey, chief executive of the News Media Alliance, in a statement.

“This shows they recognize our unique value, and yet most of these developers are not obtaining proper permissions through licensing agreements or compensating publishers for the use of this content,” Coffey added. “This diminishment of high-quality, human created content harms not only publishers but the sustainability of AI models themselves and the availability of reliable, trustworthy information.”

In the published white paper, the trade group also rejected arguments that A.I. bots have simply “learned” facts by reading various sets of data, like a human being would. The group said “it is inaccurate” to form such a conclusion “because models retain the expressions of facts that are contained in works in their copied training materials (and which copyright protects) without ever absorbing any underlying concepts.”

Publishers, many of which have been in a Cold War of sorts with A.I. companies, have in recent months started taking defensive measures to protect their content. In August, a Reliable Sources review found that a dozen major media companies had inserted code into their websites to safeguard their content from A.I. bots that scrape the web for information. Many more have added it since.
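The code in question is typically a set of directives in a site’s robots.txt file naming the crawlers’ user agents. As an illustration only (which crawlers each publisher actually blocks varies), a robots.txt that disallows OpenAI’s GPTBot, Google’s Google-Extended crawler, and Common Crawl’s CCBot would look like:

```
# Block OpenAI's web crawler
User-agent: GPTBot
Disallow: /

# Block Google's AI-training crawler (does not affect Search indexing)
User-agent: Google-Extended
Disallow: /

# Block Common Crawl, whose archives are widely used as AI training data
User-agent: CCBot
Disallow: /
```

Compliance is voluntary: robots.txt is a request, not an enforcement mechanism, which is part of why the trade group is pressing for licensing rather than relying on technical opt-outs alone.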

But those defensive measures only protect news organizations from future scraping. They do nothing to address the prior scraping of their reporting, which the News Media Alliance — and others — said has already been used to train A.I. bots.

To address that problem, the News Media Alliance outlined recommendations intended to protect news publishers from going extinct in this brave new world. The recommendations include policymakers recognizing that the unauthorized use of copyrighted material to train A.I. bots “is infringing” and that publishers should be able to “license the use of their content efficiently and on fair terms.”

“Our culture, our economy, and our democracy require a solution that allows the news and media industry to grow and flourish, and both to share in the profit from and participate in the development of the GAI revolution that is being built upon the fruits of its labor,” the News Media Alliance said.
