There is no doubt that artificial intelligence (AI) has become a hot topic in recent times. While most of the attention tends to surround the use of AI (and its benefits and inherent risks), the question of where AI gets its data and how it is trained is attracting greater attention.
This follows recent reports that a Melbourne-based publishing company asked its authors to allow their work to be used to train AI. The move raises important questions about copyright and intellectual property.
What can be protected as intellectual property?
According to Rebekah Gay, Partner and Joint Global Head of Intellectual Property at Herbert Smith Freehills, there are different types of intellectual property including patents, trademarks, copyright and designs. “But the one that’s most relevant for AI is copyright … and copyright protects … what we call ‘works’ …
“[W]hat it doesn’t protect is ideas, but … the expression of ideas,” she says.
Isabella Alexander, professor in the Faculty of Law at the University of Technology Sydney (UTS), says that in the context of this issue, people refer to copyright, which protects literary and artistic works such as books and illustrations. “[Y]ou get a bunch of rights with copyright, but the main one that people rely on is the right to reproduce and that’s the one … most important in the AI space because generative AI is built on enormous databases which have been used to train algorithms…,” she says.
Can AI infringe copyright?
As to whether AI can infringe copyright, Gay explains that training a machine is expected to require duplicating, or making a copy of, the original work or document. She gives the example of an article. “[T]he article has to be uploaded and fed into the artificial intelligence tool. [I]f you don’t have the permission of the author to do that, then even that act of reproducing the article to feed it into the artificial intelligence tool is a reproduction of the article without permission. So that would be copyright infringement,” she says.
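To make the mechanics Gay describes concrete, here is a minimal sketch, in Python, of a hypothetical data-ingestion step for a training corpus. The file name and function are invented for illustration; the point is simply that the full text of a work is copied before any training takes place.

```python
# Illustrative sketch only: a hypothetical ingestion step for a training
# corpus. The file name and function name are invented for illustration.
# The point is that a verbatim copy of the source text is made and stored
# before any model training occurs -- the act of reproduction Gay describes.
from pathlib import Path

def ingest_document(path: Path, corpus: list[str]) -> None:
    """Read a document and append a full copy of its text to the corpus."""
    text = path.read_text(encoding="utf-8")  # a verbatim copy now exists in memory
    corpus.append(text)                      # ...and is retained in the training set

corpus: list[str] = []
ingest_document(Path("article.txt"), corpus)  # hypothetical input file
# The corpus now contains a reproduction of the article, made before
# any model weights have been trained on it.
```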
She also questions whether a result produced by an AI tool would constitute infringement. “I think in a lot of cases it would not be (an infringement) because what comes out of the artificial intelligence tool … [is] a mash of things from a whole range of different sources,” she says. As such, she says, it is unlikely that the output of an AI tool falls within the definition of a “substantial reproduction of the original work…”
This is not the first time that AI companies have come under fire for using material without permission. In a number of well-known cases overseas, publishing companies, media outlets and individuals have taken AI companies to court over alleged copyright infringement.
Last week, it was reported that France’s top publishing and authors association filed a claim against Meta over its alleged use of copyright-protected content, without consent, to train its artificial intelligence systems.
For AI companies, the more diverse the range of data used to train AI, the better, so they seek out as many different sources as possible, including academic works. In mid-2024, it was reported that Informa, a multi-billion-dollar multinational company based in the United Kingdom that publishes a diverse range of academic and technical books and journals, had entered into a deal allowing Microsoft to use its “advanced learning content and data.”
Broader issues at play
Dr Sacha Molitorisz, senior lecturer in the Faculty of Law at UTS, says there are larger ethical and legal concerns to consider. “Gen AI comes in all sorts of different shapes and sizes … but the popularly used ones [like ChatGPT] … have in their outputs untruths because their sources are rich and varied…,” he says.
Bias and hallucinations are well-known risks when it comes to AI, but Molitorisz prefers a more straightforward term: “untruth.”
“They’re not lies because there’s no intention to deceive, but they’re definitely untrue and this can be really dangerous.
“We’ve seen lots of lawyers already get in trouble … with made up cases that they’ve cited in court or in documents…,” he says.
Molitorisz says information accuracy is crucial. “We live in this information economy and this world driven by information … When all these untruths creep in, the information that we are relying on is unreliable … it makes everything so much more difficult.”
He believes that Gen AI systems need to be accountable for the information they are putting out into the real world. “You can’t just spread untruths with impunity … we’ve seen that harms are being done,” he says.
And it’s not just lawyers who have been caught out using AI and citing made-up cases. “Students have been using Gen AI and getting kicked out of uni or disciplined because there are cases that are made up or they’re coming up with untruths.
“I’ve read essays where there are journal articles that have been cited … I know the author’s name, I know the journal name but the article being cited doesn’t exist. Clearly, Gen AI has been used so there are harms being caused [and] quite significant ones,” he says.
Molitorisz says that academics and authors are “prime sources of excellent content”, but there needs to be acknowledgement of the sources, and a way for authors to receive ongoing attribution for the use of their work.
He points out that if it took an author five years to write a book, but their research and writing is mixed in with all the “truths … untruths and other sources that are being used … that to me is a potential risk to us all as a society. … [T]o authors in terms of how their work is being used, I don’t think it … leads to a sustainable future for authors,” he says.
Current legislative framework and what’s on the horizon
According to Gay, at present there is no intellectual property legislation specific to AI. “The Copyright Act hasn’t been updated … to deal with AI. So you’re really just using the existing Copyright Act and existing legal principles and applying them to this new technology,” she says.
Gay explains that the first step authors or content creators must take is to ascertain whether their work has been used by a large language model for “training purposes”. She says there is a global debate over whether there needs to be greater transparency about the data sets used to train AI, so that authors and content creators can tell whether their works have been used. The second question relates to jurisdiction, which raises issues of its own. She gives the example of the jurisdictional issues arising out of the dispute between Getty Images and Stability AI in the United Kingdom. Getty alleged in the High Court that Stability AI used its images as training data for one of Stability AI’s products. According to Gay, one of the defences being argued is that “it didn’t happen in the UK, it happened somewhere else.”
There is clearly a need for better legislative frameworks and systems to verify the data used to train AI and to ensure its accuracy and quality. Molitorisz says change needs to happen at the government level but admits there are a lot of moving pieces. “We’ve got an election coming up in the next couple of months. … [T]hat might mean a kind of change in response [when it comes to] the risks for authors,” he says.
He also says that, as an author, “If one of those deals were put to me, I would be reluctant to sign it.”