Artificial intelligence and copyright
{{Use mdy dates|date=April 2024}}{{short description|How copyright law applies to the training and use of AI}}
In the 2020s, the [[AI boom|rapid advancement]] of [[deep learning]]-based [[generative artificial intelligence]] models raised questions about whether [[copyright infringement]] occurs when such models are trained or used. This includes [[text-to-image model]]s such as [[Stable Diffusion]] and [[large language model]]s such as [[ChatGPT]]. As of 2023, there were several pending U.S. lawsuits challenging the use of copyrighted data to train AI models, with defendants arguing that this falls under [[fair use]].<ref>{{Cite web |title=Artificial Intelligence Copyright Challenges in US Courts Surge |url=https://www.natlawreview.com/article/generative-ai-systems-tee-fair-use-fight |access-date=2024-03-19 |website=www.natlawreview.com |language=en}}</ref>


Popular deep learning models are trained on massive amounts of media [[Web scraping|scraped]] from the Internet, often including copyrighted material.<ref>{{Cite web |title=Primer: Training AI Models with Copyrighted Work |url=https://www.americanactionforum.org/insight/primer-training-ai-models-with-copyrighted-work/ |access-date=2024-03-19 |website=AAF |language=en-US}}</ref> When assembling training data, the sourcing of copyrighted works may infringe on the [[copyright holder]]'s exclusive right to control reproduction, unless covered by exceptions in relevant copyright laws. Additionally, using a model's outputs might violate copyright, and the model creator could be accused of [[vicarious liability]] and held responsible for that infringement.


== Copyright status of AI-generated works ==
Since most legal jurisdictions only grant copyright to original works of authorship by human authors, the definition of "originality" is central to the copyright status of AI-generated works.<ref>{{Cite web |title=What is the Copyright Status of AI Generated Works? |url=https://www.linkedin.com/pulse/what-copyright-status-ai-generated-works-azrightsinternational |access-date=2024-03-19 |website=www.linkedin.com |language=en}}</ref>


=== United States ===
In the U.S., the [[Copyright Act of 1976|Copyright Act]] protects "original works of authorship".<ref name="crs" /> The [[U.S. Copyright Office]] has interpreted this as being limited to works "created by a human being",<ref name="crs" /> declining to grant copyright to works generated without human intervention.<ref name="verge" /> Some have suggested that certain AI generations might be copyrightable in the U.S. and similar jurisdictions if it can be shown that the human who ran the AI program exercised sufficient originality in selecting the inputs to the AI or editing the AI's output.<ref name="verge" /><ref name="crs" />
[[File:Macaca nigra self-portrait large.jpg|thumb|The United States Copyright Office has declared that works not created by a human author, such as [[Monkey selfie copyright dispute|this "selfie" portrait taken by a monkey]], are not eligible for copyright protection.]]


Proponents of this view suggest that an AI model may be viewed as merely a tool (akin to a pen or a camera) used by its human operator to express their creative vision.<ref name="crs" /><ref name="wipo" /> For example, proponents argue that if the standard of originality can be satisfied by an artist clicking the shutter button on a camera, then perhaps artists using generative AI should get similar deference, especially if they go through multiple rounds of revision to refine their prompts to the AI.<ref name=":0">{{cite news |title=Popular A.I. services for creating images are legal minefields for artists seeking payment for their work |url=https://fortune.com/2023/06/16/generative-a-i-copyright-law/ |access-date=21 June 2023 |work=Fortune |date=2023 |language=en}}
</ref> Other proponents argue that the Copyright Office is not taking a technology neutral approach to the use of AI or [[Algorithmic bias|algorithmic]] tools. For other creative expressions (music, photography, writing) the test is effectively whether there is ''[[de minimis]],'' or limited human creativity. For works using AI tools, the Copyright Office has made the test a different one i.e. whether there is no more than ''de minimis'' technological involvement.<ref name="Ramparts">Peter Pink-Howitt, [https://ramparts.gi/copyright-ai-and-creative-generative-works/ Copyright, AI And Generative Art], ''Ramparts'', 2023.</ref>


[[File:Théâtre D’opéra Spatial.png|thumb|''[[Théâtre D'opéra Spatial]]'', 2022, created using [[Midjourney]], prompted by Jason M. Allen]]
This difference in approach can be seen in the Copyright Office's 2023 decision on a registration claim by Jason M. Allen for his work ''[[Théâtre D'opéra Spatial]]'', created using Midjourney and an upscaling tool. The Copyright Office stated: <blockquote>The Board finds that the Work contains more than a de minimis amount of content generated by artificial intelligence ("AI"), and this content must therefore be disclaimed in an application for registration. Because Mr. Allen is unwilling to disclaim the AI-generated material, the Work cannot be registered as submitted.<ref>Second Request for Reconsideration for Refusal to Register Théâtre D'opéra Spatial ([https://tmsnrt.rs/3Etzk4k SR # 1-11743923581; Correspondence ID: 1-5T5320R], 2023).</ref></blockquote>
As AI is increasingly used to generate literature, music, and other forms of art, the U.S. Copyright Office has released new guidance emphasizing whether works, including materials generated by artificial intelligence, exhibit a 'mechanical reproduction' nature or are the 'manifestation of the author's own creative conception'.<ref>[https://www.federalregister.gov/documents/2023/03/16/2023-05321/copyright-registration-guidance-works-containing-material-generated-by-artificial-intelligence Copyright Registration Guidance: Works Containing Material Generated by Artificial Intelligence], US Copyright Office 2023.</ref> The Copyright Office published this guidance as a rule in March 2023, addressing a range of issues related to the use of AI, in which it stated:


<blockquote>...because the Office receives roughly half a million applications for registration each year, it sees new trends in registration activity that may require modifying or expanding the information required to be disclosed on an application.

One such recent development is the use of sophisticated artificial intelligence ("AI") technologies capable of producing expressive material. These technologies "train" on vast quantities of preexisting human-authored works and use inferences from that training to generate new content. Some systems operate in response to a user's textual instruction, called a "prompt." 

The resulting output may be textual, visual, or audio, and is determined by the AI based on its design and the material it has been trained on. These technologies, often described as "generative AI," raise questions about whether the material they produce is protected by copyright, whether works consisting of both human-authored and AI-generated material may be registered, and what information should be provided to the Office by applicants seeking to register them.<ref>[https://www.federalregister.gov/documents/2023/03/16/2023-05321/copyright-registration-guidance-works-containing-material-generated-by-artificial-intelligence Copyright Registration Guidance: Works Containing Material Generated by Artificial Intelligence], US Copyright Office 2023.</ref></blockquote>

The [[United States Patent and Trademark Office|U.S. Patent and Trademark Office]] (USPTO) similarly codified restrictions on the [[patentability]] of inventions credited solely to AI in February 2024, following an August 2023 ruling in the case ''Thaler v. Perlmutter''. In that case, the Patent Office had denied patents generated by Stephen Thaler's AI program, [[DABUS]], because no "natural person" was named as an author. The [[U.S. Court of Appeals]] for the Federal Circuit upheld this decision.<ref>{{Cite web |title='Thaler v. Perlmutter': AI Output is Not Copyrightable |url=https://www.law.com/newyorklawjournal/2023/09/14/thaler-v-perlmutter-ai-output-is-not-copyrightable/ |access-date=2023-12-01 |website=New York Law Journal |language=en}}</ref><ref name=":1" /> In its subsequent rule-making, the USPTO allows human inventors to incorporate the output of artificial intelligence, as long as this method is appropriately documented in the patent application.<ref>{{cite web | url=https://arstechnica.com/information-technology/2024/02/us-says-ai-models-cant-hold-patents/ | title=USPTO says AI models can't hold patents | date=14 February 2024 }}</ref> However, such documentation may be virtually impossible when the inner workings of an AI and its role in the inventive process are not adequately understood or are largely unknown.<ref name=":1">{{Cite journal |last=Valinasab |first=Omid |date=2023 |title=Big Data Analytics to Automate Disclosure of Artificial Intelligence's Inventions |url=https://usfblogs.usfca.edu/iptlj/files/2023/12/2.-VALINASAB-FINAL-Big-Data.pdf |journal=University of San Francisco Intellectual Property and Technology Law Journal |volume=27 |issue=2 |pages=133–140 |via=USF LJ}}</ref>

Representative [[Adam Schiff]] proposed the [[Generative AI Copyright Disclosure Act]] in April 2024. If passed, the bill would require AI companies to submit copyrighted works to the [[Register of Copyrights]] before releasing new generative AI systems. These companies would have to file these documents 30 days before publicly showing their AI tools.<ref>{{Cite web |title=New bill would force AI companies to reveal use of copyrighted art |url=https://amp.theguardian.com/technology/2024/apr/09/artificial-intelligence-bill-copyright-art |access-date=2024-04-13 |website=amp.theguardian.com}}</ref>

=== United Kingdom ===
Other jurisdictions include explicit statutory language related to computer-generated works, including the United Kingdom's [[Copyright, Designs and Patents Act 1988]], which states:


<blockquote>In the case of a literary, dramatic, musical or artistic work which is computer-generated, the author shall be taken to be the person by whom the arrangements necessary for the creation of the work are undertaken.<ref name=wipo/></blockquote>


However, the computer-generated works provision in UK law relates to autonomous creations by computer programs. Individuals using AI tools will usually be the authors of the works, assuming they meet the minimum requirements for a copyright work. In respect of AI, the statutory language addresses whether human programmers can hold copyright in the autonomous productions of AI tools (i.e. where there is no direct human input): <blockquote>In so far as each composite frame is a computer generated work then the arrangements necessary for the creation of the work were undertaken by Mr Jones because he devised the appearance of the various elements of the game and the rules and logic by which each frame is generated and he wrote the relevant computer program. In these circumstances I am satisfied that Mr Jones is the person by whom the arrangements necessary for the creation of the works were undertaken and therefore is deemed to be the author by virtue of s.9(3)<ref>[https://www.casemine.com/judgement/uk/5a8ff74360d03e7f57eaa95f Nova Productions v Mazooma Games [2006<nowiki>]</nowiki> EWHC 24 (Ch)].</ref> </blockquote>

The UK government has consulted on the use of generative tools and AI in respect of intellectual property, leading to a proposed specialist Code of Practice:<ref>[https://www.gov.uk/guidance/the-governments-code-of-practice-on-copyright-and-ai The UK government's code of practice on copyright and AI. ] UK Government 2023.</ref> "to provide guidance to support AI firms to access copyrighted work as an input to their models, whilst ensuring there are protections on generated output to support right holders of copyrighted work".<ref name="UKChanges">[https://www.gov.uk/government/consultations/artificial-intelligence-and-ip-copyright-and-patents/outcome/artificial-intelligence-and-intellectual-property-copyright-and-patents-government-response-to-consultation Artificial Intelligence and Intellectual Property: copyright and patents: Government response to consultation]. UK Government 2023.</ref> In August 2023, the U.S. Copyright Office published a [[Notice of proposed rulemaking|notice of inquiry]] and request for comments following its 2023 Registration Guidance.<ref>https://www.govinfo.gov/content/pkg/FR-2023-08-30/pdf/2023-18624.pdf</ref>

=== China ===
On November 27, 2023, the [[Beijing]] Internet Court issued a decision recognizing copyright in AI-generated images.<ref>{{cite web|author=Aaron Wininger|url=https://www.natlawreview.com/article/beijing-internet-court-recognizes-copyright-ai-generated-images|title=Beijing Internet Court Recognizes Copyright in AI-Generated Images|work=The National Law Review|date=2023-11-29}}</ref>

As noted by a lawyer and AI art creator, the challenge for intellectual property regulators, legislators and the courts is how to protect human creativity in a technologically neutral fashion whilst considering the risks of automated AI factories. AI tools have the ability to autonomously create a range of material that is potentially subject to copyright (music, blogs, poetry, images, and technical papers) or other intellectual property rights (patents and design rights).<ref name="Ramparts" />

== Training AI with copyrighted data ==
Deep learning models [[Web scraping|source]] large data sets from the Internet, such as publicly available images and the text of web pages. The text and images are then converted into [[Binary code|numeric formats]] the AI can analyze. A deep learning model identifies patterns linking the encoded text and image data and learns which text concepts correspond to elements in images. Through repetitive testing, the model refines its accuracy by matching images to text descriptions. The trained model then undergoes validation to evaluate its skill in generating or manipulating new images using only text prompts provided after the training process.<ref>{{Cite web |last=Takyar |first=Akash |date=2023-11-07 |title=Model validation techniques in machine learning |url=https://www.leewayhertz.com/model-validation-in-machine-learning/ |access-date=2024-03-20 |website=LeewayHertz - AI Development Company |language=en-US}}</ref> Because assembling these [[training dataset]]s involves making copies of copyrighted works, this has raised the question of whether this process infringes the copyright holder's exclusive right to make reproductions of their works, or whether it falls under [[fair use]] allowances.<ref>{{cite journal | title = Generative AI Art: Copyright Infringement and Fair Use | first = Michael | last = Murray | volume = 26 | issue = 2 | journal = SMU Science and Technology Law Review | date = 2023 | page = 259 | doi = 10.25172/smustlr.26.2.4 }}</ref><ref>{{cite journal | title = Foundation Models and Fair Use | first1 = Peter | last1 = Henderson | first2 = Xuechen | last2 = Li | first3 = Dan | last3 = Jurafsky | first4 = Tatsunori | last4 = Hashimoto | first5 = Mark A. | last5 = Lemley | first6 = Percy | last6 = Liang | journal = Journal of Machine Learning Research | volume = 24 | issue = 400 | date = 2023 | pages = 1–79 | arxiv = 2303.15715 | url = http://jmlr.org/papers/v24/23-0569.html | accessdate = September 14, 2024 }}</ref>

=== United States ===
U.S. machine learning developers have traditionally believed this to be allowable under fair use because using copyrighted work is [[Transformative use|transformative]], and limited.<ref name="Vincent">{{Cite web |last=Vincent |first=James |date=2022-11-15 |title=The scary truth about AI copyright is nobody knows what will happen next |url=https://www.theverge.com/23444685/generative-ai-copyright-infringement-legal-fair-use-training-data |access-date=2024-03-20 |website=The Verge |language=en}}</ref> The situation has been compared to [[Google Books]]'s scanning of copyrighted books in ''[[Authors Guild, Inc. v. Google, Inc.]]'', which was ultimately found to be fair use, because the scanned content was not made publicly available, and the use was non-expressive.<ref>{{Cite web |last=Lee |first=Timothy B. |date=2023-04-03 |title=Stable Diffusion copyright lawsuits could be a legal earthquake for AI |url=https://arstechnica.com/tech-policy/2023/04/stable-diffusion-copyright-lawsuits-could-be-a-legal-earthquake-for-ai/ |access-date=2024-03-20 |website=Ars Technica |language=en-us}}</ref>

Timothy B. Lee, in ''[[Ars Technica]]'', argues that if the [[Plaintiff|plaintiffs]] succeed, this may shift the balance of power in favour of large corporations such as Google, Microsoft, and Meta which can afford to license large amounts of training data from copyright holders and leverage their proprietary datasets of user-generated data.<ref>{{Cite web |last=Lee |first=Timothy B. |date=2023-04-03 |title=Stable Diffusion copyright lawsuits could be a legal earthquake for AI |url=https://arstechnica.com/tech-policy/2023/04/stable-diffusion-copyright-lawsuits-could-be-a-legal-earthquake-for-ai/ |access-date=2024-03-20 |website=Ars Technica |language=en-us}}</ref> [[Intellectual property|IP]] scholars Bryan Casey and [[Mark Lemley]] argue in the ''[[Texas Law Review]]'' that datasets are so large that "there is no plausible option simply to license all [of the data...]. So allowing [any generative training] copyright claim is tantamount to saying, not that copyright owners will get paid, but that the use won't be permitted at all."<ref>{{Cite journal |last1=Lemley |first1=Mark A. |last2=Casey |first2=Bryan |date=2020 |title=Fair Learning |url=http://dx.doi.org/10.2139/ssrn.3528447 |journal=SSRN Electronic Journal |doi=10.2139/ssrn.3528447 |issn=1556-5068}}</ref> Other scholars disagree; some predict a similar outcome to the U.S. [[music licensing]] procedures.<ref name="Vincent"/>


Several jurisdictions have explicitly incorporated exceptions allowing for "text and [[data mining]]" (TDM) in their copyright statutes, including the United Kingdom, Germany, Japan, and the EU.<ref name=aiip/>


=== European Union ===
In the EU, such TDM exceptions form part of the 2019 [[Directive on Copyright in the Digital Single Market]].<ref name=":2">{{Cite web |last=Goldstein |first=Paul |author-link=Paul Goldstein (law professor) |last2=Stuetzle |first2=Christiane |last3=Bischoff |first3=Susan |date=2024-11-13 |title=Kneschke vs. LAION - Landmark Ruling on TDM exceptions for AI training data – Part 1 |url=https://copyrightblog.kluweriplaw.com/2024/11/13/kneschke-vs-laion-landmark-ruling-on-tdm-exceptions-for-ai-training-data-part-1/ |access-date=2024-11-25 |website=Kluwer Copyright Blog |language=en-US}}</ref> They are specifically referred to in the EU's [[Artificial Intelligence Act|AI Act]] (which came into force in 2024), a reference that "is widely seen as a clear indication of the EU legislator’s intention that the exception covers AI data collection", a view also endorsed in a 2024 German court decision, ''Kneschke v. LAION''.<ref name=":3">{{Cite web |last=Goldstein |first=Paul |author-link=Paul Goldstein (law professor) |last2=Stuetzle |first2=Christiane |last3=Bischoff |first3=Susan |date=2024-11-14 |title=Kneschke vs. LAION - Landmark Ruling on TDM exceptions for AI training data – Part 2 |url=https://copyrightblog.kluweriplaw.com/2024/11/14/kneschke-vs-laion-landmark-ruling-on-tdm-exceptions-for-ai-training-data-part-2/ |access-date=2024-11-25 |website=Kluwer Copyright Blog |language=en-US}}</ref> Unlike the TDM exception for scientific research, the more general exception covering commercial AI applies only if the copyright holder has not opted out.<ref name=":3" /> As of June 2023, a clause in the draft AI Act required generative AI providers to "make available summaries of the copyrighted material that was used to train their systems".<ref>{{Cite book |last=Mitsunaga |first=Takuho |chapter=Heuristic Analysis for Security, Privacy and Bias of Text Generative AI: GhatGPT-3.5 case as of June 2023 |date=2023-10-09 |pages=301–305 |title=2023 IEEE International Conference on Computing (ICOCO) |chapter-url=http://dx.doi.org/10.1109/icoco59262.2023.10397858 |publisher=IEEE |doi=10.1109/icoco59262.2023.10397858|isbn=979-8-3503-0268-4 }}</ref>


=== United Kingdom ===
Unlike the EU, the United Kingdom prohibits data mining for commercial purposes, but has proposed changing this to support the development of AI: "For text and data mining, we plan to introduce a new copyright and database exception which allows TDM for any purpose. Rights holders will still have safeguards to protect their content, including a requirement for lawful access."<ref>{{Cite web |title=Artificial Intelligence and Intellectual Property: copyright and patents: Government response to consultation |url=https://www.gov.uk/government/consultations/artificial-intelligence-and-ip-copyright-and-patents/outcome/artificial-intelligence-and-intellectual-property-copyright-and-patents-government-response-to-consultation |access-date=2024-03-20 |website=GOV.UK |language=en}}</ref>


==Copyright infringing AI outputs==
{{multiple image
| total_width = 300
| footer = In rare cases, generative AI models may produce outputs that are virtually identical to images from their training set. The research paper from which this example was taken was able to produce similar replications for only 0.03% of training images.<ref name=earthquake/>
| image1 = Anne Graham Lotz (October 2008).jpg
| alt1 =
[[File:An astronaut riding a horse (Picasso) 2022-08-28.png|thumb|An image generated by [[Stable Diffusion]] using the prompt "an astronaut riding a horse, by Picasso". Generative image models are adept at imitating the visual style of particular artists in their training set.]]


In some cases, deep learning models may replicate items in their training set when generating output. This behaviour is generally considered an undesired [[overfitting]] of a model by AI developers, and has in previous generations of AI been considered a manageable problem.<ref>See for example OpenAI's comment in the year of [[GPT-2]]'s release: {{cite report |author=OpenAI |date=2019 |title=Comment Regarding Request for Comments on Intellectual Property Protection for Artificial Intelligence Innovation |docket=PTO–C–2019–0038 |publisher=United States Patent and Trademark Office |url=https://www.uspto.gov/sites/default/files/documents/OpenAI_RFC-84-FR-58141.pdf#page=9 |page=9 |quote=Well-constructed AI systems generally do not regenerate, in any nontrivial portion, unaltered data from any particular work in their training corpus}}</ref> [[Large language model#Memorization and copyright|Memorization]] is the emergent tendency of LLMs to repeat long strings of training data, and is no longer regarded as simply a form of overfitting.<ref>{{harvnb|Hans|Wen|Jain|Kirchenbauer|2024|loc=§2.3}}</ref> Evaluations of controlled LLM output measure the amount memorized from training data (focused on [[GPT-2]]-series models) as variously over 1% for exact duplicates<ref>{{cite journal |last1=Peng |first1=Zhencan |last2=Wang |first2=Zhizhi |last3=Deng |first3=Dong |title=Near-Duplicate Sequence Search at Scale for Large Language Model Memorization Evaluation |journal=Proceedings of the ACM on Management of Data |date=13 June 2023 |volume=1 |issue=2 |pages=1–18 |doi=10.1145/3589324 |s2cid=259213212 |url=https://people.cs.rutgers.edu/~dd903/assets/papers/sigmod23.pdf |access-date=2024-01-20 |archive-date=2024-08-27 |archive-url=https://web.archive.org/web/20240827053753/https://people.cs.rutgers.edu/~dd903/assets/papers/sigmod23.pdf |url-status=live }} Citing Lee et al 2022.</ref> or up to about 7%.<ref>{{harvnb|Peng|Wang|Deng|2023|p=8}}.</ref> This is potentially both a security risk and a copyright risk, for users and providers alike.<ref name=Hans2024>{{cite arXiv |last1=Hans |first1=Abhimanyu |last2=Wen |first2=Yuxin |last3=Jain |first3=Neel |last4=Kirchenbauer |first4=John |last5=Kazemi |first5=Hamid |last6=Singhania |first6=Prajwal |last7=Singh |first7=Siddharth |last8=Somepalli |first8=Gowthami |last9=Geiping |first9=Jonas |last10=Bhatele |first10=Abhinav |last11=Goldstein |first11=Tom |title=Be like a Goldfish, Don't Memorize! Mitigating Memorization in Generative LLMs |date=2024-06-14 |eprint=2406.10209 |class=cs.CL |postscript=, <!--|doi=10.48550/arXiv.2406.10209-->}} §1.</ref> {{Asof|August 2023}}, major consumer LLMs have attempted to mitigate these problems, but researchers have still been able to prompt the leakage of copyrighted material.<ref>{{cite web |last=Hays |first=Kali |date=2023-08-15 |title=ByteDance AI researchers say OpenAI now tries to hide that ChatGPT was trained on J.K. Rowling's copyrighted Harry Potter books |work=Business Insider |url=https://www.businessinsider.com/openais-latest-chatgpt-version-hides-training-on-copyrighted-material-2023-8 |access-date=2024-09-15 |postscript=,}} citing {{cite arXiv |last1=Liu |first1=Yang |last2=Yao |first2=Yuanshun |last3=Ton |first3=Jean-Francois |last4=Zhang |first4=Xiaoying |last5=Guo |first5=Ruocheng |last6=Cheng |first6=Hao |last7=Klochkov |first7=Yegor |last8=Taufiq |first8=Muhammad Faaiz |last9=Li |first9=Hang |title=Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models' Alignment |date=2023-08-10 |eprint=2308.05374v2 |class=cs.AI <!--|doi=10.48550/arXiv.2308.05374--> <!--also url=https://huggingface.co/papers/2308.05374 -->}}</ref>


Under U.S. law, to prove that an AI output infringes a copyright, a plaintiff must show the copyrighted work was "actually copied", meaning that the AI generates output which is "substantially similar" to their work, and that the AI had access to their work.<ref name="crs" />


In the course of learning to statistically model the data on which they are trained, deep generative AI models may learn to imitate the distinct style of particular authors in the training set. Since [[Copyright protection for fictional characters|fictional characters enjoy some copyright protection]] in the U.S. and other jurisdictions, an AI may also produce infringing content in the form of novel works which incorporate fictional characters.<ref name="crs" /><ref name="earthquake" />
A generative image model such as Stable Diffusion is able to model the stylistic characteristics of an artist like [[Pablo Picasso]] (including his particular brush strokes, use of colour, perspective, and so on), and a user can engineer a prompt such as "an astronaut riding a horse, by Picasso" to cause the model to generate a novel image applying the artist's style to an arbitrary subject. However, an artist's overall style is generally not subject to copyright protection.<ref name="crs" />


==Litigation==
* A November 2022 class action lawsuit against [[Microsoft]], [[GitHub]] and [[OpenAI]] alleged that [[GitHub Copilot]], an AI-powered code editing tool trained on public GitHub repositories, violated the copyright of the repositories' authors, noting that the tool was able to generate source code which matched its training data verbatim, without providing attribution.<ref name="Verge copilot"/>
* In January 2023 three US artists—[[Sarah Andersen]], [[Kelly McKernan]], and Karla Ortiz—filed a class action [[copyright infringement]] lawsuit against [[Stability AI]], [[Midjourney]], and [[DeviantArt]], claiming that these companies had infringed the rights of millions of artists by training AI tools on five billion images scraped from the web without the consent of the original artists.<ref name=verge3/> The plaintiffs' complaint has been criticized for technical inaccuracies, such as incorrectly claiming that "a trained diffusion model can produce a copy of any of its Training Images", and describing Stable Diffusion as "merely a complex collage tool".<ref name=ars3/> In addition to copyright infringement, the plaintiffs allege unlawful competition and violation of their [[right of publicity]] in relation to AI tools' ability to create works in the style of the plaintiffs ''en masse''.<ref name=ars3/> In July 2023, U.S. District Judge [[William Orrick III|William Orrick]] indicated he was inclined to dismiss most of the lawsuit filed by Andersen, McKernan, and Ortiz but allowed them to file a new complaint.<ref>{{Cite news |last=Brittain |first=Blake |date=2023-07-19 |title=US judge finds flaws in artists' lawsuit against AI companies |language=en |work=Reuters |url=https://www.reuters.com/legal/litigation/us-judge-finds-flaws-artists-lawsuit-against-ai-companies-2023-07-19/ |access-date=2023-08-06}}</ref> Judge Orrick later dismissed all but one claim, that of copyright infringement against Stability AI, in October 2023.<ref>{{cite web | url = https://www.hollywoodreporter.com/business/business-news/artists-copyright-infringement-case-ai-art-generators-1235632929/ | title = Artists Lose First Round of Copyright Infringement Case Against AI Art Generators | first = Winston | last = Cho | date = October 30, 2023 | accessdate = April 30, 2024 | work = [[Hollywood Reporter]] }}</ref> However, after the plaintiffs refiled some of the eliminated claims, Orrick agreed in August 2024 to allow several of them to proceed, covering both copyright and trademark infringement.<ref>{{cite web | url = https://www.theverge.com/2024/8/13/24219520/stability-midjourney-artist-lawsuit-copyright-trademark-claims-approved | title = Artists' lawsuit against Stability AI and Midjourney gets more punch | first = Adi | last = Robertson | date = August 13, 2024 | accessdate = August 13, 2024 | work = [[The Verge]] }}</ref>
* In January 2023, Stability AI was sued in London by [[Getty Images]] for using its images in their training data without purchasing a license.<ref name="cnn-getty"/><ref name="GettyPress23"/>
* Getty filed another suit against Stability AI in a U.S. district court in Delaware in February 2023. The suit again alleges copyright infringement for the use of Getty's images in the training of Stable Diffusion, and further argues that the model infringes Getty's [[trademark]] by generating images with Getty's [[watermark]].<ref name=ars-getty/>
* In July 2023, authors [[Paul Tremblay]] and [[Mona Awad]] filed a lawsuit in a San Francisco court against OpenAI, alleging that its ChatGPT language model had been trained on their copyrighted books without permission, citing ChatGPT's "very accurate" summaries of their works as evidence.<ref>{{Cite web |last=Ngila |first=Faustine |date=2023-07-06 |title=The copyright battles against OpenAI have begun |url=https://qz.com/openai-lawsuit-copyright-books-chatgpt-generative-ai-1850609334 |access-date=2024-11-25 |website=Quartz |language=en}}</ref><ref>{{Cite web |last=Kris |first=Jimmy |date=2023-07-06 |title=OpenAI faces copyright lawsuit from authors Mona Awad and Paul Tremblay |url=https://dailyai.com/2023/07/openai-faces-copyright-lawsuit-from-authors-mona-awad-and-paul-tremblay/ |access-date=2023-07-10 |website=DailyAi |language=en}}</ref> Two separate lawsuits were filed by authors [[Sarah Silverman]], [[Christopher Golden]] and [[Richard Kadrey]] against Meta and OpenAI, arguing that, in addition to copyright infringement for training their engines on their works, products produced from the AI engines were derivative works and also copyright infringements.<ref>{{Cite web | url = https://www.theverge.com/2023/7/9/23788741/sarah-silverman-openai-meta-chatgpt-llama-copyright-infringement-chatbots-artificial-intelligence-ai | title = Sarah Silverman is suing OpenAI and Meta for copyright infringement | first = Wes | last = Davis | date = July 9, 2023 | accessdate = April 30, 2024 | work = [[The Verge]] }}</ref> The two suits against OpenAI were combined (during which Awad left the suit) and by February 2024, Judge [[Araceli Martínez-Olguín]] of the Northern District of California threw out all but one claim related to the use of the authors' copyrighted works as part of the training data for the AI model.<ref>{{cite web | url = https://www.theverge.com/2024/2/13/24072131/sarah-silverman-paul-tremblay-openai-chatgpt-copyright-lawsuit | title = Sarah
Silverman's lawsuit against OpenAI partially dismissed | first = Emilla | last = David | date = February 13, 2024 | accessdate = April 30, 2024 | work = [[The Verge]] }}</ref>
* ''[[The New York Times]]'' sued Microsoft and OpenAI in December 2023, claiming that their engines were trained on wholesale articles from the ''Times'', which the ''Times'' considers infringement of its copyright. The ''Times'' further claimed that fair use claims made by these AI companies were invalid since the generated information around news stories directly competes with the ''Times'' and impacts the newspaper's commercial opportunities.<ref>{{cite news | url = https://apnews.com/article/openai-new-york-times-chatgpt-lawsuit-grisham-nyt-69f78c404ace42c0070fdfb9dd4caeb7 | title = New York Times and authors on 'fair use' of copyrighted works | first = Matt | last = O'Brien | date = January 10, 2024 | accessdate = April 30, 2024 | work = [[Associated Press News]] }}</ref>
* Eight U.S. national newspapers owned by [[Tribune Publishing]] sued Microsoft and OpenAI in April 2024 over copyright infringement related to the use of their news articles for training data, as well as for output that creates false and misleading statements that are attributed to the newspapers.<ref>{{cite news | url = https://www.npr.org/2024/04/30/1248141220/lawsuit-openai-microsoft-copyright-infringement-newspaper-tribune-post | title = Eight newspapers sue OpenAI, Microsoft for copyright infringement | first = Bobby | last = Allyn | date = April 30, 2024 | accessdate = April 30, 2024 | work = [[NPR]] }}</ref>
* The [[Recording Industry Association of America]] (RIAA) and several major music labels sued the developers of [[Suno AI]] and [[Udio]], AI models that can take text input to create songs with both lyrics and backing music, in separate lawsuits in June 2024, alleging that both AI models were trained without consent on music from the labels.<ref>{{cite web | url = https://www.theverge.com/2024/6/24/24184710/riaa-ai-lawsuit-suno-udio-copyright-umg-sony-warner | title = Major record labels sue AI company behind 'BBL Drizzy' | first = Mia | last = Sato | date = June 24, 2024 | access-date = June 24, 2024 | work = [[The Verge]] }}</ref>
* In September 2024, the [[Regional Court of Hamburg]] dismissed a German photographer's lawsuit against the non-profit organization [[LAION]] for unauthorized reproduction of his copyrighted work while creating a dataset for AI training.<ref name=":2" /> The decision was described as a "landmark ruling on TDM exceptions for AI training data" in Germany and the EU more generally.<ref name=":2" />


==References==
|url=https://crsreports.congress.gov/product/pdf/LSB/LSB10922
}}</ref>

<!-- <ref name=aiip>{{cite book
|title=Artificial Intelligence and Intellectual Property
|editor-last1=Lee|editor-first1=Jyh-An|editor-last2=Hilty|editor-first2=Reto|editor-last3=Liu|editor-first3=Kung-Chung
|doi=10.1093/oso/9780198870944.001.0001
|year=2021
|isbn=978-0-19-887094-4 }}</ref> -->

<ref name=earthquake>{{cite web
|work=Ars Technica


==External links==
* [[Pamela Samuelson]]: [https://www.youtube.com/watch?v=S7Zp_vGUnrY Will Copyright Derail Generative AI Technologies?] (Presentation at a [[Simons Institute for the Theory of Computing|Simons Institute]] workshop on "Alignment, Trust, Watermarking, and Copyright Issues in LLMs", October 17, 2024), with an overview of 32 lawsuits then ongoing in the US
* [https://dockets.justia.com/docket/delaware/dedce/1:2023cv00135/81407 Getty Images (US), Inc. v. Stability AI, Inc.] filings
* [https://www.wipo.int/edocs/pubdocs/en/wipo-pub-2003-en-getting-the-innovation-ecosystem-ready-for-ai.pdf Getting the Innovation Ecosystem Ready for AI: An IP policy toolkit] – WIPO 2024


[[Category:Artificial intelligence]]

Latest revision as of 05:12, 25 November 2024


Popular deep learning models are trained on vast amounts of media scraped from the Internet, often including copyrighted material.[2] When assembling training data, the sourcing of copyrighted works may infringe on the copyright holder's exclusive right to control reproduction, unless covered by exceptions in relevant copyright laws. Additionally, using a model's outputs might violate copyright, and the model creator could be accused of vicarious liability and held responsible for that infringement.

The United States Copyright Office has declared that works not created by a human author, like this "selfie" portrait taken by a monkey, are not eligible for copyright protection.

Since most legal jurisdictions only grant copyright to original works of authorship by human authors, the definition of "originality" is central to the copyright status of AI-generated works.[3]

United States


In the U.S., the Copyright Act protects "original works of authorship".[4] The U.S. Copyright Office has interpreted this as being limited to works "created by a human being",[4] declining to grant copyright to works generated without human intervention.[5] Some have suggested that certain AI generations might be copyrightable in the U.S. and similar jurisdictions if it can be shown that the human who ran the AI program exercised sufficient originality in selecting the inputs to the AI or editing the AI's output.[5][4]

Proponents of this view suggest that an AI model may be viewed as merely a tool (akin to a pen or a camera) used by its human operator to express their creative vision.[4][6] For example, proponents argue that if the standard of originality can be satisfied by an artist clicking the shutter button on a camera, then perhaps artists using generative AI should get similar deference, especially if they go through multiple rounds of revision to refine their prompts to the AI.[7] Other proponents argue that the Copyright Office is not taking a technology-neutral approach to the use of AI or algorithmic tools. For other creative expressions (music, photography, writing), the test is effectively whether there is at least de minimis, i.e. limited, human creativity. For works using AI tools, the Copyright Office has made the test a different one: whether there is no more than de minimis technological involvement.[8]

Théâtre D'opéra Spatial, 2022, created using Midjourney, prompted by Jason M. Allen

This difference in approach can be seen in the recent decision in respect of a registration claim by Jason Matthew Allen for his work Théâtre D'opéra Spatial created using Midjourney and an upscaling tool. The Copyright Office stated:

The Board finds that the Work contains more than a de minimis amount of content generated by artificial intelligence ("AI"), and this content must therefore be disclaimed in an application for registration. Because Mr. Allen is unwilling to disclaim the AI-generated material, the Work cannot be registered as submitted.[9]

As AI is increasingly used to generate literature, music, and other forms of art, the U.S. Copyright Office has released new guidance emphasizing whether works, including materials generated by artificial intelligence, exhibit a 'mechanical reproduction' nature or are the 'manifestation of the author's own creative conception'.[10] The U.S. Copyright Office published a Rule in March 2023 on a range of issues related to the use of AI, where they stated:

...because the Office receives roughly half a million applications for registration each year, it sees new trends in registration activity that may require modifying or expanding the information required to be disclosed on an application.

One such recent development is the use of sophisticated artificial intelligence ("AI") technologies capable of producing expressive material. These technologies "train" on vast quantities of preexisting human-authored works and use inferences from that training to generate new content. Some systems operate in response to a user's textual instruction, called a "prompt." 

The resulting output may be textual, visual, or audio, and is determined by the AI based on its design and the material it has been trained on. These technologies, often described as "generative AI," raise questions about whether the material they produce is protected by copyright, whether works consisting of both human-authored and AI-generated material may be registered, and what information should be provided to the Office by applicants seeking to register them.[11]

The U.S. Patent and Trademark Office (USPTO) similarly codified restrictions on the patentability of inventions credited solely to AI in February 2024, following the August 2022 ruling in the case Thaler v. Vidal. In that case, the Patent Office had denied patent applications naming Stephen Thaler's AI program, DABUS, as the inventor, due to the lack of a "natural person" among the inventors; the U.S. Court of Appeals for the Federal Circuit upheld this decision.[12][13] In the subsequent rule-making, the USPTO allows human inventors to incorporate the output of artificial intelligence, as long as this is appropriately documented in the patent application.[14] However, such documentation may become virtually impossible when the inner workings and the use of AI in the inventive process are not adequately understood or are largely unknown.[13]

Representative Adam Schiff proposed the Generative AI Copyright Disclosure Act in April 2024. If passed, the bill would require AI companies to submit copyrighted works to the Register of Copyrights before releasing new generative AI systems. These companies would have to file these documents 30 days before publicly showing their AI tools.[15]

United Kingdom


Other jurisdictions include explicit statutory language related to computer-generated works, including the United Kingdom's Copyright, Designs and Patents Act 1988, which states:

In the case of a literary, dramatic, musical or artistic work which is computer-generated, the author shall be taken to be the person by whom the arrangements necessary for the creation of the work are undertaken.[6]

However, the computer-generated works provision of UK law relates to autonomous creations by computer programs. Individuals using AI tools will usually be the authors of the resulting works, assuming those works meet the minimum requirements for copyright protection. In respect of AI, the computer-generated works language concerns the ability of human programmers to hold copyright in the autonomous productions of AI tools (i.e. where there is no direct human input):

In so far as each composite frame is a computer generated work then the arrangements necessary for the creation of the work were undertaken by Mr Jones because he devised the appearance of the various elements of the game and the rules and logic by which each frame is generated and he wrote the relevant computer program. In these circumstances I am satisfied that Mr Jones is the person by whom the arrangements necessary for the creation of the works were undertaken and therefore is deemed to be the author by virtue of s.9(3)[16]

The UK government has consulted on the use of generative tools and AI in respect of intellectual property leading to a proposed specialist Code of Practice:[17] "to provide guidance to support AI firms to access copyrighted work as an input to their models, whilst ensuring there are protections on generated output to support right holders of copyrighted work".[18] The U.S. Copyright Office recently[when?] published a notice of inquiry and request for comments following its 2023 Registration Guidance.[19]

China


On November 27, 2023, the Beijing Internet Court issued a decision recognizing copyright in AI-generated images in a litigation.[20]

As noted by a lawyer and AI art creator, the challenge for intellectual property regulators, legislators and the courts is how to protect human creativity in a technologically neutral fashion whilst considering the risks of automated AI factories. AI tools have the ability to autonomously create a range of material that is potentially subject to copyright (music, blogs, poetry, images, and technical papers) or other intellectual property rights (patents and design rights).[8]

Training AI with copyrighted data


Deep learning models source large data sets from the Internet, such as publicly available images and the text of web pages. The text and images are then converted into numeric formats the AI can analyze. A deep learning model identifies patterns linking the encoded text and image data and learns which text concepts correspond to elements in images. Through repetitive testing, the model refines its accuracy by matching images to text descriptions. After the training process, the trained model undergoes validation to evaluate its skill in generating or manipulating new images using only text prompts.[21] Because assembling these training datasets involves making copies of copyrighted works, this has raised the question of whether this process infringes the copyright holder's exclusive right to make reproductions of their works, or if it falls under fair use allowances.[22][23]
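The correspondence-learning step described above can be illustrated with a toy example: a loose sketch of contrastive text-image matching, not any particular system's implementation. The three-dimensional vectors and file names are hypothetical stand-ins for learned embeddings with hundreds of dimensions:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings: training has pushed each image's vector toward
# the vectors of the text concepts it depicts.
image_embeddings = {
    "photo_of_dog.jpg": [0.9, 0.1, 0.0],
    "photo_of_cat.jpg": [0.1, 0.9, 0.0],
}
prompt_embedding = [0.8, 0.2, 0.1]  # encodes a prompt such as "a dog"

# A well-trained model scores the matching image highest.
best = max(image_embeddings, key=lambda name: cosine(prompt_embedding, image_embeddings[name]))
print(best)  # photo_of_dog.jpg
```

Generation then runs this correspondence in reverse: rather than retrieving the closest image, the model synthesizes one whose embedding aligns with the prompt's.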

United States


U.S. machine learning developers have traditionally believed this to be allowable under fair use because the use of copyrighted work is transformative and limited.[24] The situation has been compared to Google Books' scanning of copyrighted books in Authors Guild, Inc. v. Google, Inc., which was ultimately found to be fair use because the scanned content was not made publicly available and the use was non-expressive.[25]

Timothy B. Lee, in Ars Technica, argues that if the plaintiffs succeed, this may shift the balance of power in favour of large corporations such as Google, Microsoft, and Meta which can afford to license large amounts of training data from copyright holders and leverage their proprietary datasets of user-generated data.[26] IP scholars Bryan Casey and Mark Lemley argue in the Texas Law Review that datasets are so large that "there is no plausible option simply to license all [of the data...]. So allowing [any generative training] copyright claim is tantamount to saying, not that copyright owners will get paid, but that the use won't be permitted at all."[27] Other scholars disagree; some predict a similar outcome to the U.S. music licensing procedures.[24]

Several jurisdictions have explicitly incorporated exceptions allowing for "text and data mining" (TDM) in their copyright statutes including the United Kingdom, Germany, Japan, and the EU.

EU


In the EU, such TDM exceptions form part of the 2019 Directive on Copyright in the Digital Single Market.[28] They are specifically referred to in the EU's AI Act (which came into force in 2024), which "is widely seen as a clear indication of the EU legislator’s intention that the exception covers AI data collection", a view that was also endorsed in a 2024 German court decision.[29] Unlike the TDM exception for scientific research, the more general exception covering commercial AI only applies if the copyright holder has not opted out.[29] As of June 2023, a clause in the draft AI Act required generative AI to "make available summaries of the copyrighted material that was used to train their systems".[30]

UK


Unlike the EU, the United Kingdom prohibits data mining for commercial purposes but has proposed this should be changed to support the development of AI: "For text and data mining, we plan to introduce a new copyright and database exception which allows TDM for any purpose. Rights holders will still have safeguards to protect their content, including a requirement for lawful access."[31]

A photograph of Anne Graham Lotz included in Stable Diffusion's training set
An image generated by Stable Diffusion using the prompt "Anne Graham Lotz"
Generative AI models may produce outputs that are virtually identical to images from their training set. The research paper from which this example was taken was able to produce similar replications for only 0.03% of training images.[32]
An image generated by Stable Diffusion using the prompt "an astronaut riding a horse, by Picasso". Generative image models are adept at imitating the visual style of particular artists in their training set.

In some cases, deep learning models may replicate items in their training set when generating output. This behaviour is generally considered an undesired overfitting of a model by AI developers, and has in previous generations of AI been considered a manageable problem.[33] Memorization is the emergent phenomenon of LLMs to repeat long strings of training data, and it is no longer related to overfitting.[34] Evaluations of controlled LLM output measure the amount memorized from training data (focused on GPT-2-series models) as variously over 1% for exact duplicates[35] or up to about 7%.[36] This is potentially a security risk and a copyright risk, for both users and providers.[37] As of August 2023, major consumer LLMs have attempted to mitigate these problems, but researchers have still been able to prompt leakage of copyrighted material.[38]

Under U.S. law, to prove that an AI output infringes a copyright, a plaintiff must show the copyrighted work was "actually copied", meaning that the AI generates output which is "substantially similar" to their work, and that the AI had access to their work.[4]

In the course of learning to statistically model the data on which they are trained, deep generative AI models may learn to imitate the distinct style of particular authors in the training set. Since fictional characters enjoy some copyright protection in the U.S. and other jurisdictions, an AI may also produce infringing content in the form of novel works which incorporate fictional characters.[4][32]

A generative image model such as Stable Diffusion is able to model the stylistic characteristics of an artist like Pablo Picasso (including his particular brush strokes, use of colour, perspective, and so on), and a user can engineer a prompt such as "an astronaut riding a horse, by Picasso" to cause the model to generate a novel image applying the artist's style to an arbitrary subject. However, an artist's overall style is generally not subject to copyright protection.[4]

Litigation

[edit]
  • A November 2022 class action lawsuit against Microsoft, GitHub and OpenAI alleged that GitHub Copilot, an AI-powered code editing tool trained on public GitHub repositories, violated the copyright of the repositories' authors, noting that the tool was able to generate source code which matched its training data verbatim, without providing attribution.[39]
  • In January 2023 three US artists—Sarah Andersen, Kelly McKernan, and Karla Ortiz—filed a class action copyright infringement lawsuit against Stability AI, Midjourney, and DeviantArt, claiming that these companies have infringed the rights of millions of artists by training AI tools on five billion images scraped from the web without the consent of the original artists.[40] The plaintiffs' complaint has been criticized for technical inaccuracies, such as incorrectly claiming that "a trained diffusion model can produce a copy of any of its Training Images", and describing Stable Diffusion as "merely a complex collage tool".[41] In addition to copyright infringement, the plaintiffs allege unlawful competition and violation of their right of publicity in relation to AI tools' ability to create works in the style of the plaintiffs en masse.[41] In July 2023, U.S. District Judge William Orrick inclined to dismiss most of the lawsuit filed by Andersen, McKernan, and Ortiz but allowed them to file a new complaint.[42] Judge Orrick later dismissed all but one claim, that of copyright infringement towards Stability AI, in October 2023.[43] However, after refiling on some of the eliminated claims, Orrick agreed in August 2024 to include some of these additional claims against the AI companies, which included both copyright and trademark infringements.[44]
  • In January 2023, Stability AI was sued in London by Getty Images for using its images in their training data without purchasing a license.[45][46]
  • Getty filed another suit against Stability AI in a U.S. district court in Delaware in February 2023. The suit again alleges copyright infringement for the use of Getty's images in the training of Stable Diffusion, and further argues that the model infringes Getty's trademark by generating images with Getty's watermark.[47]
  • In July 2023, authors Paul Tremblay and Mona Awad filed a lawsuit in a San Francisco court against OpenAI, alleging that its ChatGPT language model had been trained on their copyrighted books without permission, citing ChatGPT's "very accurate" summaries of their works as evidence.[48][49] Two separate lawsuits were filed by authors Sarah Silverman, Christopher Golden and Richard Kadrey against Meta and OpenAI, arguing that in addition to copyright infringement for training their engines on their works, that products produced from the AI engines were derivative works and also copyright infringements.[50] The two suits against OpenAI were combined (during which Awad left the suit) and by February 2024, Judge Araceli Martínez-Olguín of the Northern District of California threw out all but one claim related to the use of the author's copyrighted works as part of the training data for the AI model.[51]
  • The New York Times sued Microsoft and OpenAI in December 2023, claiming that their engines were trained wholesale on articles from the Times, which the Times considers infringement of its copyright. The Times further claimed that fair use defenses raised by these AI companies were invalid, since the generated information around news stories directly competes with the Times and impacts the newspaper's commercial opportunities.[52]
  • Eight U.S. daily newspapers owned by Alden Global Capital, including Tribune Publishing titles, sued Microsoft and OpenAI in April 2024 over copyright infringement related to the use of their news articles as training data, as well as over output that creates false and misleading statements attributed to the newspapers.[53]
  • The Recording Industry Association of America (RIAA) and several major music labels sued the developers of Suno AI and Udio, AI models that can take text input and create songs with both lyrics and backing music, in separate lawsuits in June 2024, alleging that both AI models were trained without consent on music from the labels.[54]
  • In September 2024, the Regional Court of Hamburg dismissed a German photographer's lawsuit against the non-profit organization LAION for unauthorized reproduction of his copyrighted work while creating a dataset for AI training.[28] The decision was described as a "landmark ruling on TDM exceptions for AI training data" in Germany and the EU more generally.[28]

References

  1. ^ "Artificial Intelligence Copyright Challenges in US Courts Surge". www.natlawreview.com. Retrieved March 19, 2024.
  2. ^ "Primer: Training AI Models with Copyrighted Work". AAF. Retrieved March 19, 2024.
  3. ^ "What is the Copyright Status of AI Generated Works?". www.linkedin.com. Retrieved March 19, 2024.
  4. ^ a b c d e f g Zirpoli, Christopher T. (February 24, 2023). "Generative Artificial Intelligence and Copyright Law". Congressional Research Service.
  5. ^ a b Vincent, James (November 15, 2022). "The scary truth about AI copyright is nobody knows what will happen next". The Verge.
  6. ^ a b Guadamuz, Andres (October 2017). "Artificial intelligence and copyright". WIPO Magazine.
  7. ^ "Popular A.I. services for creating images are legal minefields for artists seeking payment for their work". Fortune. 2023. Retrieved June 21, 2023.
  8. ^ a b Pink-Howitt, Peter (2023). Copyright, AI And Generative Art. Ramparts.
  9. ^ Second Request for Reconsideration for Refusal to Register Théâtre D'opéra Spatial (SR # 1-11743923581; Correspondence ID: 1-5T5320R, 2023).
  10. ^ "Federal Register :: Request Access". unblock.federalregister.gov. Retrieved March 20, 2024.
  11. ^ Copyright Registration Guidance: Works Containing Material Generated by Artificial Intelligence, US Copyright Office 2023.
  12. ^ "'Thaler v. Perlmutter': AI Output is Not Copyrightable". New York Law Journal. Retrieved December 1, 2023.
  13. ^ a b Valinasab, Omid (2023). "Big Data Analytics to Automate Disclosure of Artificial Intelligence's Inventions" (PDF). University of San Francisco Intellectual Property and Technology Law Journal. 27 (2): 133–140 – via USF LJ.
  14. ^ "USPTO says AI models can't hold patents". February 14, 2024.
  15. ^ "New bill would force AI companies to reveal use of copyrighted art". amp.theguardian.com. Retrieved April 13, 2024.
  16. ^ Nova Production v MazoomaGames [2006] EWHC 24 (Ch).
  17. ^ The UK government's code of practice on copyright and AI. UK Government 2023.
  18. ^ Artificial Intelligence and Intellectual Property: copyright and patents: Government response to consultation. UK Government 2023.
  19. ^ https://www.govinfo.gov/content/pkg/FR-2023-08-30/pdf/2023-18624.pdf
  20. ^ Aaron Wininger (November 29, 2023). "Beijing Internet Court Recognizes Copyright in AI-Generated Images". The National Law Review.
  21. ^ Takyar, Akash (November 7, 2023). "Model validation techniques in machine learning". LeewayHertz - AI Development Company. Retrieved March 20, 2024.
  22. ^ Murray, Michael (2023). "Generative AI Art: Copyright Infringement and Fair Use". SMU Science and Technology Law Review. 26 (2): 259. doi:10.25172/smustlr.26.2.4.
  23. ^ Henderson, Peter; Li, Xuechen; Jurafsky, Dan; Hashimoto, Tatsunori; Lemley, Mark A.; Liang, Percy (2023). "Foundation Models and Fair Use". Journal of Machine Learning Research. 24 (400): 1–79. arXiv:2303.15715. Retrieved September 14, 2024.
  24. ^ a b Vincent, James (November 15, 2022). "The scary truth about AI copyright is nobody knows what will happen next". The Verge. Retrieved March 20, 2024.
  25. ^ Lee, Timothy B. (April 3, 2023). "Stable Diffusion copyright lawsuits could be a legal earthquake for AI". Ars Technica. Retrieved March 20, 2024.
  26. ^ Lee, Timothy B. (April 3, 2023). "Stable Diffusion copyright lawsuits could be a legal earthquake for AI". Ars Technica. Retrieved March 20, 2024.
  27. ^ Lemley, Mark A.; Casey, Bryan (2020). "Fair Learning". SSRN Electronic Journal. doi:10.2139/ssrn.3528447. ISSN 1556-5068.
  28. ^ a b c Goldstein, Paul; Stuetzle, Christiane; Bischoff, Susan (November 13, 2024). "Kneschke vs. LAION - Landmark Ruling on TDM exceptions for AI training data – Part 1". Kluwer Copyright Blog. Retrieved November 25, 2024.
  29. ^ a b Goldstein, Paul; Stuetzle, Christiane; Bischoff, Susan (November 14, 2024). "Kneschke vs. LAION - Landmark Ruling on TDM exceptions for AI training data – Part 2". Kluwer Copyright Blog. Retrieved November 25, 2024.
  30. ^ Mitsunaga, Takuho (October 9, 2023). "Heuristic Analysis for Security, Privacy and Bias of Text Generative AI: GhatGPT-3.5 case as of June 2023". 2023 IEEE International Conference on Computing (ICOCO). IEEE. pp. 301–305. doi:10.1109/icoco59262.2023.10397858. ISBN 979-8-3503-0268-4.
  31. ^ "Artificial Intelligence and Intellectual Property: copyright and patents: Government response to consultation". GOV.UK. Retrieved March 20, 2024.
  32. ^ a b Lee, Timothy B. (April 3, 2023). "Stable Diffusion copyright lawsuits could be a legal earthquake for AI". Ars Technica.
  33. ^ See for example OpenAI's comment in the year of GPT-2's release: OpenAI (2019). Comment Regarding Request for Comments on Intellectual Property Protection for Artificial Intelligence Innovation (PDF) (Report). United States Patent and Trademark Office. p. 9. PTO–C–2019–0038. Well-constructed AI systems generally do not regenerate, in any nontrivial portion, unaltered data from any particular work in their training corpus
  34. ^ Hans et al. 2024, §2.3
  35. ^ Peng, Zhencan; Wang, Zhizhi; Deng, Dong (June 13, 2023). "Near-Duplicate Sequence Search at Scale for Large Language Model Memorization Evaluation" (PDF). Proceedings of the ACM on Management of Data. 1 (2): 1–18. doi:10.1145/3589324. S2CID 259213212. Archived (PDF) from the original on August 27, 2024. Retrieved January 20, 2024. Citing Lee et al 2022.
  36. ^ Peng, Wang & Deng 2023, p. 8.
  37. ^ Hans, Abhimanyu; Wen, Yuxin; Jain, Neel; Kirchenbauer, John; Kazemi, Hamid; Singhania, Prajwal; Singh, Siddharth; Somepalli, Gowthami; Geiping, Jonas; Bhatele, Abhinav; Goldstein, Tom (June 14, 2024). "Be like a Goldfish, Don't Memorize! Mitigating Memorization in Generative LLMs". arXiv:2406.10209 [cs.CL], §1.
  38. ^ Hays, Kali (August 15, 2023). "ByteDance AI researchers say OpenAI now tries to hide that ChatGPT was trained on J.K. Rowling's copyrighted Harry Potter books". Business Insider. Retrieved September 15, 2024, citing Liu, Yang; Yao, Yuanshun; Ton, Jean-Francois; Zhang, Xiaoying; Guo, Ruocheng; Cheng, Hao; Klochkov, Yegor; Taufiq, Muhammad Faaiz; Li, Hang (August 10, 2023). "Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models' Alignment". arXiv:2308.05374v2 [cs.AI].
  39. ^ Vincent, James (November 8, 2022). "The lawsuit that could rewrite the rules of AI copyright". The Verge. Retrieved December 7, 2022.
  40. ^ Vincent, James (January 16, 2023). "AI art tools Stable Diffusion and Midjourney targeted with copyright lawsuit". The Verge.
  41. ^ a b Edwards, Benj (January 16, 2023). "Artists file class-action lawsuit against AI image generator companies". Ars Technica.
  42. ^ Brittain, Blake (July 19, 2023). "US judge finds flaws in artists' lawsuit against AI companies". Reuters. Retrieved August 6, 2023.
  43. ^ Cho, Winston (October 30, 2023). "Artists Lose First Round of Copyright Infringement Case Against AI Art Generators". Hollywood Reporter. Retrieved April 30, 2024.
  44. ^ Robertson, Adi (August 13, 2024). "Artists' lawsuit against Stability AI and Midjourney gets more punch". The Verge. Retrieved August 13, 2024.
  45. ^ Korn, Jennifer (January 17, 2023). "Getty Images suing the makers of popular AI art tool for allegedly stealing photos". CNN. Retrieved January 22, 2023.
  46. ^ "Getty Images Statement". newsroom.gettyimages.com/. January 17, 2023. Retrieved January 24, 2023.
  47. ^ Belanger, Ashley (February 6, 2023). "Getty sues Stability AI for copying 12M photos and imitating famous watermark". Ars Technica.
  48. ^ Ngila, Faustine (July 6, 2023). "The copyright battles against OpenAI have begun". Quartz. Retrieved November 25, 2024.
  49. ^ Kris, Jimmy (July 6, 2023). "OpenAI faces copyright lawsuit from authors Mona Awad and Paul Tremblay". DailyAi. Retrieved July 10, 2023.
  50. ^ Davis, Wes (July 9, 2023). "Sarah Silverman is suing OpenAI and Meta for copyright infringement". The Verge. Retrieved April 30, 2024.
  51. ^ David, Emilla (February 13, 2024). "Sarah Silverman's lawsuit against OpenAI partially dismissed". The Verge. Retrieved April 30, 2024.
  52. ^ O'Brien, Matt (January 10, 2024). "New York Times and authors on 'fair use' of copyrighted works". Associated Press News. Retrieved April 30, 2024.
  53. ^ Allyn, Bobby (April 30, 2024). "Eight newspapers sue OpenAI, Microsoft for copyright infringement". NPR. Retrieved April 30, 2024.
  54. ^ Sato, Mia (June 24, 2024). "Major record labels sue AI company behind 'BBL Drizzy'". The Verge. Retrieved June 24, 2024.