Artificial intelligence and copyright

From Wikipedia, the free encyclopedia

This is an old revision of this page, as edited by Rolf h nelson (talk | contribs) at 06:26, 21 June 2023 (References). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

In the 2020s, the rapid increase in the capabilities of deep learning-based generative artificial intelligence models, including text-to-image models such as Stable Diffusion and large language models such as ChatGPT, raised questions about how copyright law applies to the training and use of such models. Because there is limited existing case law, experts consider this area to be fraught with uncertainty.[1]

The central issue is whether infringement occurs when a generative AI model is trained or used.[1] Popular deep learning models are generally trained on very large datasets of media scraped from the Internet, much of which is copyrighted. Because assembling training data involves making copies of copyrighted works, the process may violate the copyright holder's exclusive right to control the reproduction of their work, unless the use is covered by exceptions under the relevant jurisdiction's copyright statute. Additionally, a model's outputs could themselves be infringing, and the model's creator may be accused of "vicarious liability" for such infringement. As of 2023, a number of pending US lawsuits challenge the use of copyrighted data to train AI models, with defendants arguing that this falls under fair use.

Another issue is that, in jurisdictions such as the US, output generated solely by a machine is ineligible for copyright protection, as most jurisdictions protect only "original" works having a human author. However, some have argued that the operator of an AI may qualify for copyright if they exercise sufficient originality in their use of an AI model.

The United States Copyright Office has declared that works not created by a human author, such as this "selfie" portrait taken by a monkey, are not eligible for copyright protection.

Most legal jurisdictions grant copyright only to original works of authorship by human authors.[2] In the US, the Copyright Act protects "original works of authorship".[3] The U.S. Copyright Office has interpreted this as being limited to works "created by a human being",[3] declining to grant copyright to works generated solely by a machine.[1]

Some have suggested that certain AI generations might be copyrightable in the US and similar jurisdictions if it can be shown that the human who ran the AI program exercised sufficient originality in selecting the inputs to the AI or editing the AI's output.[1][3] Proponents of this view suggest that an AI model may be viewed as merely a tool (akin to a pen or a camera) used by its human operator to express their creative vision.[3][2]

As AI is increasingly used to generate literature, music, and other forms of art, the US Copyright Office has released new guidance emphasizing that copyrightability turns on whether a work, including material generated with artificial intelligence, is the product of 'mechanical reproduction' or the 'manifestation of the author's own creative conception'.[4]

Some jurisdictions include explicit statutory language related to computer-generated works, including the United Kingdom's Copyright, Designs and Patents Act 1988, which states:

In the case of a literary, dramatic, musical or artistic work which is computer-generated, the author shall be taken to be the person by whom the arrangements necessary for the creation of the work are undertaken.[2]

However, such language is ambiguous as to whether it refers to the programmer who trained the model, or the user who operated the model to generate a particular output.[2]

Training on copyrighted data

Popular deep learning models are generally trained on very large datasets of media (such as publicly available images and the text of web pages) scraped from the Internet, much of which is copyrighted. Because assembling these training datasets involves making copies of copyrighted works, this has raised the question of whether the process infringes the copyright holders' exclusive right to make reproductions of their works.[3] Machine learning developers in the US have traditionally presumed this to be allowable under fair use, because the use of the copyrighted work is transformative and limited in scope.[3][1] The situation has been compared to Google Books' scanning of copyrighted books in Authors Guild, Inc. v. Google, Inc., which was ultimately found to be fair use, because the scanned content was not made publicly available and the use was non-expressive.[5]
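The copying at issue can be made concrete with a short sketch. The following is a hypothetical illustration, not any developer's actual pipeline; the names `build_corpus` and `fetch` are invented for this example. The point is simply that every scraped work ends up written to local storage, i.e. reproduced, before training even begins:

```python
from pathlib import Path

def build_corpus(urls, fetch, dest="training_data"):
    """Hypothetical dataset-assembly sketch.

    `fetch` is any callable mapping a URL to the raw bytes of the work
    at that address (e.g. an HTTP client). Each fetched work is written
    to disk, so the assembled corpus consists of literal copies of the
    scraped works -- the step implicating the reproduction right.
    """
    out = Path(dest)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, url in enumerate(urls):
        data = fetch(url)                   # transient in-memory copy
        path = out / f"item_{i:06d}.bin"
        path.write_bytes(data)              # persistent on-disk copy
        paths.append(path)
    return paths
```

Whether these copies are ultimately lawful then depends on doctrines such as fair use or statutory text-and-data-mining exceptions, not on any technical property of the pipeline itself.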

As of 2023, there were a number of US lawsuits disputing this, arguing that the training of machine learning models infringed the copyright of the authors of works contained in the training data.[3] Timothy B. Lee, in Ars Technica, argues that if the plaintiffs succeed, this may shift the balance of power in favour of large corporations such as Google, Microsoft and Meta which can afford to license large amounts of training data from copyright holders and leverage their own proprietary datasets of user-generated data.[5]

A number of jurisdictions, including the United Kingdom, Germany, Japan, and the EU, have incorporated explicit exceptions for "text and data mining" (TDM) into their copyright statutes.[6] Unlike the EU's, the United Kingdom's exception does not extend to data mining for commercial purposes. As of June 2023, a clause in the draft EU AI Act would require providers of generative AI to "make available summaries of the copyrighted material that was used to train their systems".[7]

A photograph of Anne Graham Lotz included in Stable Diffusion's training set
An image generated by Stable Diffusion using the prompt "Anne Graham Lotz"
In rare cases, generative AI models may produce outputs that are virtually identical to images from their training set. The research paper from which this example was taken was able to produce similar replications for only 0.03% of training images.[5]
An image generated by Stable Diffusion using the prompt "an astronaut riding a horse, by Picasso". Generative image models are adept at imitating the visual style of particular artists in their training set.

In some cases, deep learning models may "memorize" the details of particular items in their training set and reproduce them at generation time, such that their outputs may constitute copyright infringement. AI developers generally regard this behaviour as undesirable, treating it as a form of overfitting, and there is disagreement about how prevalent it is in modern systems. OpenAI has argued that "well-constructed AI systems generally do not regenerate, in any nontrivial portion, unaltered data from any particular work in their training corpus".[3] Under US law, to prove that an AI output infringes a copyright, a plaintiff must show that the copyrighted work was "actually copied", meaning that the AI had access to their work and generates output that is "substantially similar" to it.[3]
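As a rough illustration of how verbatim memorization might be flagged in text, one can check whether a model output shares any long run of consecutive words with a training document. This is a hypothetical sketch, not an established legal or technical test: the function names and the 8-word threshold are invented, and real analyses (such as the image-replication study cited above) use far more sophisticated similarity measures.

```python
def ngrams(text, n):
    """Return the set of all n-word sequences in `text`."""
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def shares_long_ngram(output, training_doc, n=8):
    """True if `output` reproduces any run of `n` consecutive words
    from `training_doc` verbatim. The threshold `n` is arbitrary."""
    return bool(ngrams(output, n) & ngrams(training_doc, n))
```

Even where such a check fires, "substantial similarity" under US law is a legal judgment made by courts, not a mechanical string comparison.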

Since fictional characters enjoy some copyright protection in the US and other jurisdictions, an AI may also produce infringing content in the form of novel works which incorporate fictional characters.[3][5]

In the course of learning to statistically model the data on which they are trained, deep generative AI models may learn to imitate the distinct style of particular authors in the training set. For example, a generative image model such as Stable Diffusion is able to model the stylistic characteristics of an artist like Pablo Picasso (including his particular brush strokes, use of colour, perspective, and so on), and a user can engineer a prompt such as "an astronaut riding a horse, by Picasso" to cause the model to generate a novel image applying the artist's style to an arbitrary subject. However, an artist's overall style is generally not subject to copyright protection.[3]

Litigation

  • A November 2022 class action lawsuit against Microsoft, GitHub and OpenAI alleged that GitHub Copilot, an AI-powered code editing tool trained on public GitHub repositories, violated the copyright of the repositories' authors, noting that the tool was able to generate source code which matched its training data verbatim, without providing attribution.[8]
  • In January 2023 three artists — Sarah Andersen, Kelly McKernan, and Karla Ortiz — filed a class action copyright infringement lawsuit against Stability AI, Midjourney, and DeviantArt, claiming that these companies have infringed the rights of millions of artists by training AI tools on five billion images scraped from the web without the consent of the original artists.[9] The plaintiffs' complaint has been criticized for technical inaccuracies, such as incorrectly claiming that "a trained diffusion model can produce a copy of any of its Training Images", and describing Stable Diffusion as "merely a complex collage tool".[10] In addition to copyright infringement, the plaintiffs allege unlawful competition and violation of their right of publicity in relation to AI tools' ability to create works in the style of the plaintiffs en masse.[10]
  • In January 2023, Stability AI was sued in London by Getty Images for using its images in their training data without purchasing a license.[11][12]
  • Getty filed another suit against Stability AI in a US district court in Delaware in February 2023. The suit again alleges copyright infringement for the use of Getty's images in the training of Stable Diffusion, and further argues that the model infringes Getty's trademark by generating images with Getty's watermark.[13]

References

  1. ^ a b c d e Vincent, James (15 November 2022). "The scary truth about AI copyright is nobody knows what will happen next". The Verge.
  2. ^ a b c d Guadamuz, Andres (October 2017). "Artificial intelligence and copyright". WIPO Magazine.
  3. ^ a b c d e f g h i j k Zirpoli, Christopher T. (24 February 2023). "Generative Artificial Intelligence and Copyright Law". Congressional Research Service.
  4. ^ Yurkevich, Vanessa (18 April 2023). "Universal Music Group calls AI music a 'fraud,' wants it banned from streaming platforms. Experts say it's not that easy". CNN.
  5. ^ a b c d Lee, Timothy B. (3 April 2023). "Stable Diffusion copyright lawsuits could be a legal earthquake for AI". Ars Technica.
  6. ^ Lee, Jyh-An; Hilty, Reto; Liu, Kung-Chung, eds. (2021). Artificial Intelligence and Intellectual Property. Oxford University Press. doi:10.1093/oso/9780198870944.001.0001. ISBN 978-0-19-887094-4.
  7. ^ Rozen, Miriam (14 June 2023). "Lawyers keep an eye on copyright risk with generative AI". Financial Times. Retrieved 21 June 2023.
  8. ^ Vincent, James (8 November 2022). "The lawsuit that could rewrite the rules of AI copyright". The Verge. Retrieved 7 December 2022.
  9. ^ Vincent, James (16 January 2023). "AI art tools Stable Diffusion and Midjourney targeted with copyright lawsuit". The Verge.
  10. ^ a b Edwards, Benj (16 January 2023). "Artists file class-action lawsuit against AI image generator companies". Ars Technica.
  11. ^ Korn, Jennifer (17 January 2023). "Getty Images suing the makers of popular AI art tool for allegedly stealing photos". CNN. Retrieved 22 January 2023.
  12. ^ "Getty Images Statement". Getty Images. 17 January 2023. Retrieved 24 January 2023.
  13. ^ Belanger, Ashley (6 February 2023). "Getty sues Stability AI for copying 12M photos and imitating famous watermark". Ars Technica.