Zuckerberg approved use of pirated content to train Meta's AI model, amended authors' lawsuit claims

Meta has faced a copyright infringement lawsuit since 2023, but a new filing made public on Wednesday (January 8) contains claims that could put the company and its CEO, Mark Zuckerberg, in an uncomfortable spotlight. The plaintiffs allege that Meta used pirated content (including copyrighted books and articles) to train its Llama AI models, with Zuckerberg’s approval.

This row started back in 2023 when a group of authors sued the social media giant for using their copyrighted books and articles to train its large language model, Llama, without their consent. Ta-Nehisi Coates (author and journalist), Sarah Silverman (comedian and actress), and other authors are among the plaintiffs.

However, in November 2023, U.S. District Judge Vince Chhabria dismissed parts of the AI copyright lawsuit against Meta. The court rejected the arguments that text generated by Meta’s chatbots infringed the authors’ copyrights and that Meta unlawfully stripped their books’ copyright management information (CMI).

In a recent development, the authors asked the U.S. District Court for the Northern District of California for permission to file an updated complaint. In this filing, the authors claim that internal documents Meta produced during discovery show the company was aware that the content used for AI training was pirated.

Additionally, they say new evidence indicates that Meta used a dataset called LibGen, which is believed to contain millions of pirated works. They also accuse Meta of distributing this dataset through peer-to-peer torrents, a method of sharing files directly between users without a central server.

In fact, the plaintiffs – citing Meta’s internal communications – now claim that Mark Zuckerberg was fully aware of the situation and personally approved the use of the LibGen dataset, despite knowing that it contains pirated content. LibGen, which describes itself as a ‘links aggregator,’ primarily offers access to copyrighted materials from major publishers like Macmillan Learning, McGraw Hill, and Cengage Learning. It has a long history of facing lawsuits and fines for copyright infringement.

In this new filing, submitted on Wednesday, another significant accusation is highlighted. It suggests that Meta may have tried to hide its alleged copyright infringement by removing credit or attribution from the LibGen data it used.

While Meta has consistently denied any unlawful use of content to train its large language model, the company has not yet issued an official statement on the authors’ new filing. Recently, the social media powerhouse – and owner of Facebook, Instagram, WhatsApp, and Threads – found itself in hot water following changes to its content moderation policies. Under the recent policy update, the Mark Zuckerberg-led company has decided to ‘get rid of fact-checkers’ and replace them with community notes, similar to X.

The Tech Portal is published by Blue Box Media Private Limited. Our investors have no influence over our reporting.

Ashutosh Singh

Ashutosh is a Senior Writer at The Tech Portal, largely reporting on new tech and the intersection of technology and business. Ashutosh’s career spans nearly a decade of technology writing across multiple platforms and languages.