- cross-posted to:
- leftpiracy@lemmygrad.ml
- elp@lemmy.intai.tech
- libgen@lemmy.world
cross-posted from: https://lemmy.world/post/1330512
Below are direct quotes from the filings.
OpenAI
As noted in Paragraph 32, supra, the OpenAI Books2 dataset can be estimated to contain about 294,000 titles. The only “internet-based books corpora” that have ever offered that much material are notorious “shadow library” websites like Library Genesis (aka LibGen), Z-Library (aka B-ok), Sci-Hub, and Bibliotik. The books aggregated by these websites have also been available in bulk via torrent systems. These flagrantly illegal shadow libraries have long been of interest to the AI-training community: for instance, an AI training dataset published in December 2020 by EleutherAI called “Books3” includes a recreation of the Bibliotik collection and contains nearly 200,000 books. On information and belief, the OpenAI Books2 dataset includes books copied from these “shadow libraries,” because those are the most sources of trainable books most similar in nature and size to OpenAI’s description of Books2.
Meta
Bibliotik is one of a number of notorious “shadow library” websites that also includes Library Genesis (aka LibGen), Z-Library (aka B-ok), and Sci-Hub. The books and other materials aggregated by these websites have also been available in bulk via torrent systems. These shadow libraries have long been of interest to the AI-training community because of the large quantity of copyrighted material they host. For that reason, these shadow libraries are also flagrantly illegal.
This article from Ars Technica covers a few more details. The filings are viewable at the law firm’s site here.
I’m OK with PEOPLE reading books in any way for self-improvement.
But when a FUCKING COMPANY starts screwing with shit like this, that’s when they’ve crossed the line.
Sure, but you understand that publishers don’t give a fuck about any of that. They will shut these things down any way they can. Not to mention that the things on Sci-Hub and LibGen should be free public knowledge for anyone or anything that wants it. They’re full of tax-funded research papers and textbooks. That information should belong to everyone and everything. That’s not a crossed line. That’s consistency.
I agree with you, it should be free to every PERSON who wants it.
As I said before, that’s the fundamental difference between individuals and a company stealing.
We don’t agree. It’s not stealing, and companies should have access to the same free information.