AI Loophole #1; Your GitHub README.md

Elias Griffin@lemmy.world · edit-2 5 months ago

AlexanderESmith@social.alexanderesmith.com · edit-2 5 months ago

you got some criticism and now you’re saying everyone else is a bot or has an agenda

Please look up ad hominem, and stop doing it. Yes, their responses are a distraction from the topic at hand, but so were the random posts calling OP paranoid. I’d have been on the defensive too.

[Our company] publish[s] open source work … anyone is free to use it for any purpose, AI training included

Great, I hope this makes the models better. But you made that decision. OP clearly didn’t. In fact, they attempted to use several methods to explicitly block it, and the model trainers did it anyway.

I think that the anti-AI hysteria is stupid virtue signaling for luddites

Many loudly outspoken figures against the use of stolen data for the training of generative models work in the tech industry, myself included (I’ve been in the industry for over two decades). We’re far from Luddites.

LLMs are here

I’ve heard this used as a justification for using them, and reasonable people can discuss the merits of the technology in various contexts. However, this is not a justification for defending the blatant theft of content to train the models.

whether or not they train on your random project isn’t going to affect them in any meaningful way

And yet, they did it while ignoring explicit instructions to the contrary.

there are more than enough fully open source works to train on

I agree, and model trainers should use that content, instead of whatever they happen to grab off every site they happen to scrape.

Better to have your work included so that the LLM can recommend it to people or answer questions about it

I agree if you give permission for model trainers to do so. That’s not what happened here.

bamboo · 5 months ago

Why do you think they need your permission to use information you posted publicly to train their models? Copyright isn’t unlimited, and model training is probably fair use.

AlexanderESmith@social.alexanderesmith.com · 5 months ago

“Your honor, we can use whatever data we want because model training is probably fair use, or whatever”.

I don’t know what’s worse, the fact that you think creators don’t have the right to dictate how their works are used, or that you apparently have no idea what fair use is.

This might help: https://copyright.gov/fair-use/

bamboo · edit-2 5 months ago

I mean, this is how courts work. Someone will sue because a work they hold copyright to was used in a training set without their authorization, the defendant will claim it was fair use, the judge will pick a side. To the best of my knowledge this hasn’t happened just yet, and since I’m not a judge, I use “probably”. Fair use is both vague and broad, and this is important to ensure copyright holders don’t have complete control over their work. It was recognized a long time ago that you can make works that utilize another copyrighted work, but don’t functionally replace the original work, and are therefore fair use. The whole point was to try and foster innovation, not to allow copyright holders to dictate how their works are used, and fair use is an essential part of that.

Training an LLM with a work doesn’t functionally replace that work. If there is a filter that prevents 1:1 reproduction, then it literally cannot. It also provides significant benefit to have these LLMs, they are a unique and valuable work themselves. That’s why it’s fair use.

AlexanderESmith@social.alexanderesmith.com · 5 months ago

Agreed on all points, except my personal interpretation of “fair use” specific to the case of generative models.

You call out “doesn’t replace the original work”. Is that not how you see an LLM Q/A bot replacing a user going to a git repo for established examples, or a website for an article (generating page views, subscriptions, ad revenue), or similar? Why would anyone go to the source materials if they’re getting their answer from the bot?

This is practically the same as when Google started showing articles in AMP, and not bringing people to the original website, is it not?

bamboo · 5 months ago

How would an LLM answering questions about a git repo be legally different from a person answering those same questions (think stackoverflow)? Specific to this case, US law does not consider “APIs” to be copyrightable (Oracle v Google, Google reimplemented Java using the same APIs but their own implementation code, court ruled that Oracle couldn’t copyright the APIs).

Regarding “replace”, the primary use of the git repo is the code itself, not the Q&A about how to use it. The LLM doesn’t generate code that fully replaces that library or program, or if it does, it is distinct enough to be a different work.

AlexanderESmith@social.alexanderesmith.com · 5 months ago

First, a chat bot is not an API. Second, they were talking about the formatting and delivery method of the data, not the content.

Regarding the output of the model: Some repos are entirely READMEs by their nature. No code, just documentation and walkthroughs. Notwithstanding that, if I set a flag that says “don’t use my data” and they use it anyway, that’s theft, even if it’s only one file, even if the file is just a description of the code. That’s my work, not yours. You don’t get to use it however you want, unless I specifically note that it’s public domain (or you use it and follow the license, like attributing me, or linking to the repo, etc).

As to the difference between a bot and a human (re: stack overflow)? The former is a representative of a company (automation or not, whether it’s a bot or a page on their corporate site), the latter is a person relating experience and opinion. The legal difference is that one is using the data commercially, and the other is just a person in the world, answering another person’s question for no reason other than a desire to be helpful (and if they’re decent, attributing the source instead of claiming that they’re generating wisdom on their own).

That last parenthetical used to be called plagiarism, by the way.

bamboo · 5 months ago

Regarding the output of the model: Some repos are entirely READMEs by their nature. No code, just documentation and walkthroughs.

So fun fact, did you know that facts aren’t copyrightable? The specific wording used to convey them is copyrightable, but the content is not! Applied to documentation, that means that the only part of your documentation that you own is the specific wording you chose to convey those facts, to the extent that the wording constitutes a creative work. If I re-word your docs, I can republish them without any copyright concerns. The same applies to an LLM. Legally anyone (including a company or LLM) can re-use the facts, they just have to make sure they express them in different words. Even then, they can re-use some of your words so long as the wording couldn’t be considered a creative work.

The legal difference is that one is using the data commercially, and the other is just a person in the world, answering another person’s question for no reason other than a desire to be helpful

Section 270 notwithstanding, the only detail potentially relevant here is commercialization. In the US, corporations are people, and the law generally does not make something legal for an individual but illegal for a corporation. But even then, not all LLM providers are commercial. Many release their models freely under open source licenses, or otherwise give them away for free. Can’t argue commercial use if it is a research-focused non-profit licensing their models under Apache-2. But even then, commercial use being a threshold for violation likely only matters if you can prove it caused you financial loss. Good luck proving that when you’re already putting the information freely on the internet.

Victoria Antoinette @lemmy.world · 5 months ago

authors should have no say in how published works are used.

catloaf · 5 months ago

Authors shouldn’t be paid for their labor?

Victoria Antoinette @lemmy.world · 5 months ago

I didn’t say that. you’re making a leap of logic

catloaf · 5 months ago

Yes, I am. Logically, if an author creates something and cannot control its distribution, it is available to everyone at no cost, therefore the author will never see a dime for their labor.

This discounts the donation model, because in practice, it rarely pays the bills. It also ignores patronage, because I doubt that you want the creation of art to be dependent on the generosity of the rich.

Thus, it makes sense for the author to maintain certain rights over the product of their labor. They provide the work under their terms, e.g. requiring payment for a copy, and that relatively low cost to the average Joe provides the money they need to buy food, pay rent, etc.

Victoria Antoinette @lemmy.world · 5 months ago

you recognize two well known cases where copyright is not necessary to get paid. I don’t think there is even an argument at this point. have a nice day.

catloaf · 5 months ago

Yes, and I said they’re not feasible, because they’ve been tried in the past and present and found to not work very well. If you disagree, I’m happy to hear your thoughts.

AlexanderESmith@social.alexanderesmith.com · 5 months ago

I already replied to the essence of this in my reply to your other post about how “illegal downloads aren’t theft because its a copy”, but I’ll mention here that this is even more evidence that you aren’t a creator, and I suggest that your opinions on this subject aren’t relevant, and you should avoid subjecting other people to them.

Victoria Antoinette @lemmy.world · 5 months ago

your attacks on my identity don’t undercut my claims at all.

AlexanderESmith@social.alexanderesmith.com · 5 months ago

“evidence suggests that you probably aren’t a creator” “As a result, I suggest that your opinions aren’t relevant”

Aside from the fact that these are not character attacks, I encourage you to refute my assumptions. Otherwise, my points will stand on their own.

Victoria Antoinette @lemmy.world · 5 months ago

on the internet, no one knows you’re a dog. whether I have or not, saying so doesn’t prove it. what I said stands on its own merits and your inability to make an argument without attacking identity speaks to the strength of your argument, your understanding of the subject, and your ability (or willingness) to engage in good faith.