AI Loophole #1; Your GitHub README.md

Elias Griffin@lemmy.world · 5 months ago

bamboo · 5 months ago

How would an LLM answering questions about a git repo be legally different from a person answering those same questions (think Stack Overflow)? Specific to this case, US courts have allowed the reuse of “APIs”: in Google v. Oracle, Google reimplemented Java using the same APIs but its own implementation code, and the Supreme Court ruled that this was fair use, without deciding whether the APIs themselves are copyrightable.

Regarding “replace”, the primary use of the git repo is the code itself, not the Q&A about how to use it. The LLM doesn’t generate code that fully replaces that library or program, or if it does, it is distinct enough to be a different work.

AlexanderESmith@social.alexanderesmith.com · 5 months ago

First, a chat bot is not an API. Second, they were talking about the formatting and delivery method of the data, not the content.

Regarding the output of the model: Some repos are entirely READMEs by their nature. No code, just documentation and walkthroughs. Notwithstanding that, if I set a flag that says “don’t use my data” and they use it anyway, that’s theft, even if it’s only one file, even if the file is just a description of the code. That’s my work, not yours. You don’t get to use it however you want, unless I specifically note that it’s public domain (or you use it and follow the license, like attributing me, or linking to the repo, etc).

As to the difference between a bot and a human (re: stack overflow)? The former is a representative of a company (automation or not, whether it’s a bot or a page on their corporate site), the latter is a person relating experience and opinion. The legal difference is that one is using the data commercially, and the other is just a person in the world, answering another person’s question for no reason other than a desire to be helpful (and if they’re decent, attributing the source instead of claiming that they’re generating wisdom on their own).

That last parenthetical used to be called plagiarism, by the way.

bamboo · 5 months ago

“Regarding the output of the model: Some repos are entirely READMEs by their nature. No code, just documentation and walkthroughs.”

So fun fact, did you know that facts aren’t copyrightable? The specific wording used to convey them is copyrightable, but the content is not! Applied to documentation, that means the only part of your documentation that you own is the specific wording you chose to convey those facts, to the extent that the wording constitutes a creative work. If I re-word your docs, I can republish them without any copyright concerns. The same applies to an LLM. Legally anyone (including a company or LLM) can re-use the facts, they just have to make sure they express them in different words. Even then, they can re-use some of your words so long as the wording couldn’t be considered a creative work.

“The legal difference is that one is using the data commercially, and the other is just a person in the world, answering another person’s question for no reason other than a desire to be helpful”

Section 270 notwithstanding, the only detail potentially relevant here is commercialization. In the US, corporations are people, and the law generally does not make something legal for an individual but illegal for a corporation. But even then, not all LLM providers are commercial. Many release their models freely under open source licenses, or otherwise give them away for free. You can’t argue commercial use if it is a research-focused non-profit licensing its models under Apache-2.0. And commercial use being a threshold for violation likely only matters if you can prove it caused you financial loss. Good luck proving that when you’re already putting the information freely on the internet.