[Discussion] AI companies using the fediverse to scrape data

Nicarlo@sh.itjust.works · edit-2 2 hours ago

weker01@sh.itjust.works · 54 minutes ago

I don’t care tbh. I am writing everything here as if everyone at any time could read it.

mindbleach@sh.itjust.works · 3 hours ago

Folks, if you can see it, they can see it.

I don’t give a shit if the robot scrapes every book ever sold. I am not about to get worked-up over the copyright claims of pseudonymous randos’ disposable internet commentary.

MachineFab812@discuss.tchncs.de · edit-2 1 hour ago

Scrape*, for your title.

Meanwhile, preventing unpaid scraping was a big part of Reddit’s rationale for their enshittification, i.e., charging for API access.

I would rather train an AI indirectly for free than ask random Instances to run interference, which IRL works out to be pay-walling and selling user content.

By asking Lemmy Instances to “prevent AI from seeing my content”, all you are really asking them to do is to slap a price-tag on it, and hire lawyers to pursue companies/users that don’t pay. Not pay you or me, but them.

degen@midwest.social · 2 hours ago

Yeah, I’m more worried about the output of AI getting involved than anything regarding the input, at least as far as public forums go.

Zachariah@lemmy.world · 5 hours ago

typos are mportant to undermine the scrapping

AbouBenAdhem@lemmy.world · edit-2 4 hours ago

My main issue with the Reddit deal (and similar data grabs) is that AI companies are hoarding user-generated content to give themselves a competitive advantage. I have less of an issue with them using non-exclusive public content like Wikipedia, fediverse comments, and public-domain historical works.

redrum@lemmy.ml · edit-2 4 hours ago

Server admins could add to the instance policy that any AI scraping requires the prior permission of the copyright holders of the content (i.e., the users) when the scraping is done to exploit the data for greed. Also, robots.txt could be used to forbid AI crawlers from scraping the HTML.
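To make the robots.txt idea concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the crawler names chosen (GPTBot, CCBot), and the `example.social` URL are illustrative assumptions, not any instance's actual configuration. Note that robots.txt is only a request: a crawler that ignores it is not blocked by it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt an instance might serve. The crawler names are
# examples only; honoring these rules is voluntary on the crawler's side.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.social/post/1"  # placeholder URL
    for agent in ("GPTBot", "CCBot", "Mozilla"):
        print(agent, is_allowed(agent, url))
```

This only shows what a well-behaved crawler would conclude from the file; enforcement would still need something server-side.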

I don’t think restrictions should be added at the protocol level, but maybe some declarative tags would be fine:

{
"rich": "eat",
"about-meta": "fck-genocidal-and-youth-suicidal-promoter-zuckenberg",
"ai": "not-for-greed"
}

Nicarlo@sh.itjust.works · 1 hour ago

I think this would be the only way. It would be interesting to know how much traffic or how many requests this instance gets, to see if it’s a real problem. Server admins could implement stricter rate limiting for non-members if it becomes an issue. They could likely even implement something that lets them see which of their members are making the most requests, to have some visibility. I don’t believe this is possible today from within the platform anyway.
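As a rough illustration of stricter rate limiting for non-members plus some visibility into who sends the most requests, here is a token-bucket sketch. Everything in it is an assumption for illustration: the `TieredRateLimiter` name, the limits, and the idea of keying clients by account name or IP. It is not how Lemmy works today.

```python
import time
from collections import Counter

# Hypothetical limits per tier: (burst size, tokens refilled per second).
LIMITS = {"member": (60, 1.0), "anonymous": (10, 0.5)}


class TieredRateLimiter:
    """Token bucket per client, with a smaller bucket for non-members."""

    def __init__(self, limits=LIMITS, clock=time.monotonic):
        self.limits = limits
        self.clock = clock              # injectable so it can be tested
        self.buckets = {}               # client key -> (tokens, last seen)
        self.request_counts = Counter() # every attempt, allowed or not

    def allow(self, client_key: str, is_member: bool) -> bool:
        capacity, rate = self.limits["member" if is_member else "anonymous"]
        now = self.clock()
        tokens, last = self.buckets.get(client_key, (capacity, now))
        # Refill according to elapsed time, never above the bucket size.
        tokens = min(capacity, tokens + (now - last) * rate)
        self.request_counts[client_key] += 1
        allowed = tokens >= 1
        self.buckets[client_key] = (tokens - 1 if allowed else tokens, now)
        return allowed

    def top_clients(self, n: int = 10):
        """Return the n clients with the most requests, busiest first."""
        return self.request_counts.most_common(n)
```

An admin could call `allow(ip_or_username, is_member)` on each request and check `top_clients()` now and then to see whether a handful of clients account for most of the load.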

There are really two issues here:

Whether users are aware of, and OK with, the fact that their public conversations will almost certainly be picked up and used for future models.
Whether the Lemmy instance admins are OK with potentially half of their traffic going to bots that hoard and scrape the data, causing additional load on the servers.

Maybe @TheDude@sh.itjust.works would be open to sharing some insights into how many requests the instance receives per month and how many resources that takes.