Stack Overflow bans users en masse for rebelling against OpenAI partnership — users banned for deleting answers to prevent them being used to train ChatGPT

lemmyreader@lemmy.ml · 4 months ago

Matt The Horwood@lemmy.horwood.cloud · 4 months ago

Why delete the answer? Why not edit it so that a human can see the answer but for AI it’s a load of nonsense?

Matt The Horwood@lemmy.horwood.cloud · 4 months ago

So we need to upvote wrong answers only?

Skull giver@popplesburger.hilciferous.nl · 4 months ago

If that would happen, I assume companies would just grab an older copy of the dumps from before people started editing their stuff because of the AI bullshit.

SO would ban everyone sabotaging their business plans and things would move on like normal, like what happened to Reddit.

zovits@lemmy.world · 4 months ago

Editing any content to reduce its quality is considered vandalism and gets reverted on SO.

gjoel@programming.dev · 4 months ago

People did that. Stack Overflow reverted the changes.

chicken@lemmy.dbzer0.com · 4 months ago

There’s no way that would work either, they can just store the full edit history and auto-curate as needed.

delirious_owl@discuss.online · 4 months ago

Like AI doesn’t know how to use the Wayback Machine?

Scott@sh.itjust.works · 4 months ago

Based users

CaptObvious@literature.cafe · 4 months ago

Stack Overflow just earned a place under Reddit in the hosts block list.

stembolts@programming.dev · edit-2 4 months ago

Like when I heard reddit was doing the API lockdown, I wrote an automation bot over the weekend that self-destructed my subreddit and the entire post history.

The bot also automatically downloaded and archived all of the content on my local machine. At the time reddit had changed their API to only show the first X posts (100 or 1,000 or whatever), so as my bot deleted the most recent posts, reddit had no choice but to show me the old content.

And that’s how I archived my subreddit. Reddit banned me two days later for automation, lol. I did not break any of the reddit or reddit api ToS during this process but I guess I upset someone.
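A rough sketch of why deleting from the top works, using a plain list rather than any real reddit API. The `list_newest` function and the `LIMIT` value are made up for illustration; the only assumption carried over from the comment is that the listing exposes just the newest few posts.

```python
LIMIT = 3  # stand-in for the real cap (100, 1,000, or whatever it was)


def list_newest(posts: list[str], limit: int = LIMIT) -> list[str]:
    """Hypothetical listing call: returns only the newest `limit` posts."""
    return posts[:limit]


posts = [f"post-{n}" for n in range(10, 0, -1)]  # newest first: post-10 ... post-1
archive: list[str] = []

while posts:
    page = list_newest(posts)   # only the newest few are ever visible
    archive.extend(page)        # save a local copy first
    del posts[: len(page)]      # then delete them, exposing older posts
```

Each pass archives what the capped listing shows and then deletes it, so the window slides back through the history until nothing is left.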

ubergeek77@lemmy.ubergeek77.chat · 4 months ago

I don’t think I’ve been banned, but I did a similar thing. I requested all my data from Reddit, then used that list of comment/post IDs to mass-edit them. I think I’m in the clear because I used the official third party API, with an official “app.” If you used the private API or instrumented this via the browser, that may be why you were banned.

Anyway, if you or someone else wants their full history, Reddit will give it to you via a data export request.

GBU_28@lemm.ee · 4 months ago

Unfortunately they still have everything. It’s good for reducing the “human” visibility, but they still have the data.

stembolts@programming.dev · edit-2 4 months ago

Oh I know, I just wanted a copy too.

Deleting posts from the user PoV was the only way I could come up with to force the API to show them to me.

SuckMyWang@lemmy.world · 4 months ago

We can’t even communicate without being leeched upon. Fuck, this is grim.

jubilationtcornpone@sh.itjust.works · 4 months ago

Data Rule Numero Uno:

Garbage in, garbage out.

Have fun training your LLM on a big steaming pile of hot garbage. That’s 80% of Stack Overflow’s content.

mnemonicmonkeys@sh.itjust.works · 4 months ago

One time I went on there to figure out an issue in Arduino. The answer one guy gave was “I don’t know how to do this in Arduino, here’s how you do this in Java”. Not only did the mods prevent any other answers from being posted, I tried the guy’s suggestion in Java and it didn’t even work.

LostXOR@fedia.io · 4 months ago

The other 20% is mostly high quality however, and I’m sure they’d filter out the heavily downvoted crud.

mnemonicmonkeys@sh.itjust.works · 4 months ago

You say that as if the garbage gets downvoted

harrys_balzac@lemmy.dbzer0.com · 4 months ago

Mostly “this has been answered in another thread” and “why don’t you Google it” comments in my experience.

ddh@lemmy.sdf.org · 4 months ago

Can’t wait until the top answer to every Google search is “just google it”

baseless_discourse@mander.xyz · 4 months ago

This is a violation of GDPR, no?

TachyonTele@lemm.ee · 4 months ago

How so?

baseless_discourse@mander.xyz · 4 months ago

User should have the right to delete their data stored by the company.

flux@lemmy.ml · 4 months ago

Would that kind of provision allow me to have my code removed from a git repository history, if that git repository is hosted by a company?

interdimensionalmeme@lemmy.ml · 4 months ago

As long as you didn’t give those rights by signing a CLA or a copyleft license. Never sign a CLA unless you’re fully compensated.

baseless_discourse@mander.xyz · edit-2 4 months ago

I am not a lawyer, but I believe in general, yes.

Git is not even that convoluted, as all the history is stored in the .git folder within the repo. Unless there is some convoluted structure built on top, they would only need to move the repo folder to a trash disk, waiting to be formatted.

That being said, GDPR is somewhat poorly enforced at the moment, unfortunately. I don’t know if you can sue the company and expect some result within a couple of years.

refalo@programming.dev · 4 months ago

No because user generated content is not protected.

WldFyre@lemm.ee · 4 months ago

Doesn’t that just mean the data would have to be anonymized?

baseless_discourse@mander.xyz · 4 months ago

I am not an expert or a lawyer, but I believe users actually hold the right to completely erase personal data:

The data subject shall have the right to obtain from the controller the erasure of personal data concerning him or her without undue delay and the controller shall have the obligation to erase personal data without undue delay

https://gdpr.eu/right-to-be-forgotten/

Note the word “erasure” as opposed to “anonymize”

WldFyre@lemm.ee · 4 months ago

I don’t think that addresses my point. Is my opinion on the new Star Wars movies that I post online or some lines of code I suggest “personal data”? I thought personal data had a specific definition under GDPR

Spaenny@discuss.tchncs.de · 4 months ago

Technically, they could retain posts from users if they are irreversibly anonymized. However, ensuring with 100% certainty that none of your posts ever contained any personal data that could lead to the identification of you as an individual is challenging. The safest option is therefore to also delete your posts.

nefonous@lemmy.world · 4 months ago

You’re totally right, the content of your posts is not considered personal data (because it isn’t). It’s more about profiling data that can be connected back to your actual person.

baseless_discourse@mander.xyz · 4 months ago

I think you are right, user generated content doesn’t seem to be protected. This is surprising to me, as users should hold the right to their content, which in my mind should enjoy stronger protection than personal data.

lemmyreader@lemmy.ml · 4 months ago

Dunno. GDPR is a Europe-only thing, and isn’t it only related to how your private data (like name, IP address, phone number) is handled?

AccountMaker@slrpnk.net · 4 months ago

Right, I think it only covers personal information: companies can only collect what they need to run their service, users can request to see their data etc. I don’t think it applies to comments and posts.

Captain Beyond@linkage.ds8.zone · 4 months ago

I would certainly hope so. Stack Overflow content is Creative Commons licensed, so the argument is basically that the GDPR would take precedence over the CC license grant. It’d be scary if GDPR could be weaponized against forks of free software projects in this manner.

henfredemars@infosec.pub · 4 months ago

I feel like this content craze is going to evaporate soon because all the new content from here forward is sure to be polluted by LLM output already. AI is fast becoming a snake eating its own tail.

That reminds me. I should go update my licenses to spit in the face of AI training companies.

Captain Beyond@linkage.ds8.zone · edit-2 4 months ago

There is, I believe, a fundamental misunderstanding as to what exactly a site like Stack Overflow is. It’s not a forum; there’s no such thing as “your posts.” It’s more like Wikipedia, as in a collaborative question-and-answer site, or a knowledgebase. Each question and answer can be edited like a mini wiki page. They aren’t “yours” any more than the Wikipedia page you created ten years ago is; you contributed it to the commons, so (at least in theory) you don’t have the right to take it back.

Whether whatever "Open"AI is doing is right is another question, of course. But I don’t think destroying or poisoning the commons to strike back at it is helpful either; it feels like “destroying it to save it.”

tetris11@lemmy.ml · 4 months ago

Fine, but when coding projects undergo licensing changes that the contributors are against, the code author has to remove those contributions and replace them.

drunkpostdisaster@lemmy.world · 4 months ago

This shit scares me. It will become so easy to rewrite history from here. Just delete anything you don’t like and have an AI rewrite it into whatever you want. Entire threads rewritten; a company can go back and have your entire post history changed in ways that might be legally compromising.

darkphotonstudio@beehaw.org · 4 months ago

I think people would have fewer issues with AI training if it was non-profit and for the common good. And there are open source AI projects, many in fact. But yeah, these deals by companies like this are sleazy.

NeatNit@discuss.tchncs.de · 4 months ago

OpenAI was literally that until it wasn’t

darkphotonstudio@beehaw.org · 4 months ago

I don’t think OpenAI actually released any FOSS code, did they?

Skull giver@popplesburger.hilciferous.nl · 4 months ago

Up until GPT-3 they were quite open. When GPTs became good, they started claiming sharing the models would be risky and that there were ethical problems and that they would safekeep the technology. I believe they were even sued by one of their investors for not sticking to their open mission at some point.

The source code they would provide would be pretty useless to most people anyway, unless you have a couple million laying around to spend on GPUs.

Plenty of AI companies do what OpenAI did, without ever sharing any models or writing any papers. We only hear about the open stuff. We see tons of open source AI stuff on Github that’s all mostly based on research by either Google or OpenAI. All the Llama stuff exists only because Facebook shared their model (accidentally). All of this stuff is mostly open, even if it’s not FOSS.

Compare that to what companies are doing internally. You bet data brokers and other shady shits are sucking up as much data as they can get their hands on to train their own, specialised AI, free from the burdens of “as an LLM I can’t do that”.

delirious_owl@discuss.online · 4 months ago

This isn’t really comparable to reddit, since users can just send a request to SO for all the content. Reddit locking down the API meant we lost access to our content.

FenrirIII@lemmy.world · 4 months ago

If you get something for free, you are the product

Modern_medicine_isnt@lemmy.world · 4 months ago

So what is the stack overflow replacement?

Weslee@lemmy.world · 4 months ago

Maybe https://www.codidact.com/

katy ✨@lemmy.blahaj.zone · 4 months ago

that would be great if they federated and implemented activitypub/atproto!

katy ✨@lemmy.blahaj.zone · 4 months ago

let’s all go back to experts exchange

dukatos@lemm.ee · 4 months ago

Expert sex change?

davel [he/him]@lemmy.ml · 4 months ago

Good luck with the deleting. It often just means UPDATE comments SET is_deleted = 1 WHERE ID = 666;.
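To make that concrete, here is a minimal soft-delete sketch using Python's built-in sqlite3 module. The table and column names just mirror the one-liner above; they are assumptions for illustration, not any real site's schema.

```python
import sqlite3

# Hypothetical schema, kept in memory so nothing touches disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE comments ("
    "id INTEGER PRIMARY KEY, body TEXT, is_deleted INTEGER DEFAULT 0)"
)
conn.execute("INSERT INTO comments (id, body) VALUES (666, 'my answer')")

# "Deleting" only flips a flag; the row and its text stay in the table.
conn.execute("UPDATE comments SET is_deleted = 1 WHERE id = 666")

# What ordinary visitors would see: flagged rows are filtered out.
visible = conn.execute(
    "SELECT body FROM comments WHERE is_deleted = 0"
).fetchall()

# What the operator can still read straight from the backend.
stored = conn.execute(
    "SELECT body FROM comments WHERE id = 666"
).fetchall()
```

In this sketch `visible` comes back empty while `stored` still holds the original text, which is the whole point: the public view and the backend are two different things.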

chiisana@lemmy.chiisana.net · 4 months ago

There were similar things done on Reddit during the big exit. I doubt it achieved what people expected it to achieve. Even if they’re not visible externally, I’m sure they can easily access (thereby make deals to license) the data out of their backend / backup; just a matter of how hard they want to try (hint: it’s really not very hard).

duncesplayed@lemmy.one · 4 months ago

Yeah during the reddit exodus, people were recommending to overwrite your comment with garbage before deleting it. This (probably) forces them to restore your comment from backup. But realistically they were always going to harvest the comments stored in backup anyway, so I don’t think it caused them any more work.

If anything, this probably just makes reddit’s/SO’s partnership more valuable because your comments are now exclusive to reddit’s/SO’s backend, and other companies can’t scrape it.

Lemongrab@lemmy.one · 4 months ago

It was to make the data inaccessible to general people, therefore removing the reason people visit reddit. Even if reddit could still get the data, regular people would be inconvenienced (in theory) and look somewhere else.

plz1@lemmy.world · 4 months ago

They are not deleting, they are editing. So the platform would have to undo those edits rather than just flipping the visibility flag.
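Undoing an edit is not much harder if the platform keeps every revision, though. A minimal sketch, assuming an append-only revision list (the `Post` class and its methods are hypothetical, not Stack Overflow's actual data model):

```python
from dataclasses import dataclass, field


@dataclass
class Post:
    # Hypothetical model: every edit is appended, nothing is overwritten.
    revisions: list[str] = field(default_factory=list)

    def edit(self, body: str) -> None:
        self.revisions.append(body)

    @property
    def current(self) -> str:
        return self.revisions[-1]

    def rollback(self, index: int) -> None:
        # A rollback is itself a new revision that copies an older one.
        self.revisions.append(self.revisions[index])


post = Post()
post.edit("Use a context manager to close the file.")
post.edit("asdf qwerty")  # protest edit replaces the visible text
post.rollback(0)          # a moderator restores the first revision
```

After the rollback the visible text is the original answer again, and the protest edit just sits in the history as one more revision.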

paraphrand@lemmy.world · 4 months ago

And they are. 😞

Sibbo@sopuli.xyz · 4 months ago

Does GDPR apply to stackoverflow? Since my data there probably does not identify me as a person?

delirious_owl@discuss.online · 4 months ago

You can delete your data but I don’t think it magically makes derivative works disappear. It’s licensed SA. This is good.