The problem isn't that Stack Overflow is allowing people to scrape the content. The problem is that Stack Overflow is preventing some people from scraping the content, in order to collect money from others. And, incidentally, passing zero of that money on to the people who actually created the content.
(Nearly) none of the people who are presently pissed off would have complained if Stack Overflow had continued to allow all comers to scrape the content and train LLMs on it, nor if Stack Overflow had released the entire finished collection of content under the same CC-BY-SA license that was demanded of each contributor.
With the OpenAI partnership, and similar shenanigans leading up to it, Stack Overflow is relying on obscure technicalities to violate the essential spirit of the original deal.
The publicly-available archives released by Stack Exchange are updated roughly quarterly and have the attribution requirements as specified by CC BY-SA + the Stack Exchange ToS.
The article makes it sound like OpenAI is using the API though, rather than the archives. The API and live sites forbid scraping within the acceptable use policy, as seen here: https://stackoverflow.com/legal/acceptable-use-policy
I dont get how you can release something under anything other than all rights reserved without identification. We need to be able to persecute you in case you are not the author. Or is it that i may republish anything under any license?? It could be that the platform licences it in the toss but with cc are they not obligated to make it available without obstructie?
Prosecution and persecution are two different things. Persecuting anyone is not a good time :)
Why, if you're not allowed to release under a license, should you be able to release all rights reserved (which can still be a copyright violation!)?
If you need to prosecute the person, there are established procedures for that: DMCA, or ultimately a lawsuit over the infringement. That you didn't identify yourself publicly on the site does not make that impossible. In fact the point of the DMCA was to make it easier to handle this - because if the provider doesn't comply with your DMCA, you can sue the provider.
Requiring indentification to publish so that copyright is protected would be massive overreach and this sort of thinking is why I think copyright is a dangerous concept that needs to be sharply curtailed, not expanded to cover AI training.
In practice, the safest course is to not use content from untrustworthy sources in ways that require a license (aka in ways that are not fair use in your applicable jurisdictions).
(Nearly) none of the people who are presently pissed off would have complained if Stack Overflow had continued to allow all comers to scrape the content and train LLMs on it, nor if Stack Overflow had released the entire finished collection of content under the same CC-BY-SA license that was demanded of each contributor.
With the OpenAI partnership, and similar shenanigans leading up to it, Stack Overflow is relying on obscure technicalities to violate the essential spirit of the original deal.