The problem isn't that Stack Overflow is *allowing* people to scrape the content...

webstrand · on May 9, 2024

Isn't the data publicly available? https://archive.org/details/stackexchange

mattstir · on May 9, 2024

The publicly available archives released by Stack Exchange are updated roughly quarterly and carry the attribution requirements specified by CC BY-SA and the Stack Exchange ToS.

The article makes it sound like OpenAI is using the API though, rather than the archives. The acceptable use policy forbids scraping the API and live sites, as seen here: https://stackoverflow.com/legal/acceptable-use-policy

hedora · on May 9, 2024

Given the CC license, and the fact that contributors can apparently code, they should scrape the content and be done.

Of course, that’d mean bypassing the scraper blocker. This article is a decent starting point:

https://stackoverflow.com/questions/66413511/how-to-avoid-be...

lamontcg · on May 9, 2024

> And, incidentally, passing zero of that money on to the people who actually created the content.

I mean that is basically SO's entire business model.

People do tons of work for free and SO runs the service and monetizes it.

theendisney · on May 8, 2024

I don't get how you can release something under anything other than all rights reserved without identification. We need to be able to persecute you in case you are not the author. Or is it that I may republish anything under any license? It could be that the platform licenses it in the ToS, but with CC are they not obligated to make it available without obstruction?

Repulsion9513 · on May 9, 2024

Prosecution and persecution are two different things. Persecuting anyone is not a good time :)

If you need to prosecute the person, there are established procedures for that: a DMCA notice, or ultimately a lawsuit over the infringement. That you didn't identify yourself publicly on the site does not make that impossible. In fact the point of the DMCA was to make this easier to handle, because if the provider doesn't comply with your DMCA notice, you can sue the provider.

shkkmo · on May 9, 2024

Requiring identification to publish so that copyright is protected would be massive overreach, and this sort of thinking is why I think copyright is a dangerous concept that needs to be sharply curtailed, not expanded to cover AI training.

In practice, the safest course is to not use content from untrustworthy sources in ways that require a license (i.e., in ways that are not fair use in your applicable jurisdictions).

theendisney · on May 9, 2024

I think by default you just can't use things? Who thought that was a great idea, I don't know. We must be missing an enormous chunk of progress.

Every jurisdiction has its own idea of fair use? That's just hilarious.

I never really thought about people's privacy either, but at first glance you seem to be right.

Do you have any solution to the puzzle? People are quite attached to the concept, and many have built their house on this soil. Appeal to tradition?