
The problem isn't that Stack Overflow is allowing people to scrape the content. The problem is that Stack Overflow is preventing some people from scraping the content, in order to collect money from others. And, incidentally, passing zero of that money on to the people who actually created the content.

(Nearly) none of the people who are presently pissed off would have complained if Stack Overflow had continued to allow all comers to scrape the content and train LLMs on it, nor if Stack Overflow had released the entire finished collection of content under the same CC-BY-SA license that was demanded of each contributor.

With the OpenAI partnership, and similar shenanigans leading up to it, Stack Overflow is relying on obscure technicalities to violate the essential spirit of the original deal.



Isn't the data publicly available? https://archive.org/details/stackexchange


The publicly-available archives released by Stack Exchange are updated roughly quarterly and have the attribution requirements as specified by CC BY-SA + the Stack Exchange ToS.

The article makes it sound like OpenAI is using the API though, rather than the archives. The API and live sites forbid scraping within the acceptable use policy, as seen here: https://stackoverflow.com/legal/acceptable-use-policy


Given the CC license, and the fact that contributors can apparently code, they should scrape the content and be done.

Of course, that’d mean bypassing the scraper blocker. This article is a decent starting point:

https://stackoverflow.com/questions/66413511/how-to-avoid-be...


> And, incidentally, passing zero of that money on to the people who actually created the content.

I mean that is basically SO's entire business model.

People do tons of work for free and SO runs the service and monetizes it.


I don't get how you can release something under anything other than all rights reserved without identification. We need to be able to persecute you in case you are not the author. Or is it that I may republish anything under any license? It could be that the platform licenses it in the ToS, but with CC are they not obligated to make it available without obstruction?


Prosecution and persecution are two different things. Persecuting anyone is not a good time :)

Why, if you're not allowed to release under a license, should you be able to release all rights reserved (which can still be a copyright violation!)?

If you need to prosecute the person, there are established procedures for that: a DMCA notice, or ultimately a lawsuit over the infringement. That you didn't identify yourself publicly on the site does not make that impossible. In fact the point of the DMCA was to make it easier to handle this, because if the provider doesn't comply with your DMCA notice, you can sue the provider.


Requiring identification to publish so that copyright is protected would be massive overreach, and this sort of thinking is why I think copyright is a dangerous concept that needs to be sharply curtailed, not expanded to cover AI training.

In practice, the safest course is to not use content from untrustworthy sources in ways that require a license (aka in ways that are not fair use in your applicable jurisdictions).


I think by default you just can't use things? Who thought that was a great idea I don't know. We must be missing an enormous chunk of progress.

Every jurisdiction has its own idea of fair use? That's just hilarious.

I never really thought about people's privacy either, but at first glance you seem to be right.

Do you have any solution to the puzzle? People are quite attached to the concept and many build their house on this soil. Appeal to tradition?



