
So how much space would the entire English Wikipedia take up on this filesystem, I wonder.


I've got a (relatively old) snapshot of the English Wikipedia that I'm using for testing. The snapshot is around 200 GiB in 14,000,000 files and compresses down to an 11 GiB DwarFS image.
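
For reference, a minimal sketch of how such an image might be built and mounted, assuming the DwarFS tools (mkdwarfs and the dwarfs FUSE driver) are installed; the paths and the compression level here are hypothetical, not the exact settings used for the numbers above:

    import subprocess

    # Hypothetical paths; assumes mkdwarfs/dwarfs are on PATH.
    src = "/data/enwiki-html"       # directory tree with the HTML snapshot
    img = "/data/enwiki.dwarfs"     # output filesystem image

    # Build the image; -l picks the compression level (higher = smaller, slower).
    subprocess.run(["mkdwarfs", "-i", src, "-o", img, "-l", "7"], check=True)

    # Mount it read-only via FUSE so the files can be read in place.
    subprocess.run(["dwarfs", img, "/mnt/enwiki"], check=True)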


I guess it's without the pictures then? Because if I compare with the ZIM file format (which is optimized for this use case), https://kiwix.org/en/what-is-the-size-of-wikipedia/ says: "As of October 2022, the Full English Wikipedia (ca. 6.5 million articles), with images will use up 91GB of storage space (German and French, the second-largest: 36 GB). (...) If you can do without the images (what we call the nopic version), then you are down to 46 GB."


Correct, there are no images in the data except for 68 PNGs. It's just HTML files.


How is it possible that a bunch of HTML files adds up to 200 GB? Is it because of some kind of overhead?

Would a database dump maybe be smaller?


Well, "a bunch" is an understatement, I bet they have a bit more than just a bunch! It does pass a sniff test, since from https://en.wikipedia.org/wiki/Wikipedia:Size_of_Wikipedia:

>As of May 2015, the current version of the English Wikipedia article / template / redirect text was about 51 GB uncompressed in XML format.

Compressed data at the same time was 11.5 GB. And that's data from 9 years ago, and just English Wikipedia.

For comparison, I collect leaked password dumps, and combined (after deduplication) they also run into the hundreds of GB. And those are just username:password lines, not full text.


That's a substantial reduction. Thanks for testing this!


It's ever so slightly smaller than a .tar.xz of the same data. The main difference is that you don't have to fully extract it in order to access the data.
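
To make that difference concrete, a rough sketch (the paths and the member name are made up): reading one page from a mounted DwarFS image is an ordinary file read, while pulling the same member out of a .tar.xz means decompressing the xz stream up to that member:

    import tarfile

    member = "A/Ada_Lovelace.html"  # hypothetical page inside the archive

    # From a mounted DwarFS image: a plain read; only the blocks backing
    # this one file need to be decompressed.
    with open("/mnt/enwiki/" + member, "rb") as f:
        page = f.read()

    # From a .tar.xz: the xz stream is decompressed sequentially until the
    # member is reached, so access cost grows with its position in the archive.
    with tarfile.open("/data/enwiki.tar.xz", "r:xz") as tf:
        page = tf.extractfile(member).read()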


Only one way to find out :-)



