Hacker News
Finding a standard dataset format for machine learning (2020) (openml.github.io)
54 points by teleforce on March 17, 2021 | 17 comments


I've been looking for the answer to this question for a while w/r/t large datasets with thousands or millions of files. I'm not concerned as much with programmatic usability as with saving my FS from being overwhelmed by indexing and allocating so many distinct entities. Doing any bookkeeping on a directory of a million files - even hierarchically organized - tends to be very taxing, whereas storing the same amount of data in a single binary file is usually simple to manage.

I'm surprised that ZIP wasn't (isn't?) a contender. Tooling exists everywhere, it lets you mix and match data types, and it seems to hit nearly every point in their comparison. The only point I'm not sure about is "Incremental reads/writes", since it keeps a central directory structure at the end of the file. Incremental reads would need to parse that directory first, then could seek and read randomly; appending is slightly more complicated, since the entire central directory has to be rewritten.
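As a rough illustration of that parse-then-seek read pattern, here is a sketch using Python's stdlib zipfile module (member names and contents are made up):

```python
import io
import zipfile

# Build a small ZIP in memory with two "dataset" members.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("images/0001.txt", "pretend this is image bytes")
    zf.writestr("labels.csv", "id,label\n0001,cat\n")

# Reads are random-access: ZipFile parses the central directory at the
# end of the file once, then each read() seeks straight to the member's
# local entry without scanning the rest of the archive.
with zipfile.ZipFile(buf) as zf:
    names = zf.namelist()
    label_data = zf.read("labels.csv").decode()

print(names)       # ['images/0001.txt', 'labels.csv']
print(label_data)
```

Appending a member in-place would mean overwriting that trailing central directory and writing a new one after the added entry, which is the part that makes incremental writes awkward.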


I would not see ZIP as a file format; it's just a compressed version of some other format. You therefore inherit all the pros and cons of the compressed file, which might be columnar storage (e.g. Parquet) or row-based (e.g. Avro), but those "modern" formats have compression built in, so there's no need for zipping.


I've been looking for a solution to the same problem.

So far we've been using a single server for storage, and developers rsync whatever they need locally. A few million images. Training is usually done locally, but when we do use cloud training, we upload just the dataset we need to S3 and use EC2.

We're a small team, and currently considering moving to a cloud-first infrastructure. The idea is to store each image in S3, and all metadata (annotations etc) in Postgres or something like that, maybe using Postgres's JSON/JSONB feature.

I'd appreciate any thoughts and pointers on handling datasets with a few million images.


Why not SQLite? You get an efficient file format, and can easily add metadata to each file. That is how the Geopackage format works for storing tile caches for mapping data. There are often millions of small files that don't work well on a file system. I think there is even a standard for storing directory structures on SQLite.
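A minimal sketch of that idea using Python's built-in sqlite3 module; the schema and paths here are illustrative, not how GeoPackage actually lays out its tables:

```python
import sqlite3

# One SQLite file standing in for a directory of millions of small files,
# roughly the trick GeoPackage/MBTiles use for map tiles.
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE files (
        path  TEXT PRIMARY KEY,   -- logical path, replaces the FS hierarchy
        meta  TEXT,               -- per-file metadata (JSON, free text, ...)
        data  BLOB                -- the file contents themselves
    )
""")
con.execute(
    "INSERT INTO files VALUES (?, ?, ?)",
    ("train/img_000001.png", '{"label": "cat"}', b"\x89PNG...fake bytes"),
)
con.commit()

# Lookup by path is a single indexed query instead of a directory walk.
meta, data = con.execute(
    "SELECT meta, data FROM files WHERE path = ?",
    ("train/img_000001.png",),
).fetchone()
```

The filesystem only ever sees one file, and the primary-key index does the job that millions of directory entries would otherwise do.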


SQLite is one of the options evaluated in the article. I haven't considered it because I feel that it's a proprietary format, and also because I haven't read up on its improvements in the past ~decade. My view is still that SQLite corrupts easily when people treat it like a proper database and do multithreaded reading/writing; in reality it's a different thing that doesn't quite work that way.


SQLite: "It supports only 2000 columns"

You can raise the SQLITE_MAX_COLUMN (or SQLITE_MAX_VARIABLE_NUMBER) limits before compiling, no?
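For example, assuming you're building from the SQLite amalgamation source; the numbers are illustrative, and per the SQLite docs SQLITE_MAX_COLUMN cannot exceed 32767:

```shell
# Both limits are compile-time #defines on the amalgamation (sqlite3.c):
gcc -DSQLITE_MAX_COLUMN=32767 \
    -DSQLITE_MAX_VARIABLE_NUMBER=250000 \
    -c sqlite3.c -o sqlite3.o
```

Note that any tool linking a stock SQLite (including Python's bundled one) would still see the default limits, so recompiling only helps if you control the whole toolchain.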


You could also take advantage of SQLite's JSON support and store as many "columns" as you like in a single column.
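A quick sketch with Python's built-in sqlite3 module, assuming the underlying SQLite build has the JSON1 functions (enabled by default in recent versions):

```python
import sqlite3

# One TEXT column holds arbitrarily many "features", queried back out
# with SQLite's JSON1 functions instead of dedicated columns.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE samples (id INTEGER PRIMARY KEY, features TEXT)")
con.execute(
    "INSERT INTO samples (features) VALUES (?)",
    ('{"height": 12.5, "width": 3.0, "label": "cat"}',),
)

# json_extract pulls individual "columns" back out inside SQL itself.
row = con.execute(
    "SELECT json_extract(features, '$.label'),"
    "       json_extract(features, '$.height') FROM samples"
).fetchone()
print(row)  # ('cat', 12.5)
```

This sidesteps the column limit, at the cost the sibling comment points out: every read pays for JSON parsing, and you lose per-feature typing and indexing (unless you add generated columns on the extracted paths).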


Wouldn't that be inefficient? The software would then have to parse a json string after the sqlite query.


Then that wouldn't be very portable at all, would it?


There is a follow up article proposing a solution by the TileDB team [1].

[1]https://tiledb.com/blog/tiledb-as-the-data-engine-for-machin...


Pretty good summary. I would say the general rule at this stage is to move away from CSV as much as possible and use either Arrow/Feather or Parquet depending on the use case and the library support you need.


There is also xarray. I've never used it myself, but it was one of the options I looked at.

What put me off was its seemingly close integration with pandas, which I was trying to avoid.

http://xarray.pydata.org/en/stable/why-xarray.html


The on-disk format you would probably want to use with xarray is netCDF - which is generally n-dimensional gridded data, and some metadata about what those dimensions are, the units of the data, and other miscellanies.

One way to think of xarray is as a really nice in-memory representation of a netCDF file with some pretty powerful methods for manipulating that data, in much the same way that you can think of DataFrames as a really nice in-memory representation of a csv/table with some powerful methods.

I use xarray for working with satellite imagery and weather model grids - it's a million times better than the fragile MATLAB code I used to use.
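A tiny sketch of that netCDF-in-memory analogy, with made-up dimension names, coordinates, and units:

```python
import numpy as np
import xarray as xr

# A labeled 3-D grid (time x lat x lon), the shape of data that
# xarray/netCDF were designed around.
temps = xr.DataArray(
    np.random.rand(2, 3, 4),
    dims=("time", "lat", "lon"),
    coords={"lat": [10.0, 20.0, 30.0], "lon": [0.0, 90.0, 180.0, 270.0]},
    name="temperature",
    attrs={"units": "degC"},   # metadata travels with the array
)

# Label-based selection and reductions, instead of bare integer indexing:
at_lat20 = temps.sel(lat=20.0)          # dims: (time, lon)
time_mean = temps.mean(dim="time")      # dims: (lat, lon)
```

With a netCDF backend installed, `temps.to_netcdf(...)` writes the same structure (dimensions, coordinates, units) to disk, which is the round trip the comment above describes.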


This reads a bit like the dismissal of SQLite comes down to a desire not to do any data modeling at all, which seems a bit silly in a discussion about standard formats and schemas. Obviously you shouldn’t put your data into a single hard coded “column per feature” table. So the limitation of “only” 2000 columns is really just there to nudge you in a better direction.


You put it this way because that's how you usually train models. If you don't train models at all on your data, then for sure you don't need to care about such limitations.


All credit to them for not suggesting their own format or storing the data values in (shudder) JSON.


We also need a standard for multi-tenant SaaS apps as well. Niche cases aside, core things should have a pretty standard format for accessing/storing data etc.




