In my world, anything that isn't "identical to R's dplyr API but faster" just is...

cigrainger · on Dec 17, 2021

I've been working on a dataframe library for Elixir that's built on top of Polars and that's heavily influenced by dplyr if you're interested in checking it out: https://github.com/elixir-nx/explorer

anko · on Dec 17, 2021

that's really cool, thanks!

pdeffebach · on Dec 17, 2021

DataFramesMeta.jl might be exactly what you are looking for then! The syntax is very close to dplyr, but has performance benefits thanks to Julia.

Here is a tutorial for those familiar with dplyr: https://juliadata.github.io/DataFramesMeta.jl/stable/dplyr/

fault1 · on Dec 17, 2021

DataFramesMeta is great!

But I always get confused by the name. Since DataFrames.jl is lower level, shouldn't that be DataFramesBase.jl and the meta package be DataFrames.jl?

pdeffebach · on Dec 17, 2021

Yes it absolutely needs a new name!

Hasnep · on Dec 17, 2021

The convention in Julia is that a package that defines a type Abc is called Abcs.jl. Also, DataFrames.jl provides its own manipulation functions which DataFramesMeta is a wrapper around using metaprogramming, hence the name.

fault1 · on Dec 17, 2021

That makes sense, but I still think the meta name is confusing. I mean, as a user the fact that it was implemented using metaprogramming techniques has no bearing, it's an implementation detail. Actually, my brain never thought to associate meta in this context with metaprogramming. Makes sense in hindsight, but still confusing.

But still, I can't really come up with a nicer name. VerbalDataFrames to match the dplyr verbs idiom?

Hasnep · on Dec 20, 2021

Yeah, I agree it's not a good name. I think using the word macro instead of meta is more useful to the user, something like DataFramesMacros.jl.

davnn · on Dec 17, 2021

One of the piping macro packages + DataFrames.jl works as well.

vavooom · on Dec 17, 2021

Also worth plugging the advanced speed of R’s data.table package which continues to trump dplyr to this day. The syntax is also more compact and straightforward once you understand how to query data with it.

civilized · on Dec 17, 2021

I don't like it as much as dplyr and I stand behind that. It's too "clever", especially with respect to joins.

Everything is fine "once you understand how to use it", even assembly code, but it's not equally expressive or intuitive. So I don't value data.table speed that much, it's my thinking and typing speed that's usually the limiting factor. I would always recommend dplyr over anything else for someone learning how to use tables.

I also can't help but point out that data.table has the worst first FAQ answer I've ever seen in software documentation: https://cran.r-project.org/web/packages/data.table/vignettes.... Just astonishingly bad. I could write an essay about the unique and diverse ways in which this thing is both incredibly poorly organized and deeply user-hostile.

But if you truly have a need for speed on large datasets, it may be for you.

nojito · on Dec 17, 2021

The FAQ isn't for new data.table users.

https://rdatatable.gitlab.io/data.table/articles/datatable-i...

Which is why it isn't really linked anywhere else.

minimaxir · on Dec 17, 2021

There is an official dplyr extension that leverages data.table: https://dtplyr.tidyverse.org/

vatican_banker · on Dec 17, 2021

In what way does data.table trump dplyr? Genuinely interested in knowing.

While data.table is faster than dplyr, data manipulations with data.table are difficult to read/understand/maintain.

dplyr also grew into a full-fledged set of libraries for data-related projects (the tidyverse). These libraries are _very_ well thought out and enable productivity with a minimal learning curve [anecdotal]

extr · on Dec 17, 2021

the easiest way to think about it is data.table is for people who are doing a lot of exploratory data analysis every day. If you're doing the same thing over and over, it makes sense to create a DSL specific to that task and optimize the hell out of it. that's basically data.table.

dplyr is for everyone else, and it's great and important that it exists, because most people don't want to (and shouldn't need to) learn a DSL to do some basic filtering/sorting/grouping of 100mb of data.

RcrdBrt · on Dec 17, 2021

Anecdotal data: I found that data.table ingestion speed with fread() trumps absolutely everything else

civilized · on Dec 17, 2021

This observation is pretty widely shared.

nojito · on Dec 17, 2021

The supposed difficulty of reading it is a misconception.

    Dt[rows, columns, groups]

Assuming your dplyr code is generally split-apply-combine, the dt version is shorter and easier to reason about.

https://atrebas.github.io/post/2019-03-03-datatable-dplyr/

civilized · on Dec 17, 2021

I disagree. Doing data manipulation one action at a time in a piped sequence is easiest to reason about because the state right before you apply a new operation is always clear.

data.table, on the other hand, is a fancy clever gadget with many knobs and buttons you have to turn and press just so to get the desired result. It's only simple if all you do is filter, group by, and summarize.

To illustrate, let's look at what you have to do in data.table in order to achieve the equivalent of a grouped filter in dplyr (from the dtplyr translation vignette):

dplyr:

  df %>% 
    group_by(a) %>%
    filter(b < mean(b))

data.table:

  DT[DT[, .I[b < mean(b)],
        by = .(a)]$V1]

Compared to the simple, declarative feel of the dplyr, there's a lot of weird stuff going on in the data.table version. You have to put DT inside itself? What is .I? Where did V1 come from? Janky stuff.

(And yes I know precisely what is going on in the data.table version, I just think it's ugly and illustrates my point about composability and legibility extremely well.)
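For readers who know neither library, here is a rough plain-Python sketch of what a grouped filter means: keep the rows whose b is below the mean of b within their own group a. The sample rows and the helper name `grouped_filter` are made up for illustration and are not part of dplyr or data.table.

```python
from collections import defaultdict
from statistics import mean


def grouped_filter(rows, group_key, value_key):
    """Keep rows whose value is below the mean of their own group."""
    # Split: collect the values belonging to each group.
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row[value_key])
    # Apply: compute one mean per group.
    group_means = {g: mean(vals) for g, vals in groups.items()}
    # Combine: test each row against its own group's mean, keeping row order.
    return [r for r in rows if r[value_key] < group_means[r[group_key]]]


# Hypothetical sample data with columns a and b.
rows = [
    {"a": "x", "b": 1},
    {"a": "x", "b": 3},
    {"a": "y", "b": 10},
    {"a": "y", "b": 20},
]
result = grouped_filter(rows, "a", "b")
```

With this data the result keeps the b = 1 row from group x and the b = 10 row from group y. Filtering against the overall mean of b (8.5 here) would keep both x rows instead, so the grouping changes the answer.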

The reason data.table has all these independent knobs is because it wants you to cram your entire query into a single command, so it can optimize the query more easily and squeeze every drop of performance. NOT because it's more understandable, because it isn't.

The best of both worlds -- an optimizable query and one-action-at-a-time syntax -- can be achieved with a lazy system like Apache Spark or dtplyr.
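To make the "lazy" idea concrete, here is a toy Python sketch: each method call only records a step, and nothing runs until `collect()` is called. The class `LazyTable` and its methods are hypothetical, not the API of Spark or dtplyr, and this sketch does no optimization at all. It only shows where a real system would get the chance to inspect the whole plan before executing it.

```python
class LazyTable:
    """Records operations and only runs them when collect() is called."""

    def __init__(self, rows, steps=()):
        self._rows = rows
        self._steps = tuple(steps)

    def filter(self, predicate):
        # Return a new table with one more recorded step; no work happens yet.
        return LazyTable(self._rows, self._steps + (("filter", predicate),))

    def select(self, *columns):
        return LazyTable(self._rows, self._steps + (("select", columns),))

    def plan(self):
        # The full sequence of steps is visible before execution,
        # which is what would let a real engine reorder or fuse them.
        return [name for name, _ in self._steps]

    def collect(self):
        rows = self._rows
        for name, arg in self._steps:
            if name == "filter":
                rows = [r for r in rows if arg(r)]
            elif name == "select":
                rows = [{c: r[c] for c in arg} for r in rows]
        return list(rows)


# Hypothetical sample data.
data = [{"a": "x", "b": 1}, {"a": "y", "b": 2}]
query = LazyTable(data).filter(lambda r: r["b"] > 1).select("a")
steps = query.plan()
result = query.collect()
```

The user still writes one action at a time, but the engine sees `["filter", "select"]` as a whole before touching the data.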

nojito · on Dec 19, 2021

Your code golf example makes no sense.

    b_mean <- dt[, mean(b)]

    dt[b < b_mean, by = .(a)]

Unlike the dplyr solution, the dt solution is robust and we can independently test to make sure the mean of b makes sense.

The very easy-to-reason-about concept of dt[rows, columns, groups] makes the code extremely clear.

Your translation example is absolutely bonkers because it’s trying to pigeonhole the simplicity of dt into the nonsense that is dplyr.

temp8964 · on Dec 17, 2021

The easiest to understand data frame API syntax is SQL: select cols from df where rows match condition group by grouping cols.

data.table syntax is just like that. But less verbose. Plus super fast. No reason to not love it.
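As a small illustration of that SQL shape, here is a sketch using Python's built-in sqlite3 module. The table name df, the columns a and b, and the sample values are placeholders. The comment in the code marks how the clauses line up with the rows, columns, groups slots mentioned in this thread.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE df (a TEXT, b REAL)")
conn.executemany(
    "INSERT INTO df VALUES (?, ?)",
    [("x", 1), ("x", 3), ("y", 10), ("y", 20)],
)

# WHERE picks the rows, SELECT picks the columns, GROUP BY picks the groups.
query = """
    SELECT a, AVG(b) AS mean_b
    FROM df
    WHERE b > 1
    GROUP BY a
    ORDER BY a
"""
result = conn.execute(query).fetchall()
conn.close()
```

Here `result` holds one row per group, with the mean of b computed only over the rows that passed the WHERE clause.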

civilized · on Dec 17, 2021

I agree that if that's all you do with data, data.table makes it easy.

gullywhumper · on Dec 17, 2021

One plus with dplyr is that I can share the code with non-R programmers (and even some non-programmers) and they can follow what is happening pretty easily, while data.table takes some more explanation.

extr · on Dec 17, 2021

dplyr API is not ideal in my experience. Overly verbose and confusing group/melt/cast operators. I much much prefer data.table. In your edit you mention concision, data.table is practically the platonic ideal of that!

civilized · on Dec 17, 2021

Meh. Some people will never stop using Perl or APL because you can get anything done in five random characters (well, anything the language is optimized to express, everything else is a lot harder). I respect it but it's not for me.

The tidyverse has the most advanced and intuitive versions of all the things you mention IMO. It has evolved a lot in the past couple years and your impressions of it could be out of date.

There is also the dtplyr backend for data.table speed with dplyr syntax, but I don't even bother because dplyr is almost always fast enough for me.

extr · on Dec 17, 2021

I did go check out what's new in the tidyverse after your comment and was pleased to see new functions like pivot_wider and pivot_longer replacing the extremely confusing mess of spread and gather. So it's great to see the ecosystem evolving toward better usability. However, I would hardly count it as a victory when late in the game you have to change the API for some core data manipulation functions because you made them too confusing the first time around.

I think you are also maybe assuming everyone has the same use-case as you for data manipulation libraries. If you are coming from a non-programming context and picking up R for the first time, no doubt tidyverse is the way to do that. The verbosity is obviously a benefit if you're having to read someone else's code and are not interested in learning a DSL just to understand what columns are being filtered on or dropped or whatever.

But if you are doing data analysis full time and are writing thousands of lines of throwaway EDA code a week, most of it only to be seen by yourself, the concision and speed that data.table offers is basically second to none, in any language. Rapid iteration for you personally is the point. Less typing is good, because you're trying to move as fast as possible to explore hypotheses. Execution speed on medium sized data is important, because a few extra seconds on every run matters a lot when you are running 500 micro-batches of analysis code a day. And as the h2o benchmarks show, data.table is still quite a bit faster than dplyr. Obviously not everyone needs the speed, but a lot of us do!

civilized · on Dec 17, 2021

It's my hypothesis that pretty much everyone who loves data.table is a finance/trading type person who as you say needs to quickly write tons of throwaway exploratory code to analyze large stock price datasets or the like.

I would probably prefer data.table to dplyr in that use case as well. The creator of data.table clearly comes from that background and wrote it for those kinds of workloads.

I will also admit that the latest data.table tutorials suggest a lot of improvement over time. data.table made some truly WTF decisions in its early versions and has backtracked on all of it. The join API is much more reasonable now and it supports non-equijoins, which for many people could be the decider vs dplyr just by itself.

The dplyr API has only evolved so much because Hadley set insanely high standards for how powerful and intuitive it should be. So personally I don't count it against them that they didn't get it 100% right the first time.... even though I personally have been burned a couple times by all the changes. I think it's worth it for what they have achieved.

Not that it's all roses. Tidyverse stack traces have become kind of horrible. They're dozens and dozens of layers deep and you have to be pretty experienced to sift through the noise. I'm an old hand and know how to deal with it, which is probably the way a lot of people feel about their favorite table package... even (gag) Pandas.

extr · on Dec 17, 2021

In my case your guess is completely correct, as I learned data.table in a financial company analyzing large insurance datasets :)

I apologize if I came across as a hardliner. Sometimes I feel like data.table is not well advertised for how capable it is, so I will defend the library if given the chance. It's surprising how many "big data workloads" you can replace with a high memory cloud instance and a simple data.table script. Cheers to using the right tool for the right job.

civilized · on Dec 17, 2021

My background is (somewhat) similar but for some reason I've been a stick in the mud keeping on with dplyr. Somehow I feel the verbosity helps make sure I get things right. And a lot of my code is not throwaway, I have pipelines I have to maintain and teach others to maintain. (There's a guy on my team replacing some of my uglier code with simple data.table non-equijoins and I can't even argue with him as I mentioned earlier!)

I'm glad you're having success with data.table and I totally support you against the forces of evil trying to make us use Spark or whatever is the latest big data nonsense to analyze a few million rows.

It's like how we may not agree what project management tool to use but we all agree it's not JIRA :)

I may end up switching to data.table after all. I find dplyr easier to reason about for complex production pipelines that need to be precisely "correct", but all the package developers are raising the bar all the time and data.table may be OK for this use case by now. I definitely do feel the pain point of dplyr slowness here and there.

nuq · on Dec 17, 2021

True, data.table is much simpler and faster, which is one of the reasons I switched from dplyr to data.table.

ttymck · on Dec 17, 2021

Is there a dplyr API for pandas? That would seem like a very valuable "translation" layer for transitioning or cross-language devs. Maybe there is some language barrier to implementing an elegant/faithful version in Python?

civilized · on Dec 17, 2021

There have been a number of interesting attempts at this. They have names like dplython, and haven't really caught on widely. Python isn't really the best language to build a dplyr-like API in, since both the structure and the culture of the language are against metaprogramming and nonstandard evaluation to create DSLs.

otsaloma · on Dec 17, 2021

Agreed, dplyr is great.

I built my own data frame implementation on top of NumPy specifically trying to accomplish a better API, similar to dplyr. It's not exactly the same naming or operations, but should feel familiar and much simpler and more consistent than Pandas. And no indexes or axes.

Having done this, a couple of notes on what will unavoidably differ in Python:

* It probably makes more sense in Python to use classes, so method chaining instead of function piping. I wish one could syntactically skip enclosing parentheses in Python though, method chains look a bit verbose.

* Python doesn't have R's "non-standard evaluation", so you end up needing lambda functions for arguments in method chains and group-wise aggregation etc. I'd be interested if someone has a better solution.

* NumPy (and Pandas) are still missing a proper missing value (NA). It's a big pain to try to work around that.
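The first two points can be sketched with a toy class in plain Python. `Frame`, `GroupedFrame`, and their methods are hypothetical names for illustration only and are not the API of dataiter, Pandas, or any other real library. The sketch shows method chaining wrapped in parentheses and lambdas standing in for R's non-standard evaluation.

```python
from collections import defaultdict


class Frame:
    """Toy data frame: a list of dict rows with chainable methods."""

    def __init__(self, rows):
        self.rows = list(rows)

    def filter(self, predicate):
        # The predicate is a lambda because Python cannot capture a bare
        # expression like `b > 1` and evaluate it against the columns later.
        return Frame(r for r in self.rows if predicate(r))

    def group_by(self, key):
        return GroupedFrame(self.rows, key)


class GroupedFrame:
    def __init__(self, rows, key):
        self.key = key
        self.groups = defaultdict(list)
        for r in rows:
            self.groups[r[key]].append(r)

    def aggregate(self, **aggregations):
        # Each keyword maps an output column name to a function of the
        # rows in one group.
        out = []
        for value, rows in self.groups.items():
            row = {self.key: value}
            for name, func in aggregations.items():
                row[name] = func(rows)
            out.append(row)
        return Frame(out)


# Hypothetical sample data with columns a and b.
data = Frame([
    {"a": "x", "b": 1},
    {"a": "x", "b": 3},
    {"a": "y", "b": 10},
    {"a": "y", "b": 20},
])

# The enclosing parentheses are needed so the chain can span several lines.
result = (
    data
    .filter(lambda r: r["b"] > 1)
    .group_by("a")
    .aggregate(total=lambda rows: sum(r["b"] for r in rows))
    .rows
)
```

The chain reads top to bottom like a pipe, at the cost of the extra parentheses and a `lambda` in every step that refers to a column.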

https://github.com/otsaloma/dataiter

matham · on Dec 17, 2021

>NumPy (and Pandas) is still missing a proper missing value (NA).

But if it's missing a missing value, doesn't that mean that it has a proper missing value?

I'll let myself out now...

_Wintermute · on Dec 17, 2021

You're clearly on the dplyr bandwagon, but as someone who wrote R code for about 10 years before dplyr came along, and saw the direction the language was going, it's the reason I now mainly use python. I just could not put up with the non-standard evaluation so everything ends up being a 100+ line script instead of composable functions, and breaking API changes every 6 months.

pietroppeter · on Dec 17, 2021

Still very small, but Nim's dataframe library (datamancer) has a dplyr API (and it is fast): https://github.com/SciNim/Datamancer

Being in Nim, it will be easy also to add sweet DSLs.

BiteCode_dev · on Dec 17, 2021

You don't need to write "import pandas; pandas.bla()", you can do "from pandas import *; anything_in_pandas()" if you want quick and dirty.

FridgeSeal · on Dec 17, 2021

And if you want you and your team mates to hate you when they need to work on your code later, and you’ve got random, mystery functions all over the place.

cabalamat · on Dec 17, 2021

> dplyr

Ths s lbrry whs nm nds mr vwls. F m tlkng t smn, hw m sppsd t prnc t?