
In my world, anything that isn't "identical to R's dplyr API but faster" just isn't quite worth switching for. There's absolutely no contest: dplyr has the most productive API and that matters to me more than anything else. But I'm glad to see Polars moves away from the kludgey sprawl of the Pandas API towards the perfection of dplyr... while also being blazingly fast!

Now just mix in a bit of DSL so people aren't obligated* to write lame boilerplate like "pandas.blahblah" or "polars.blahblah" just to reference a freaking column, and you're there!

*If you like the boilerplate for "production robustness" or whatever, go wild, but analysts and scientists benefit from the option to write more concisely.



I've been working on a dataframe library for Elixir that's built on top of Polars and that's heavily influenced by dplyr if you're interested in checking it out: https://github.com/elixir-nx/explorer


that's really cool, thanks!


DataFramesMeta.jl might be exactly what you are looking for then! The syntax is very close to dplyr, but has performance benefits thanks to Julia.

Here is a tutorial for those familiar with dplyr: https://juliadata.github.io/DataFramesMeta.jl/stable/dplyr/


DataFramesMeta is great!

But I always get confused by the name. Since DataFrames.jl is lower level shouldn't that be DataFramesBase.jl and the meta package be DataFrames.jl?


Yes it absolutely needs a new name!


The convention in Julia is that a package that defines a type Abc is called Abcs.jl. Also, DataFrames.jl provides its own manipulation functions which DataFramesMeta is a wrapper around using metaprogramming, hence the name.


That makes sense, but I still think the meta name is confusing. I mean, as a user the fact that it was implemented using metaprogramming techniques has no bearing, it's an implementation detail. Actually, my brain never thought to associate meta in this context with metaprogramming. Makes sense in hindsight, but still confusing.

But still, I can't really come up with a nicer name. VerbalDataFrames to match the dplyr verbs idiom?


Yeah, I agree it's not a good name. I think using the word macro instead of meta is more useful to the user, something like DataFramesMacros.jl.


One of the piping macro packages plus DataFrames.jl works as well.


Also worth plugging the advanced speed of R’s data.table package which continues to trump dplyr to this day. The syntax is also more compact and straightforward once you understand how to query data with it.


I don't like it as much as dplyr and I stand behind that. It's too "clever", especially with respect to joins.

Everything is fine "once you understand how to use it", even assembly code, but it's not equally expressive or intuitive. So I don't value data.table speed that much, it's my thinking and typing speed that's usually the limiting factor. I would always recommend dplyr over anything else for someone learning how to use tables.

I also can't help but point out that data.table has the worst first FAQ answer I've ever seen in software documentation: https://cran.r-project.org/web/packages/data.table/vignettes.... Just astonishingly bad. I could write an essay about the unique and diverse ways in which this thing is both incredibly poorly organized and deeply user-hostile.

But if you truly have a need for speed on large datasets, it may be for you.


The FAQ isn't for new data.table users.

https://rdatatable.gitlab.io/data.table/articles/datatable-i...

Which is why it isn't really linked anywhere else.


There is an official dplyr extension that leverages data.table: https://dtplyr.tidyverse.org/


In what way does data.table trump dplyr? Genuinely interested in knowing.

While data.table is faster than dplyr, data manipulations with data.table are difficult to read/understand/maintain.

dplyr also grew into a full-fledged set of libraries for data-related projects (the tidyverse). These libraries are _very_ well thought out and enable productivity with a minimal learning curve [anecdotal]


The easiest way to think about it is that data.table is for people who are doing a lot of exploratory data analysis every day. If you're doing the same thing over and over, it makes sense to create a DSL specific to that task and optimize the hell out of it. That's basically data.table.

dplyr is for everyone else, and it's great and important that it exists, because most people don't want to (and shouldn't need to) learn a DSL to do some basic filtering/sorting/grouping of 100mb of data.


Anecdotal data: I found that data.table ingestion speed with fread() trumps absolutely everything else


This observation is pretty widely shared.


The claim that it is difficult to read is a misconception.

    DT[rows, columns, groups]
Assuming your dplyr code is generally split-apply-combine, the data.table version is shorter and easier to reason about.

https://atrebas.github.io/post/2019-03-03-datatable-dplyr/


I disagree. Doing data manipulation one action at a time in a piped sequence is easiest to reason about because the state right before you apply a new operation is always clear.

data.table, on the other hand, is a fancy clever gadget with many knobs and buttons you have to turn and press just so to get the desired result. It's only simple if all you do is filter, group by, and summarize.

To illustrate, let's look at what you have to do in data.table in order to achieve the equivalent of a grouped filter in dplyr (from the dtplyr translation vignette):

dplyr:

  df %>% 
    group_by(a) %>%
    filter(b < mean(b))
data.table:

  DT[DT[, .I[b < mean(b)],
        by = .(a)]$V1]
Compared to the simple, declarative feel of the dplyr version, there's a lot of weird stuff going on in the data.table version. You have to put DT inside itself? What is .I? Where did V1 come from? Janky stuff.

(And yes I know precisely what is going on in the data.table version, I just think it's ugly and illustrates my point about composability and legibility extremely well.)

The reason data.table has all these independent knobs is because it wants you to cram your entire query into a single command, so it can optimize the query more easily and squeeze every drop of performance. NOT because it's more understandable, because it isn't.

The best of both worlds -- an optimizable query and one-action-at-a-time syntax -- can be achieved with a lazy system like Apache Spark or dtplyr.
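To make the "lazy" idea concrete, here is a minimal sketch in plain Python. The `LazyTable` class and its method names are hypothetical and invented for illustration; this is not how Spark or dtplyr are implemented. Each verb only records a step, and `collect()` later runs all recorded steps in a single pass over the rows, so the user writes one action at a time while the engine still sees the whole query.

```python
class LazyTable:
    """Hypothetical lazy table: verbs record steps, collect() runs them."""

    def __init__(self, rows, steps=()):
        self._rows = rows
        self._steps = tuple(steps)

    def filter(self, predicate):
        # Nothing is computed here; the step is only appended to the plan.
        return LazyTable(self._rows, self._steps + (("filter", predicate),))

    def mutate(self, name, func):
        return LazyTable(self._rows, self._steps + (("mutate", (name, func)),))

    def collect(self):
        # Run every recorded step in a single pass over the rows.
        out = []
        for row in self._rows:
            row = dict(row)  # copy so the input rows are left untouched
            keep = True
            for kind, arg in self._steps:
                if kind == "filter":
                    if not arg(row):
                        keep = False
                        break
                else:
                    name, func = arg
                    row[name] = func(row)
            if keep:
                out.append(row)
        return out


rows = [{"a": 1, "b": 2}, {"a": 5, "b": 1}, {"a": 7, "b": 4}]
query = (
    LazyTable(rows)
    .filter(lambda r: r["a"] > 2)
    .mutate("c", lambda r: r["a"] * r["b"])
    .filter(lambda r: r["c"] > 10)
)
result = query.collect()  # only now is any work done
```

A real engine would rewrite the recorded plan before running it. This sketch only shows that deferring execution gives the engine the complete pipeline to work with.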


Your code golf example makes no sense.

    b_mean <- dt[, mean(b)]

    dt[b < b_mean, by = .(a)]
Unlike the dplyr solution, the dt solution is robust and we can independently test to make sure the mean of b makes sense.

The very easy to reason about concept of dt[rows, columns, groups] makes the code extremely clear.

Your translation example is absolutely bonkers because it’s trying to pigeonhole the simplicity of dt into the nonsense that is dplyr.
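For anyone following along, it may help to pin down what a grouped filter computes, since a mean taken once over the whole table is a different quantity from a mean taken within each group. Below is a small plain-Python sketch on made-up data. The function names and the sample rows are assumptions for illustration and do not come from either library.

```python
from collections import defaultdict
from statistics import mean

# Made-up sample data with a grouping column a and a value column b.
rows = [
    {"a": "x", "b": 1}, {"a": "x", "b": 3},
    {"a": "y", "b": 10}, {"a": "y", "b": 20}, {"a": "y", "b": 60},
]


def grouped_filter(rows):
    """Keep rows whose b is below the mean of b within their own group a."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["a"]].append(row["b"])
    group_mean = {key: mean(values) for key, values in groups.items()}
    return [row for row in rows if row["b"] < group_mean[row["a"]]]


def global_filter(rows):
    """Keep rows whose b is below the mean of b over the whole table."""
    overall = mean(row["b"] for row in rows)
    return [row for row in rows if row["b"] < overall]


kept_grouped = grouped_filter(rows)  # compares against 2 for x and 30 for y
kept_global = global_filter(rows)    # compares every row against 18.8
```

On this data the grouped version keeps the rows with b of 1, 10 and 20, while the global version keeps 1, 3 and 10, so the two approaches only agree when the group means happen to coincide with the overall mean.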


The easiest to understand data frame API syntax is SQL: select cols from df where rows match condition group by grouping cols.

data.table syntax is just like that. But less verbose. Plus super fast. No reason to not love it.
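As a rough illustration of that mapping, here is a sketch using Python's built-in sqlite3 module on a made-up table. The table name, column names, and values are assumptions for the example. The WHERE clause plays the role of "rows", the SELECT list plays "columns", and GROUP BY plays "groups".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE df (a TEXT, b REAL)")
conn.executemany(
    "INSERT INTO df VALUES (?, ?)",
    [("x", 1.0), ("x", 3.0), ("y", 10.0), ("y", 20.0), ("y", 60.0)],
)

# rows    -> WHERE b > 1
# columns -> SELECT a, AVG(b)
# groups  -> GROUP BY a
# A data.table call of roughly the same shape: dt[b > 1, .(mean_b = mean(b)), by = a]
query = """
    SELECT a, AVG(b) AS mean_b
    FROM df
    WHERE b > 1
    GROUP BY a
    ORDER BY a
"""
result = conn.execute(query).fetchall()
conn.close()
```

Here `result` holds one (a, mean_b) tuple per group, computed only from the rows that passed the WHERE condition.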


I agree that if that's all you do with data, data.table makes it easy.


One plus with dplyr is that I can share the code with non-R programmers (and even some non-programmers) and they can follow what is happening pretty easily, while data.table takes some more explanation.


dplyr API is not ideal in my experience. Overly verbose and confusing group/melt/cast operators. I much much prefer data.table. In your edit you mention concision, data.table is practically the platonic ideal of that!


Meh. Some people will never stop using Perl or APL because you can get anything done in five random characters (well, anything the language is optimized to express, everything else is a lot harder). I respect it but it's not for me.

The tidyverse has the most advanced and intuitive versions of all the things you mention IMO. It has evolved a lot in the past couple years and your impressions of it could be out of date.

There is also the dtplyr backend for data.table speed with dplyr syntax, but I don't even bother because dplyr is almost always fast enough for me.


I did go check out what's new in the tidyverse after your comment and was pleased to see new functions like pivot_wider and pivot_longer replacing the extremely confusing mess of spread and gather. So it's great to see the ecosystem evolving toward better usability. However, I would hardly count it as a victory when late in the game you have to change the API for some core data manipulation functions because you made them too confusing the first time around.

I think you are also maybe assuming everyone has the same use-case as you for data manipulation libraries. If you are coming from a non-programming context and picking up R for the first time, no doubt tidyverse is the way to do that. The verbosity is obviously a benefit if you're having to read someone else's code and are not interested in learning a DSL just to understand what columns are being filtered on or dropped or whatever.

But if you are doing data analysis full time and are writing thousands of lines of throwaway EDA code a week, most of it only to be seen by yourself, the concision and speed that data.table offers is basically second to none, in any language. Rapid iteration for you personally is the point. Less typing is good, because you're trying to move as fast as possible to explore hypotheses. Execution speed on medium sized data is important, because a few extra seconds on every run matters a lot when you are running 500 micro-batches of analysis code a day. And as the h2o benchmarks show, data.table is still quite a bit faster than dplyr. Obviously not everyone needs the speed, but a lot of us do!


It's my hypothesis that pretty much everyone who loves data.table is a finance/trading type person who as you say needs to quickly write tons of throwaway exploratory code to analyze large stock price datasets or the like.

I would probably prefer data.table to dplyr in that use case as well. The creator of data.table clearly comes from that background and wrote it for those kind of workloads.

I will also admit that the latest data.table tutorials suggest a lot of improvement over time. data.table made some truly WTF decisions in its early versions and has backtracked on all of it. The join API is much more reasonable now and it supports non-equijoins, which for many people could be the decider vs dplyr just by itself.

The dplyr API has only evolved so much because Hadley set insanely high standards for how powerful and intuitive it should be. So personally I don't count it against them that they didn't get it 100% right the first time.... even though I personally have been burned a couple times by all the changes. I think it's worth it for what they have achieved.

Not that it's all roses. Tidyverse stack traces have become kind of horrible. They're dozens and dozens of layers deep and you have to be pretty experienced to sift through the noise. I'm an old hand and know how to deal with it, which is probably the way a lot of people feel about their favorite table package... even (gag) Pandas.


In my case your guess is completely correct, as I learned data.table in a financial company analyzing large insurance datasets :)

I apologize if I came across as a hardliner. Sometimes I feel like data.table is not well advertised for how capable it is, so I will defend the library if given the chance. It's surprising how many "big data workloads" you can replace with a high memory cloud instance and a simple data.table script. Cheers to using the right tool for the right job.


My background is (somewhat) similar but for some reason I've been a stick in the mud keeping on with dplyr. Somehow I feel the verbosity helps make sure I get things right. And a lot of my code is not throwaway, I have pipelines I have to maintain and teach others to maintain. (There's a guy on my team replacing some of my uglier code with simple data.table non-equijoins and I can't even argue with him as I mentioned earlier!)

I'm glad you're having success with data.table and I totally support you against the forces of evil trying to make us use Spark or whatever is the latest big data nonsense to analyze a few million rows.

It's like how we may not agree what project management tool to use but we all agree it's not JIRA :)

I may end up switching to data.table after all. I find dplyr easier to reason about for complex production pipelines that need to be precisely "correct", but all the package developers are raising the bar all the time and data.table may be OK for this use case by now. I definitely do feel the pain point of dplyr slowness here and there.


True, data.table is much simpler and faster, which is one of the reasons I switched from dplyr to data.table.


Is there a dplyr API for pandas? That would seem like a very valuable "translation" layer for transitioning or cross-language devs. Maybe there is some language barrier to implementing an elegant/faithful version in Python?


There have been a number of interesting attempts at this. They have names like dplython, and haven't really caught on widely. Python isn't really the best language to build a dplyr-like API in, since both the structure and the culture of the language are against metaprogramming and nonstandard evaluation to create DSLs.


Agreed, dplyr is great.

I built my own data frame implementation on top of NumPy specifically trying to accomplish a better API, similar to dplyr. It's not exactly the same naming or operations, but should feel familiar and much simpler and more consistent than Pandas. And no indexes or axes.

Having done this, a couple of notes on what will unavoidably differ in Python:

* It probably makes more sense in Python to use classes, so method chaining instead of function piping. I wish one could syntactically skip enclosing parentheses in Python though, method chains look a bit verbose.

* Python doesn't have R's "non-standard evaluation", so you end up needing lambda functions for arguments in method chains and group-wise aggregation etc. I'd be interested if someone has a better solution.

* NumPy (and Pandas) is still missing a proper missing value (NA). It's a big pain to try to work around that.
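To show roughly what the first two points look like in practice, here is a minimal sketch in plain Python. The `Frame` class and its verbs are hypothetical and written only for this example; they are not the API of dataiter, Pandas, or any other library. Verbs are methods that return a new object so they can be chained, and lambdas stand in for the column expressions that R would capture with non-standard evaluation.

```python
from statistics import mean


class Frame:
    """Hypothetical minimal data frame: a list of row dicts with chainable verbs."""

    def __init__(self, rows):
        self.rows = list(rows)

    def filter(self, predicate):
        # The lambda receives one row at a time.
        return Frame(r for r in self.rows if predicate(r))

    def mutate(self, **columns):
        # Each keyword is a new column name mapped to a function of the row.
        return Frame(
            {**r, **{name: func(r) for name, func in columns.items()}}
            for r in self.rows
        )

    def group_by_summarise(self, key, **aggregates):
        # Each aggregate function receives the list of rows in one group.
        groups = {}
        for r in self.rows:
            groups.setdefault(r[key], []).append(r)
        return Frame(
            {key: value, **{name: func(members) for name, func in aggregates.items()}}
            for value, members in groups.items()
        )


data = Frame([
    {"species": "cat", "weight": 4.0},
    {"species": "cat", "weight": 6.0},
    {"species": "dog", "weight": 20.0},
    {"species": "dog", "weight": 2.0},
])

summary = (
    data
    .filter(lambda r: r["weight"] > 3)
    .mutate(weight_g=lambda r: r["weight"] * 1000)
    .group_by_summarise(
        "species",
        mean_g=lambda rows: mean(r["weight_g"] for r in rows),
    )
)
```

The outer parentheses around the chain and the repeated `lambda r: r[...]` are the verbosity described above. In dplyr the same pipeline would refer to `weight` and `species` as bare names.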

https://github.com/otsaloma/dataiter


>NumPy (and Pandas) is still missing a proper missing value (NA).

But if it's missing a missing value, doesn't that mean that it has a proper missing value?

I'll let myself out now...


You're clearly on the dplyr bandwagon, but as someone who wrote R code for about 10 years before dplyr came along, and saw the direction the language was going, it's the reason I now mainly use python. I just could not put up with the non-standard evaluation so everything ends up being a 100+ line script instead of composable functions, and breaking API changes every 6 months.


It's still very small, but Nim's dataframe library (Datamancer) has a dplyr-like API (and it is fast): https://github.com/SciNim/Datamancer

Being in Nim, it will be easy also to add sweet DSLs.


You don't need to write "import pandas; pandas.bla()", you can do "from pandas import *; anything_in_pandas()" if you want quick and dirty.


And if you want you and your team mates to hate you when they need to work on your code later, and you’ve got random, mystery functions all over the place.
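A quick way to see the "mystery functions" problem is to look at what a star import puts into a namespace. The sketch below uses the standard library's `os` module as a stand-in for pandas (an assumption made so the example runs without third-party packages) and runs the star import inside a scratch dictionary so the effect can be inspected.

```python
import builtins
import os

# Run the star import inside a scratch namespace so it can be inspected.
namespace = {}
exec("from os import *", namespace)

# Every public name the star import brought in.
public_names = [name for name in namespace if not name.startswith("_")]

# Names that now hide a Python builtin of the same name.
shadowed = sorted(name for name in public_names if hasattr(builtins, name))

# After the star import, a bare open() refers to os.open, not the builtin.
open_is_os_open = namespace["open"] is os.open
```

In this sketch `shadowed` includes `open`, which means a later call to `open(...)` in that namespace would go to `os.open` with a different signature. The same kind of silent name clash is what makes star imports hard to follow in shared code.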


> dplyr

Ths s lbrry whs nm nds mr vwls. F m tlkng t smn, hw m sppsd t prnc t?



