I am definitely inexperienced with R. The intention of this post was to highlight a key point of friction I encountered throughout my introduction to the language. Namely, I was introduced to R (by someone more senior than myself) as "data science in a runtime/IDE", but there are quite finite boundaries where the conveniences of R stop and other tools begin.
What motivated me to write this post was the lack of discussion of R in this use case. I actually stumbled into using "Rscript -e" one-liners while looking to do basic stats in a Linux CLI.
That being said, I still stand by my point. Taking data from "out in the wild" (log files, tarballs of images, unstructured text) and making use of it can be frustrating in R because cleaning up edge cases, removing unwanted data, and getting everything into the correct container/type often involves unintuitive chaining of function calls. This is coming from the perspective of someone who worked with Python/awk/sed prior to being exposed to R.
If you have good counterarguments, I'd be more than happy to hear them and address them in an end note in the post.
Have you ever used the tidyverse ecosystem? It's a set of syntactically compatible packages authored chiefly by one man, which have evolved into a superset of R, if not an outright R2. (R's Scheme lineage makes it very hackable in this way.)
I've been working in Python for my current project and I'm constantly longing for R's syntax, specifically for cleaning data. It's so much better.
Just last night, I wanted to perform an anti-join to find discrepancies between two data sets for debugging.
One line in R:
anti_join(a_tibble, another_tibble, by = c("id_col1", "id_col2"))
Witchcraft in Pandas (from Stack Overflow):
Method 1:
# Identify what values are in TableB and not in TableA
key_diff = set(TableB.Key).difference(TableA.Key)
where_diff = TableB.Key.isin(key_diff)
# Slice TableB accordingly and append to TableA
pd.concat([TableA, TableB[where_diff]], ignore_index=True)
Method 2:
rows = []
for i, row in TableB.iterrows():
    if row.Key not in TableA.Key.values:
        rows.append(row)
pd.concat([TableA.T] + rows, axis=1).T
I actually fired up R and re-imported the .csv data just for this. Took 15 secs while my colleague was still stuck debugging his own weird for loops.
pandas is such a shitshow. Every time I use it, I'm in a world of pain googling the finicky syntax for selecting columns, aggregating, and filtering. I've never touched R, but pandas is so terrible for me. Nowadays it's either raw numpy arrays, plain SQL, or PySpark...
Great example! My first thought is that anti-join could be the basis of a "csv_diff $1 $2" shell function.
I have a hunch that there could be a really good follow-up post to this that takes these R hacks to the next level by extending them to work better with pre-structured text (where R really shines) and CSV files as arguments.
Nothing beats building stuff in software that just works with a simple small interface while being powerful!
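To make the "csv_diff $1 $2" idea concrete, here is a rough sketch of what such a tool could look like. It is written in Python with only the standard library rather than R, and everything in it is an assumption for illustration: the function name csv_diff, the script name in the usage comment, and the convention that key columns are passed as trailing arguments. It returns the rows of the first file whose key columns have no match in the second, which is the anti-join behaviour described above.

```python
import csv
import sys


def csv_diff(path_a, path_b, keys):
    """Return rows of file A whose key columns have no match in file B.

    csv_diff is a hypothetical helper name, not part of any library.
    """
    # Collect the key tuples present in B (only the keys are held in memory).
    with open(path_b, newline="") as fb:
        seen = {tuple(row[k] for k in keys) for row in csv.DictReader(fb)}
    # Keep the rows of A whose key tuple never appears in B.
    with open(path_a, newline="") as fa:
        return [row for row in csv.DictReader(fa)
                if tuple(row[k] for k in keys) not in seen]


if __name__ == "__main__" and len(sys.argv) >= 4:
    # Hypothetical usage: python csv_diff.py a.csv b.csv id_col1 id_col2
    rows = csv_diff(sys.argv[1], sys.argv[2], sys.argv[3:])
    if rows:
        writer = csv.DictWriter(sys.stdout, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
```

All values are compared as strings here, so "1" and "1.0" would count as different keys; a real tool would probably want some type handling.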
May I ask what you use R for? I like learning languages for fun and I've been meaning to do some NBA analytics stuff. I'd love to have a REPL style interface to just do one-off math and analytics or short scripts. I haven't dug into the data science stuff yet but I'm uninterested in Python for some reason (maybe because I used to write Ruby for a living).
I started using R because I needed a better tool for formal statistical analysis (econometrics; I didn't want to pay for Stata, and R has much better packages than Python for variants of linear regression and panel/time-series data). Since then, I've used it for some random scripts, data visualizations, and financial analysis (Josh Ulrich's packages + tidyquant).
R is a thoroughbred at doing data analysis from your laptop. It's bad at living on a server and operating any sort of app.
This is completely different functionality than the parent comment's code. It's only joining on one "column", not two, and it only returns the values in that "column", while the parent's code returns complete rows from the dataframe.
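For what it's worth, a pandas version that does match the R call (two key columns, complete rows returned) can be sketched with merge and its indicator flag. The helper name anti_join and the sample frames a and b are made up for illustration; pandas itself has no function by that name.

```python
import pandas as pd


def anti_join(left, right, by):
    """Rows of `left` with no match in `right` on the `by` columns.

    Hypothetical helper; it only wraps DataFrame.merge.
    """
    # Use only the deduplicated key columns of `right`, so the merge
    # cannot multiply rows of `left` or pull in extra columns.
    right_keys = right[by].drop_duplicates()
    # indicator=True adds a "_merge" column saying where each row came from.
    merged = left.merge(right_keys, on=by, how="left", indicator=True)
    return merged.loc[merged["_merge"] == "left_only"].drop(columns="_merge")


# Made-up sample data with the same key column names as the R example.
a = pd.DataFrame({"id_col1": [1, 2, 3],
                  "id_col2": ["x", "y", "z"],
                  "value": [10, 20, 30]})
b = pd.DataFrame({"id_col1": [1, 3],
                  "id_col2": ["x", "q"],
                  "other": [0.1, 0.2]})

result = anti_join(a, b, ["id_col1", "id_col2"])
print(result)
```

Only the (1, "x") row of a has a partner in b, so the other two rows come back with all of a's columns intact.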
I’m coming at awk & sed from the other side having used R daily for a few years.
R is not made for cleaning up weird text files. Yes, of course it can be done, but that’s like the joke that everything is within walking distance if you have enough time. I recently had to use R to fix a 50 GB CSV where 10 of the columns were long JSON strings, and I needed to turn that into a data frame. That experience alone made me buy a book on awk and sed.
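As a rough illustration of the streaming approach that kind of file calls for, here is a minimal Python sketch (standard library only) that reads a CSV row by row and expands JSON-string columns into flat, prefixed columns. The function names, the "column.key" naming scheme, and the handling of unparseable cells are all assumptions for the example, not a description of what was actually done.

```python
import csv
import json


def flatten_json_columns(rows, json_columns):
    """Yield rows with each JSON column expanded into prefixed columns.

    `rows` is any iterable of dicts, e.g. a csv.DictReader, so the
    whole file never has to sit in memory.
    """
    for row in rows:
        flat = {k: v for k, v in row.items() if k not in json_columns}
        for col in json_columns:
            try:
                payload = json.loads(row[col]) if row[col] else {}
            except json.JSONDecodeError:
                # Keep unparseable text around for later inspection.
                payload = {"_raw": row[col]}
            if not isinstance(payload, dict):
                # Bare JSON arrays or scalars get a single column.
                payload = {"_value": payload}
            for key, value in payload.items():
                flat[f"{col}.{key}"] = value
        yield flat


def stream_file(path, json_columns):
    """Convenience wrapper: stream flattened rows from a CSV on disk."""
    with open(path, newline="") as f:
        yield from flatten_json_columns(csv.DictReader(f), json_columns)
```

Nested objects are left as Python dicts under their top-level key; flattening those further would depend on the data.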
I should have fleshed out my comment for sure; I was in a bit of a rush!
First off, I found it interesting to frame R as being presented as an improvement over Python. It should be the other way around: Python for DS came second, and was supposed to improve on R.
Anyway, I'm not sure I can even begin to discuss your use cases, not having that much relevant experience there. I use R (and Python) for general analytics tasks and for building production models. In these more traditional DS environments, I strongly believe R is far superior to Python for data munging and visualization. When I say this, I am comparing data.table (and, to a lesser extent, tidyverse) to pandas. I don't even want to get started on everything I hate about pandas. So while we are both "cleaning data", you seem to be talking about a stage before someone like me would even be looking.
The use cases I'm focusing on in the article are definitely less "production-friendly" than what you describe. This article definitely caters more to showing R as a CLI tool than R as an ecosystem.
My language must have been ambiguous in the article. I intended to frame R as arriving second because that was my personal experience. My writing philosophy for blog posts is that my personal opinions and experiences should stand out from the technical detail. My reasoning is that even those who disagree with the personal content will be able to discern for themselves the value of the post.
> I used to resent R, it was shoved upon me as a strange tool that promised to replace Python, but failed miserably.
Well, I'm glad the author found a way to make R work for their purposes, but this just reeks of inexperience with R.