Sometimes I find it strange - in both good and bad ways - that in 2020 we are still learning tools and languages designed and built in the 80s, under the models and constraints of that time, with 40 years of layers of backwards compatibility, and sometimes reaching all the way back to the 70s
I am still learning tools designed around the constraints of teleprinters
Sure, it’s the same on the Windows side (and on the macOS side, with classic OS compatibility layers still present, like all the HFS stuff). Not bashing bash here.
Surely our computers have very different models of operation than a PDP-11, yet we sometimes pretend they don’t
Author here. Quick sidenote: this is chapter 2 of a book aiming at teaching the terminal and related tooling (make, jq, regular expressions, etc) to beginners and people trying to become developers.
I've been on the job for 10 years and I still use these tools daily. For example, I sent a PR yesterday that added a configuration entry to 150+ config files by using find, grep and sed.
I'm not pretending these are the only tools that exist, but darn are they handy sometimes, and good to have in your toolbelt.
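For anyone curious what that kind of change looks like, here's a toy version of the find/grep/sed pattern (the file names and the setting are made up for illustration, and the `-i` flag is GNU sed's; BSD sed wants `-i ''`):

```shell
# Toy repo: two config files, only one of which mentions "service:".
mkdir -p config
printf 'service: api\ntimeout: 10\n' > config/a.yml
printf 'host: db\n' > config/b.yml

# find lists candidates, grep -l keeps only the matching files,
# and sed -i edits each of them in place.
find config -name '*.yml' \
    | xargs grep -l 'service:' \
    | xargs sed -i 's/timeout: 10/timeout: 30/'

cat config/a.yml   # timeout is now 30; config/b.yml is untouched
```

The same three-stage shape (select files, filter files, edit files) scales from 2 files to 150+ without changes.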
The English language is several hundred years old, and DNA is hundreds of millions of years old. Everything we are based on is legacy; I don't see why it should be different in computing.
In fairness to the OP, there's a lot of cruft in the way terminals work that really isn't necessary any more but needs to be there for backwards compatibility. Such as:
- formatting being in-lined via ANSI escape sequences,
- a massive disparity between which escape sequences different terminal emulators support,
- control codes being part of the same character set as printable characters,
- changing the behaviour of the TTY requires either terminal emulator support or OS support, depending on the behaviour you need, because TTYs are defined partly by kernel drivers (which require syscalls to alter) and partly by escape sequences,
- and in the case of kernel behaviour, those syscalls vary from one OS to another. Some OSes don't even support TTY behaviours that others do, so you can't guarantee your logic is cross-platform just by wrapping the OS-specific differences in syscalls,
- resizing terminal UIs can be a nightmare -- often requiring catching signals (SIGWINCH) and redrawing -- because there's no native layout system for drawing to the TTY.
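To make the first wart concrete: colour is just extra bytes interleaved with the text, so every tool downstream of a colourising program sees (and must cope with) those bytes. A minimal demonstration:

```shell
# "ERROR" in red: the styling travels in-band as ESC [ 31 m ... ESC [ 0 m.
printf '\033[31mERROR\033[0m\n'

# Downstream tools see the escape bytes too: 5 visible characters, 14 bytes.
printf '\033[31mERROR\033[0m' | wc -c

# Kernel-side TTY state, by contrast, is changed via syscalls (stty wraps
# the termios ioctls), not via anything in the byte stream:
stty -a 2>/dev/null | head -n 1 || true
```

This is why well-behaved tools like grep and ls only emit colour when stdout is a terminal, and drop it when piped.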
This isn't meant as a criticism, though, because there's a lot the general design of terminals gets right (eg the kernel TTY driver is what lets us kill processes over remote shells like SSH and mosh). But I think terminals are one of those things that work "good enough" that most of the ugliness is hidden from everyday users. However, if we were to redesign UNIX terminals from the ground up, there are a lot of things most engineers would like to change and a lot of places where things could be improved. Like having an out-of-band channel for sending metadata that describes the pipeline but shouldn't be mixed in with the byte stream.
... and pretty much all of those were addressed, outwith Unix, by the evolution of the 1960s terminal I/O model into the console I/O model during the 1980s.
You even forgot to mention one of the things that was addressed: input. Terminal input, done properly, requires a full ECMA-48 decoder state machine, with bodges to accommodate non-conformant warts from the Linux KVT, SCO console, and RXVT. This is all too often not done properly, because people do not realize that there is ECMA-48 in both directions; and it is a mess even looking at function keys alone, before accounting for keypad application/normal modes and a mouse/locator. Console I/O, by contrast, evolved into uniform input event records for HIDs that did not require state machines to decode.
Note that the lack of a layout system is only applicable to character-mode terminals. Block-mode terminals are a quite different kettle of fish.
> ... and pretty much all of those were addressed, [outside?] Unix, by the evolution of the 1960s terminal I/O model into the console I/O model during the 1980s.
> Note that the lack of a layout system is only applicable to character-mode terminals. Block-mode terminals are a quite different kettle of fish.
Indeed but the point isn't "are these solvable problems?" but rather "why are we still using archaic tech?"
Designing a solution to those problems is actually the easy part. It is shifting the ecosystem away from TTYs that's hard.
> You even forgot to mention
It wasn't intended as an exhaustive list :) There's plenty more issues I hadn't raised.
---
In an ideal world I'd love to see UNIX terminals reinvented. The reality is things are "good enough" for most people that they simply don't notice most of the issues and migrating to the next evolution of UNIX terminals would mean a break in backwards compatibility which will be more disruptive (initially in a negative way) than making do with the warts we currently have.
Check out Microsoft's new Windows Terminal (https://www.hanselman.com/blog/ItsTimeForYouToInstallWindows... ). It may be the most modern, capable, and yet compatible terminal I've come across, and it works well regardless of the environment you want to run it in or with. (To be fair, a modern Windows terminal is a good 20 years overdue, but MS deserves credit for finally getting it right...)
https://github.com/jdebp/terminal-tests/blob/master/PowerShe... is arguable, and depends on whether one sides with the old actual DEC VTs or with the ECMA-48:1986 standard, which the DEC VTs didn't keep up with. There are at least two terminal emulators nowadays that side with the newer 1986 semantics.
It really isn't anywhere near the most capable terminal emulator right now. This is acknowledged by its developers, and they even have long to-do lists of missing stuff, both as compared to a DEC VT and compared to the likes of XTerm.
If it's backwards compatible with TTYs then it probably hasn't solved any of the problems I mentioned and there's already a plethora of nice terminal emulators out there (which is why I say most of the "ugliness" is hidden from everyday users).
It should also be noted that Windows makes a few mistakes when it comes to terminal design too:
- For starters cmd.exe is both a shell and terminal emulator and it's impossible to separate the two.
- To compound things, many common commands like dir, rm, etc are shell builtins (this will be a throwback to DOS). So you cannot even use an alternative shell on Windows without having to either invoke cmd.exe or rewrite existing utilities.
- And if that wasn't bad enough, cmd.exe builtins do not read from STDIN. They instead use DOS syscalls to read keyboard input.
- As you've probably guessed, cmd.exe isn't the only culprit that does this. Any "Windows" command line program designed for, or which uses, DOS APIs will not follow the standard streams idiom. Any command line software written for NT, however, will. So you end up having to write all sorts of really nasty hacks just to get the command line working on Windows (far, far nastier than any of the hacks that happen on UNIX/Linux).
- Then you have Powershell, which is an entirely separate command line in its own right and largely - though not completely - incompatible with cmd.exe.
- And WSL, which is also incompatible with cmd.exe and Powershell.
At least on UNIX/Linux, you have one terminal methodology. From there you can use whichever terminal emulator you want, whichever shell you want, whichever programming language you want to write CLI tools in, and/or whichever CLI tools you want to download. Whereas on Windows you have four competing standards which don't cooperate well.
It behooves one not to make egregious mistakes when talking about the mistakes supposedly made by Windows. cmd.exe is not a terminal emulator at all, and does not make DOS system calls (it being a Win32 program) at all.
And the fact that CMD and PowerShell provide different interpreted languages is no different to the Korn shell, tclsh, and Perl providing different languages.
> It behooves one not to make egregious mistakes when talking about the mistakes
Admittedly it's been a little while since I last played around with custom shells and terminals on Windows, and I had also rushed my post, so you're right that some details were wrong. But you're just as far off with your corrections as the points you were criticising, so you're really not in a position to be making platitudes about egregious mistakes.
I didn't say cmd.exe was a terminal emulator; I said it was multiple layers, including the terminal emulator, but not exclusively the terminal. OK, technically it's conhost.exe that provides the terminal emulation; I'd lazily lumped that together with cmd.exe because cmd.exe depends on conhost.exe when not run headless. The former requires the latter, so you can't just drop cmd.exe into another terminal emulator and run it (other people have tried, and there are extensive blog posts about the hacks they've had to do to get it to work, like running conhost.exe off screen).
If you want to be 100% technically accurate, then cmd.exe is actually not like any of those things we've described. It's certainly not equivalent to Korn or other language REPLs as you stated; in fact, the "language" part of cmd.exe is barely a macro language (again, due to its DOS heritage). Plus, shells orchestrate with byte streams, whereas NT's streams work very differently, and cmd.exe doesn't even behave correctly when used as a CLI tool (which I'll get into below).
You're right that cmd.exe itself doesn't make DOS syscalls; as I said above, I was rushing my post, which caused me to conflate two points. What I really meant to say was:
1. cmd.exe builtins read from the NT console API's stdin. Which means you cannot fork out to cmd.exe as a CLI command because any prompts ("Are you sure you wish to delete" type things) just whiz straight past without pausing for input. This was highly annoying when I was developing my alternative Windows shell and wanted to make use of rm, copy, etc rather than having to write those commands all over again.
2. Windows, and cmd.exe by extension, supports running other console applications which don't use NT's console streams because they favour some of the other hacks used in the DOS days. This means those applications also don't work with alternative shells let alone alternative terminal emulators.
Also, I think it's disingenuous to cite your own blog post as a source. I could link you to the GitHub repository where I've had to put in numerous workarounds to get the shell I've written to work with Windows. But instead I'll link to something a little more recognised:
(I did have a hunt around for the blog posts from other developers building console solutions for Windows and the similar problems they've run into, but since it was around 5 years ago that I gave up first-party Windows support, those blogs are now lost in the mists of the ether.)
You've got it backwards. This is the reality where UNIX (et al.) terminals were reinvented, throughout the 1980s as I said. This is the real problem, people going around talking of making a "better terminal", and when pressed talking about things that the world actually did make better, decades ago, and treating as speculative I/O paradigms that demonstrably exist today.
I know full well terminals have been "reinvented" but, as I've already said to you in the very post you're replying to, solving these problems isn't the hard part. It's getting people to migrate to the solution. The simple fact is that neither UNIX nor Linux adopts any of those designs you've been discussing, and this conversation is specifically about POSIX command lines. So you can argue all you like about how x does y better, but it doesn't mean shit if nobody can actually use x.
I've been doing this for 30+ years. I've written my own terminal emulators and UNIX shells. You're not the only nerd on here so take a step back and listen to the points people are making before assuming they need to be re-educated :)
> Platforms with "consoles". These platforms provide a concept of a "console" to applications programs. Consoles support direct screen addressing, to the level of character cells at least, and are accessed through an API that is a first-class part of the overall system API.
> Platforms with "terminals". These platforms provide a concept of a "terminal" to applications programs. Terminals are not directly addressed, but are communicated with via byte stream communications protocols, involving control characters and control character sequences.
In short, you're deliberately being non-specific. I'd claim that a terminal system plus ncurses is a flexible and reasonably efficient console which is portable between many different systems, on the hosting end and on the client end. I could claim that the IBM block-mode "terminals" are consoles if the system is taken as a whole, but such things are markedly less flexible than what you can accomplish with ncurses, albeit more machine-efficient on the host side.
As far as first-class APIs go, you can argue with others about the kernel-vs-OS distinction, and the irrelevance of unbundling in the Open Source world. In short, shipping with ncurses is no more "odd" than shipping with Gtk or, say, a web browser.
Actually, I'm the one using the specific terminology (from Win32) here. A console is a specific I/O abstraction, widely known as such. Whereas "I'd claim that a terminal system plus ncurses is a console" is your own idiosyncratic nonce definition, in contrast.
Console I/O has a demonstrable evolution over the course of the 1980s, as I have already explained several times, a lot of which was to address the shortcomings of the 1960s terminal I/O model.
And English is terrible as a language! We’ve been making fun of how bad it is for centuries.
Compare languages like Turkish or Korean, which are still natural languages but got a well-thought-out tune-up in the not-so-distant past.
Why can’t my language have a pluralization rule (for example) that’s so simple and regular it takes 1 minute to teach, and 3 minutes to master? Or an alphabet that looks like how it is pronounced, so we don’t have to waste hours each week as children memorizing thousands of special cases? This is absurd.
It could! Given enough international cooperation, we could all adopt a lightly modified version of English that's more regular and easier to teach and learn.
And then, over the coming years, it would absorb words from other languages with different rules, and evolve according to what people find easy or convenient (or just at random), and in a little while it would be irregular once again.
As you said, Turkish and Korean were changed recently. Give them time and some of that regularity will get chipped away.
(Also, of course: make substantial changes to how the language works and suddenly no one can comfortably read any of the vast quantities of existing writing unless it's translated. And some of that existing writing is really good.)
> I am still learning tools designed around the constraints of teleprinters
We are still reading books using a 2k year old alphabet to represent ideas. Not sure why it would be surprising that text manipulation is still the norm pretty much in everything we do, including computing.
If we're going to update, I think we should stop using C for /s/ and /k/ noises (we already have characters for those). It can exclusively represent /tʃ/, as in "China" would become "Cina". We don't need X either for much the same reason, which is all the better because we'll need a character for /ʃ/. "Delicious" will become "delixious", "shadow" will become "xadow", "nation" will become "naxion".
I know this sounds diffikult but migraxion to a new system is always diffikult. We're talking about a pretty serious rewrite here so everyone's kooperation will be nesessary. It's not an easy desixion but if you cek into it we can use other languages as a model. Spelling reforms are not a new konsept.
Turkish did this right, IMHO, when it adopted a Latin alphabet. No digraphs at all: one letter, one sound. The “sh” sound is spelled ş, and “ch” is ç.
You also need to add more vowels, as there are way more than 5 vowel sounds, but that’s not hard. Umlauts are already common in many languages.
If you think of UNIX tools the way a chef thinks about kitchen tools (like chefs’ knives and cutting boards, or mixing bowls and frying pans), you won’t be as surprised to be using stuff that dates back decades. Good design is built to last and to enhance the skills of an experienced craftsperson. I wrote a little on this here:
It’s because fundamentally, when you get right down to it, we’re manipulating text files.
Text is a very dense way to represent logic and ideas.
This is not like legacy software that you can rewrite.
You can add a GUI based on ideas from the late 70s/80s if you like but you’re still unlikely to come up with a more succinct way to represent logic than can be held in a text file.
So it follows that small tools that deal directly with manipulating text will be useful as long as text is useful.
I'd say that fundamentally we are manipulating streams or arrays of numbers. Everything is built on top of that. Each character in a text is a number. Each pixel is a number. Each sample of sound is a number. For computers, how the number ends up being transferred to the puny meatbag is completely incidental.
In more practical terms, at least for me, the vast majority of the stuff I mangle through the shell is not really text but structured data, typically some sort of tree or a table, or something in between. Sometimes the structure is ad hoc, sometimes it is very rigidly defined, but it's still there.
Why do you find this strange? There’s a pattern across all fields: different paradigms compete early on, then the best paradigm wins and we build on it incrementally for years (forever?) after. Every once in a while a new workflow emerges based on increases in hardware capabilities: Ableton Live, Adobe Lightroom, and Figma are all examples of this. Notably, these new tools usually split the market rather than replace the previous version: Logic, Photoshop, and Sketch are still popular.
Sure it's hard for us, but what about creatures that have 28 eyes and 14 brains? In other words, we also have more CPU cores now. I'm not sure I'm keeping all my CPU cores happy.
And no I don't think it's just a problem for the OS. Although, that's probably a popular idea.
I suggest learning from history. The doing better has already happened. The 1960s terminal I/O model evolved into the console I/O model during the 1980s; from firmware-mediated access to CGA, through the VIO+KBD+MOU subsystems of OS/2, to Windows NT console objects.
If you actually read the whole discussion to this point, you'll see that the discussion started with being constrained to the models of teleprinters, not text processing.
Structured data and tools that operate on that structure. Maybe steal a thing or two from LISP machines. I remember the one time I had to work on an AS/400 (iirc) how interesting it was that one of the fundamental building blocks were records instead of dumb files. PowerShell has interesting ideas too, but it'd need to be far more built-in and integrated to the OS to truly shine.
The beauty of it is that these old tools play extremely nicely with their modern counterparts. ag, fzf, fd and tmux are key tools I use to aid my work on a daily basis alongside many of the classics listed in this article.
A little suggestion for the authors: they mention xargs; I think [GNU parallel](https://www.gnu.org/software/parallel/) might warrant a mention too, since it is a kind of modern successor that can use many computers to run tasks.
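For contrast, here's a minimal sketch of the two (gzip on log files is just an illustrative workload, and parallel isn't installed everywhere):

```shell
# xargs packs stdin lines into argument lists for as few invocations as possible:
printf 'a.log\nb.log\nc.log\n' | xargs echo gzip
# prints: gzip a.log b.log c.log

# GNU parallel runs one job per line, spread across CPU cores (or, with
# --sshlogin, across machines):
#   printf 'a.log\nb.log\nc.log\n' | parallel gzip {}
```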
Totally agree. I always use Perl instead of sed because sed doesn't support PCRE, and it's far easier to write something like '\d+' than '[[:digit:]]\{1,\}' or whatever the sed equivalent is.
I've always viewed `awk '!a[$0]++'` as superior to `sort | uniq` because it preserves order and does not have to sort the data first before deduplicating. But `sort | uniq` is much easier to remember.
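Side by side: `a[$0]++` is 0 (falsy) the first time a line is seen, so `!a[$0]++` prints each line exactly once, in first-seen order:

```shell
printf 'b\na\nb\na\n' | awk '!a[$0]++'   # prints: b, then a (input order kept)
printf 'b\na\nb\na\n' | sort | uniq      # prints: a, then b (sorted first)
```

(`sort -u` is a shorter spelling of `sort | uniq`, with the same ordering caveat.)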
I would generalize that further and say that shell processing works well when your data is well-behaved wrt the thing you are trying to do. But once you start venturing into untrusted data, escaping, and yes, Unicode, things can become hairy. So basically you should always do some sanity checks when working in the shell.
"Unicode with various ways to represent é" is a shit show to parse using shell tools. e.g. Try scraping Spanish language Twitter feeds. When I have done this kind of work, I made a tool to canonicalize glyphs and had to put it between every step of a pipeline.
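The problem in miniature: the same glyph can arrive as different byte sequences, and the classic tools compare bytes, not glyphs. A sketch of the canonicalization step, with python3 standing in for a dedicated normalizer:

```shell
# "é" as NFD (e + combining acute; octal escapes for bytes 0xCC 0x81): 3 bytes.
printf 'e\314\201' | wc -c
# "é" as NFC (precomposed; bytes 0xC3 0xA9): 2 bytes.
# To grep, these are two different strings even though they render identically.
printf '\303\251' | wc -c

# Normalize to NFC before handing data to byte-oriented tools:
printf 'e\314\201' | python3 -c 'import sys, unicodedata
raw = sys.stdin.buffer.read().decode("utf-8")
sys.stdout.buffer.write(unicodedata.normalize("NFC", raw).encode("utf-8"))' | wc -c
# now 2 bytes: byte-for-byte identical to the precomposed form
```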
I'll preface my question by saying that I'm not a dev.
Why do this kind of work in the shell? Isn't it better to do this in a programming language that can run on all operating systems? What are Windows users supposed to do?
It can be quicker and easier to write. Also, individually these tools can outperform any code you write by hand. Shells are also common on a far wider range of operating systems than typical programming languages like Python et al. Additionally, if you want to improve your productivity you may create your own shortcuts (e.g. "build_myproject", which understands what building your project entails and may involve some amount of text processing among other things). It's typically far more convenient (shorter, simpler and generally easier to understand) to invoke other programs from shell languages, since that's what their programming interface is optimized around.
Sometimes it's good to even wrap the entrypoint for common scripting languages like Python in shell so that, for example, you can setup a virtual environment to run out of or use the proper version of Python.
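A minimal sketch of such a wrapper (the `.venv` path is a common convention rather than a standard, so adjust it to your project layout):

```shell
#!/bin/sh
# run-py: use the project's virtualenv interpreter if one exists,
# otherwise fall back to the system python3.
PY="python3"
if [ -x ".venv/bin/python" ]; then
    PY=".venv/bin/python"
fi
exec "$PY" "$@"
```

Callers then invoke `./run-py script.py` without caring which interpreter ends up running.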
> Isn't it better to do this in a programming language that can run on all operating systems?
Bash & coreutils have been ported to every operating system under the sun (including Sun operating systems). They're even more common & available than any other programming language that doesn't require a compiler (e.g. `adb shell` will get you into an environment where you can grep & do these operations even though there's generally no python or other scripting language available).
Let me conclude by saying that you shouldn't really write anything complex or maintained by multiple people in shell if you can avoid it, and if you can make guarantees about (for example) the availability of a Python interpreter. That doesn't mean shell scripts aren't a valuable and important part of the development ecosystem.
Well, there are ways to use such commands on Windows too, like WSL, git-bash, cygwin, etc. And there's PowerShell on Windows (though I haven't used it and am not sure about its capabilities).
As to why use shell, that depends on your use case and working environment. Shell is something like an IDE [0] where you can solve multiple tasks from single environment. You don't have to use multiple programs (window manager, text editor, IDE, etc). Since it is all text, you can save and repeat a command, share it with others, edit a previously written command, etc. This is quite different from a GUI based workflow. Personally, I find using command line more productive, but as mentioned earlier, it'll depend on the task at hand.
Speed. Portability. Muscle memory. I've spent ten years troubleshooting UNIX applications, so most of these commands are fairly well-ingrained into my mode of thinking when I have data that I've got to parse.
To boot, these shell utilities were written by people way smarter than me. I have far more confidence that they will handle edge cases in the data stream infinitely better than whatever dinky little Python script I might try to hash out.
As a dev: I don't know. It's a well written article and this stuff can be handy in a pinch, but I've yet to see many real world scenarios where a complicated shell script is a good idea. Most of these examples, I would probably rather write a five line python script to ingest the data into sqlite and then use actual queries.
The use case is to quickly handle ad hoc scenarios.
If you need to quickly extract something from a csv, you could break out python, or import it into a database, but using cut and grep (or csv-tools) will take 5 seconds.
The point is if you need to do a specific task many times, do it in a programming language. But if you have an ad hoc task, you're saving a lot of time by being proficient in the shell.
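For instance, pulling the IDs of failed rows out of a simple CSV takes one line (fine for well-behaved data; quoted fields containing commas need a real CSV parser):

```shell
# A throwaway sample file standing in for a real export:
f=$(mktemp)
printf 'id,status\n41,ok\n42,error\n' > "$f"

# Filter the rows, then take the first comma-separated field:
grep 'error' "$f" | cut -d, -f1
# prints: 42

rm -f "$f"
```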
Ad hoc tasks quickly become regular tasks though. If your "database" is csv files you're better off figuring out a better way to structure it.
The way things are moving with containers, the idea that you're even going to have these utilities on the server, and that the server is writing this stuff to a filesystem -- that's totally changing. So is this useful for local stuff? Maybe, but isn't Excel probably more useful there?
Text processing is scraping. Modern shells have structured output from their stdlib, allow you to pipe to 'where' and 'select', and can read JSON, yaml, etc natively.
I was just missing a small note akin to "which becomes all the more powerful by combining these commands", followed by a totally readable example such as this one
I love to cook 5-course meals for friends and family.
But sometimes, when I'm all alone by myself, I slice up a bun and slap meat on it. It's just as fast and tastes just as good as it did in the '80s.
Not OP, but I imagine he's bought into the Hadoop and other big data stories. Most data processing of large data sets can probably be done with standard UNIX tools.
There is nothing new under the sun, it's all just rebranded.
Shameless plug, but this is what I wrote murex for. Murex is a "UNIX" shell that keeps enough similarities with POSIX that you can use it like a Bash REPL with (hopefully) minimal disruption, but it breaks from POSIX compatibility where it makes sense. One significant area where it does break compatibility is how it handles structured data formats like JSON, S-Expressions, CSVs, and other tabulated data (to name a few).
It's designed from the ground up to support object manipulation while still retaining compatibility with the UNIX pipeline.
I do this by building a suite of builtin tools that are aware of structured data files (primarily because that information is passed down the pipeline as a data-type) but it still breaks into normal pipeline when forking an external executable.
I hate jq with the power of a million suns. The whole point of text through pipelines is that data can be processed by tools that do not understand it. Formatting it in xml or json (there's no difference) breaks this beautiful orthogonality, and forces all the intermediate tools to deal with whatever the stupid markup du jour happens to be.
What the world needs is the inverse program of "jc", where an unparseable json string is expanded into a flat list of lines all of the form "field.subfield=value"
ok well I can understand not wanting json when all you need is something simpler, but I'm not sure if I understand unparseable - I mean if it is JSON then it is parseable.
You can only parse JSON easily by using JSON libraries. Plain text, or "field=value" pairs, you can easily cut(1), grep(1), or sed(1) to your pleasure. This is what I mean by parseable: parseable trivially by tools that do not understand the format. I can also sed and awk JSON files, and I do, but it is extremely painful; and more often than not these files are nothing more than simple lists of variables with values, for which the use of JSON is ridiculous overkill.
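Tools like gron do exactly this flattening. A sketch of the idea, with python3 standing in for the flattener (since the whole point is not to hand-roll JSON parsing in sed):

```shell
echo '{"a":{"b":1},"c":[2,3]}' | python3 -c '
import json, sys

# Recursively walk the JSON value, printing one "path=value" line per leaf.
def walk(prefix, v):
    if isinstance(v, dict):
        for k, x in v.items():
            walk(prefix + [k], x)
    elif isinstance(v, list):
        for i, x in enumerate(v):
            walk(prefix + [str(i)], x)
    else:
        print(".".join(prefix) + "=" + json.dumps(v))

walk([], json.load(sys.stdin))
'
# prints:
#   a.b=1
#   c.0=2
#   c.1=3
```

Once it's in that form, grep, cut, and sed work on it again (e.g. `| grep '^c\.'`).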
Groovy is particularly nice I find. Completely cross platform and supports command line args similar to Perl to enable inline / pipe style processing easily, but if you are using Java in your back end you can throw any of your business logic in there too and use those.
The GNU tools are super fast and do the job. I've seen people spin up their own solutions which end up being super slow and arguably take more time to develop.
Another great thing about the GNU tools is that they will stay around for your whole career and are usually installed by default on every system.
Modern programming language features are overrated. I prefer to choose the tools that are the most portable and involve the least social friction (i.e. something that is widely understood by people I work with).
For "fun" projects (stuff I'm not paid for) and workflow optimization, I just care about portability, which means C (with heavy use of the C stream library, a simple collections library of about 200 LOC, and occasionally POSIX syscalls) and shell scripting. After spending a lot of time learning languages as a hobby, I just don't believe the dark corners and warts of C and shell scripting are any worse than other languages.