> Perl is remarkably good for slicing, dicing, twisting, wringing, smoothing, summarizing and otherwise mangling text. … Perl's powerful regular expression matching and string manipulation operators simplify this job in a way unequalled by any other modern language.
Indeed! I still think in terms of PCRE (Perl-compatible regular expressions), and I love that Perl makes regexes first-class citizens.
> Although the biological sciences do involve a good deal of numeric analysis now, most of the primary data is still text: clone names, annotations, comments, bibliographic references. Even DNA sequences are textlike. Interconverting incompatible data formats is a matter of text mangling combined with some creative guesswork.
This is still true! One common format (arguably the most common format) for sending around bits of sequenced DNA is the FASTQ format (https://en.wikipedia.org/wiki/FASTQ_format). FASTQ files are (ASCII) plain text, making them really easy to parse. Of course one byte per letter of DNA is wasteful, so FASTQ files are commonly exchanged GZIP-compressed, with the .fastq.gz extension. Many platforms & tools read in or write out .fastq.gz automatically, saving you the (de)compression step.
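They really are easy to parse — each FASTQ record is just four lines (`@name`, sequence, `+`, quality). A minimal sketch of a `.fastq.gz` reader using only the Python standard library (the function name is mine, and real-world FASTQ has corner cases this ignores):

```python
import gzip
from itertools import islice

def read_fastq(path):
    """Yield (name, sequence, quality) tuples from a FASTQ or FASTQ.GZ file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as fh:
        while True:
            chunk = list(islice(fh, 4))  # one record = 4 lines
            if not chunk:
                break
            name, seq, _plus, qual = (line.rstrip("\n") for line in chunk)
            yield name[1:], seq, qual  # drop the leading '@'
```

The gzip handling is exactly the "(de)compression step" mentioned above: `gzip.open(..., "rt")` makes a `.fastq.gz` look like plain text.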
I went down a rabbit hole looking at how the original SARS-CoV-2 sequence out of Wuhan was distributed. Pretty amazing to go back to January 2020 and see the conversations unrolling out in the open.[0] Also, here's the original sequence in FASTA format.[1] It's incredible to think that you can just email these files around; they're just text, meant to be parsed by Perl!
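FASTA is even simpler than FASTQ: a `>` header line followed by wrapped sequence lines. A toy parser (Python rather than Perl, and deliberately minimal — header conventions vary in the wild):

```python
def read_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, parts = None, []
    for line in text.splitlines():
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(parts)
            header, parts = line[1:].strip(), []
        elif line.strip():
            parts.append(line.strip())  # sequence may be wrapped over many lines
    if header is not None:
        yield header, "".join(parts)
```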
I think at this point something as expressive as PCRE should be table stakes for any language aspiring to be used for text processing. It's so successful that GNU grep added support:
-P, --perl-regexp
Interpret PATTERNS as Perl-compatible regular expressions (PCREs). This option is experimental when combined with the -z (--null-data) option, and grep -P may warn of unimplemented features.
PCRE patterns are inherently unmaintainable; they're not (naturally) compositional or testable. They also cause unpleasant surprises due to accidental Turing completeness (indeed that's one of the more common causes of production outages that I've seen). IMO it's past time for newer languages to offer better alternatives, e.g. really good parser combinator support in the standard library.
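To illustrate the compositionality point: with combinators, a parser is an ordinary value you can unit-test in isolation and then combine. A minimal sketch in Python (not a real library; names and conventions here are made up — a parser is a function from text to `(value, rest)` or `None` on failure):

```python
import re

def regex(pattern):
    """Lift a regex into a parser that matches at the start of the input."""
    rx = re.compile(pattern)
    def parse(text):
        m = rx.match(text)
        return (m.group(), text[m.end():]) if m else None
    return parse

def seq(*parsers):
    """Run parsers one after another, collecting their results."""
    def parse(text):
        values = []
        for p in parsers:
            r = p(text)
            if r is None:
                return None
            value, text = r
            values.append(value)
        return values, text
    return parse

def many(p):
    """Apply p zero or more times."""
    def parse(text):
        values = []
        while (r := p(text)) is not None:
            value, text = r
            values.append(value)
        return values, text
    return parse

# Each piece is testable on its own, then composed:
word = regex(r"[A-Za-z]+")
comma = regex(r",\s*")
csv_line = seq(word, many(seq(comma, word)))
```

Nothing here backtracks exponentially, and `word` or `comma` can be swapped out or tested without touching `csv_line`.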
Still, when you have a well-defined input and you can treat data as a flat string (no hierarchy), PCRE is probably the best choice you have - and certainly the fastest one.
The biggest surprise for many, myself included, around the genomic software that arose after the Human Genome Project was that the useful things people wanted to do with sequence data didn't need the expressiveness provided by PCRE et al. String algorithms have a minor role in day-to-day genomics.
As mentioned in other comments, sequence analysis is probabilistic, so "matchers" instead tend to be statistical models, like HMMs. There is a rich relationship between statistical models like HMMs and parsing theory.
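As a sketch of that idea: instead of a regex that matches or doesn't, an HMM assigns probabilities and you decode the most likely hidden states. A toy two-state model over DNA (the states and all probabilities below are invented for illustration, loosely in the spirit of CpG-island finding), decoded with the Viterbi algorithm:

```python
import math

states = ("island", "background")
start = {"island": 0.5, "background": 0.5}
trans = {  # state-to-state transition probabilities (made up)
    "island":     {"island": 0.9, "background": 0.1},
    "background": {"island": 0.1, "background": 0.9},
}
emit = {  # per-state nucleotide emission probabilities (made up)
    "island":     {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "background": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}

def viterbi(seq):
    """Return the most probable hidden-state path for an observed DNA string."""
    v = {s: math.log(start[s]) + math.log(emit[s][seq[0]]) for s in states}
    back = []
    for ch in seq[1:]:
        nv, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda p: v[p] + math.log(trans[p][s]))
            ptr[s] = prev
            nv[s] = v[prev] + math.log(trans[prev][s]) + math.log(emit[s][ch])
        v, back = nv, back + [ptr]
    # Trace back from the best final state.
    state = max(states, key=v.get)
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return list(reversed(path))
```

A GC-rich stretch decodes as "island", an AT-rich one as "background" — a probabilistic matcher rather than a yes/no pattern.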
Really? String algorithms (in the stringology / compressed data structures sense) are the foundation of virtually every operation in genomics that interacts with raw data. Have you ever aligned a sequence?
Regexes are not important. But they are a tiny bit of string algorithms.
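For readers who haven't: the core of alignment is dynamic programming over two strings, not pattern matching. A minimal global-alignment (Needleman-Wunsch) scorer — scoring values are arbitrary here, and real aligners add traceback, affine gaps, heuristics, and indexes:

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the optimal global alignment score of strings a and b,
    using one rolling row of the DP matrix."""
    prev = [j * gap for j in range(len(b) + 1)]  # aligning "" against prefixes of b
    for i, ca in enumerate(a, 1):
        cur = [i * gap]  # aligning a prefix of a against ""
        for j, cb in enumerate(b, 1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            cur.append(max(diag,            # substitute/match
                           prev[j] + gap,   # gap in b
                           cur[j - 1] + gap))  # gap in a
        prev = cur
    return prev[-1]
```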
Not the op. Yes, I've done some work on this, then tested it against some of the software used in various environments for this kind of work, and more than once spotted alternative, more efficient alignments. The practical upshot is that I ended up wondering whether, if a serious bug is ever found in such a piece of software, it shouldn't automatically cause all papers that used it to be at minimum flagged for an additional round of review, and potentially disqualified.
What also struck me is that the people using this software treat it like a black box: they have absolutely no way of verifying that what it did, it did correctly.
You’re making a good point that’s usually ignored outside of genomics. Inside genomics, alignment is treated, correctly, as a probabilistic rather than deterministic process (i.e., an alignment is not “right” or “wrong”) and many choose to consider multiple alignments.
That works as long as you don't try to do things like phylogenetic trees, where the ordering becomes so critical that even one swap can make it look as if the order was reversed. Of course you should never rely on just the one datum, but the temptation to do so and to treat the software as correct is large, due to the pressure to publish rather than to hold back and wait until there are multiple pieces of evidence.
There are probably serious bugs in all sequence alignment software, but it's unclear how much that matters. Downstream analysis must assume that the alignments contain all kinds of known and unknown errors anyway. The sequence alignment problem itself is so ill-defined, and there are so many sources of bias and error — from data to algorithms, from code to reference sequences, from instruments to sample preparation, and including your definition of truth — that you often can't say confidently whether an alignment is correct. The scale of the data is often also big enough that you have to make deliberate trade-offs between cost and correctness.
I have! I even regularly contribute to a popular alignment application! Regardless, you’re absolutely correct and I should’ve chosen my words more carefully. Especially about the relationship to information theory. I meant “regular expressions, LL, LR, LALR, etc.”
In modern genomics, you are seeing more and more use of BAMs (binary versions of SAM, the Sequence Alignment/Map format), even for the unaligned data that FASTQ was normally used for. Not only are they smaller in size, but there is also a further-compressed relative called CRAM, which can use lossy or lossless compression depending on the use case.
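SAM itself is still tab-separated text, one alignment per line with 11 mandatory fields (BAM is the same model, binary and compressed). A sketch of parsing one record with only the standard library — real tools use htslib/pysam, and this skips the `@`-prefixed header lines:

```python
# The 11 mandatory SAM columns, in order (per the SAM specification).
SAM_FIELDS = ("qname", "flag", "rname", "pos", "mapq",
              "cigar", "rnext", "pnext", "tlen", "seq", "qual")

def parse_sam_line(line):
    """Parse one SAM alignment line into a dict; extra columns are optional tags."""
    cols = line.rstrip("\n").split("\t")
    rec = dict(zip(SAM_FIELDS, cols))
    for k in ("flag", "pos", "mapq", "pnext", "tlen"):
        rec[k] = int(rec[k])
    rec["tags"] = cols[11:]
    rec["unmapped"] = bool(rec["flag"] & 0x4)  # FLAG bit 0x4 = segment unmapped
    return rec
```

Which is the point of the whole thread: even the workhorse alignment format is, at bottom, tab-delimited text of the kind Perl was built to mangle.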
Interestingly, almost all of the petabyte-scale and beyond processing of genomes (whole or exome) is done on the JVM, as the library and toolkit ecosystem is extremely mature and it is significantly more performant than just scripting things in Perl. Having access to the big-data ecosystem that runs on the JVM is another reason why languages like Java and Scala are found in the high-performance areas of genomics.
> Perl programs are easy to write and fast to develop. The interpreter doesn't require you to declare all your function prototypes and data types in advance, new variables spring into existence as needed, calls to undefined functions only cause an error when the function is needed. The debugger works well with Emacs and allows a comfortable interactive style of development.
I think each and every language that could undo this second "billion-dollar mistake" did, via some form of "strict mode".
The point is, like PHP and FORTRAN it got millions of people programming who otherwise wouldn't have -- precisely because of its loosey-goosey philosophy (and lack of default strictness). And because it could be used to do some seriously powerful stuff, and get it out the door much more quickly than in the C/C++ world (arguably its only real competitor at the time).
Of course over time, these same people learned to program better, and its looseness and general wackiness grew into a liability.
But the important point here is: language design decisions (just like product decisions) aren't so much intrinsically right or wrong as right or wrong at certain times.
In its heyday, Perl was, for many people, definitely the right way to go.