Classic Emacs syntax highlighting is based on regular expressions ("font-lock-mode"). Of course, the grammars of programming languages are usually not regular languages but higher up in the language class hierarchy (hi, C++!). But you can get a surprising amount of things right just through the context in which a token appears.
For instance, the example of this article (`type` as a keyword vs. `type` as a function) would probably have worked with font-lock-mode as well because you could distinguish the two cases from whether or not a left parenthesis follows the token. But, of course, without proper parsing, there's always the possibility of edge cases that you cannot resolve correctly.
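To make the parenthesis heuristic concrete, here is a minimal Python sketch of the idea (not font-lock itself). The function name `classify_type_tokens` and the rule "a `(` after optional whitespace means a call" are assumptions made for illustration only.

```python
import re

# Assumption: `type` followed by "(" (after optional whitespace) is a
# function call; any other occurrence is treated as the keyword.
TYPE_TOKEN = re.compile(r"\btype\b")
OPEN_PAREN_NEXT = re.compile(r"\s*\(")


def classify_type_tokens(line):
    """Return (column, kind) pairs for every `type` token in the line."""
    found = []
    for m in TYPE_TOKEN.finditer(line):
        is_call = OPEN_PAREN_NEXT.match(line, m.end()) is not None
        found.append((m.start(), "function" if is_call else "keyword"))
    return found
```

The sketch has exactly the kind of blind spot mentioned above: a `type(` inside a string literal or a comment would still be reported as a function call, because nothing here knows about the surrounding context.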
The interesting cases arise anyway when whatever you have in your buffer does not adhere to the grammar, i.e. you have a syntax error: how does then your syntax highlighter cope with that?
(author here) I agree, the `type` example could be done with regular expressions. In part 2 I'm planning to describe the real reason I was using tree-sitter here. I wanted to highlight certain combinations of operations based on the naming conventions I use in one of my projects. In particular, I want to catch a function call where a function named "x_to_y" has an argument with a name that does not appear to be an "x". However, while writing part 1 I realized that I could probably do that with a regular expression…
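A rough Python sketch of what such a regular-expression check might look like follows. It is not the code from the article; the function name `suspicious_conversions`, the restriction to single-identifier arguments, and the rule "the argument name should contain the `x` part" are all assumptions made for illustration.

```python
import re

# Assumption: conversion functions are named <x>_to_<y>, and a call passes
# exactly one plain identifier as its argument.
CALL = re.compile(
    r"\b([a-z][a-z0-9]*)_to_([a-z][a-z0-9]*)\(\s*([A-Za-z_]\w*)\s*\)"
)


def suspicious_conversions(source):
    """Return the text of each x_to_y(arg) call whose arg does not mention x."""
    hits = []
    for m in CALL.finditer(source):
        src, _dst, arg = m.groups()
        if src not in arg.lower():
            hits.append(m.group(0))
    return hits
```

For instance, `meters_to_feet(height_feet)` would be flagged because the argument name does not contain `meters`, while `meters_to_feet(height_meters)` would pass. Nested calls, multiple arguments, and expressions as arguments are out of reach for this pattern, which is where a real parse tree would start to pay off.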
In addition to leaning mostly on regexps (used in a few ways), the ancient Emacs `font-lock` highlighting also uses "syntax classes" of characters to help tokenize/lex and structure (e.g., is this character an identifier constituent, does it start a string literal, does it start a structural grouping like a parenthesis, etc.). There are also some ways to insert arbitrary code to do some things that are harder, like non-regexp lookahead. You can also annotate pieces of text as you go through it, to cache information.
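As a loose illustration of the syntax-class idea, here is a small Python sketch. It does not reproduce the actual Emacs syntax table API; the class names, the `SYNTAX_CLASSES` table, and the choice of which characters count as identifier constituents are all assumptions.

```python
from itertools import groupby

# Hypothetical syntax table: each special character maps to one class.
SYNTAX_CLASSES = {
    '"': "string-quote",
    "(": "open",
    ")": "close",
    ";": "comment-start",
}


def syntax_class(ch):
    """Look up the class of a single character."""
    if ch in SYNTAX_CLASSES:
        return SYNTAX_CLASSES[ch]
    if ch.isspace():
        return "whitespace"
    if ch.isalnum() or ch in "_-":
        return "word"  # identifier constituent (assumed set)
    return "punctuation"


def runs(text):
    """Group consecutive characters of the same class into (class, text) runs."""
    return [(cls, "".join(chars)) for cls, chars in groupby(text, key=syntax_class)]
```

Grouping by class alone is a simplification (adjacent parentheses end up in one run, for example), but it shows how a per-character table can give a highlighter some structure before any regexp is applied.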
The rules for indenting are actually implemented differently, even though they also involve some kind of parse. And it's not unusual to have to cache context information about the current line, for performance, so that you don't have to look back at preceding lines until you're satisfied you have enough context to indent the current line. The functions to indent multiple lines at once of course might represent this context without having to annotate the buffer.
> you have a syntax error: how does then your syntax highlighter cope with that?
I wrote (but didn't release) an all-new language-specific incremental fast parser for Emacs that recovered from some syntax errors. My general approach was to pick a region of text that included the obvious syntax error, visually highlight it in red, annotate it so that a mouseover would hover an explanation bubble of what's wrong with it, and then continue the parse assuming some reasonable context. You can see screenshots at:
For example, for an unterminated string literal, it would error-highlight the opening quote and subsequent characters up to the first whitespace. For another example, a string literal with an invalid escape sequence would error-highlight the entire string literal up through the closing quote. Another example shown is detecting a character that can't occur in that context (a close-paren immediately after a comment-the-following-s-expression).
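The unterminated-string recovery described above can be sketched roughly as follows in Python. This is not the unreleased parser's code; the function name `scan_strings`, the span format, and the restriction to double-quoted literals with backslash escapes are assumptions.

```python
def scan_strings(text):
    """Return (start, end, kind) spans for double-quoted string literals.

    kind is "string" for a terminated literal and "error" for an
    unterminated one.
    """
    spans = []
    i, n = 0, len(text)
    while i < n:
        if text[i] != '"':
            i += 1
            continue
        start = i
        j = i + 1
        while j < n and text[j] != '"':
            # Skip the character after a backslash so \" does not end the literal.
            j += 2 if text[j] == "\\" else 1
        if j < n:
            spans.append((start, j + 1, "string"))
            i = j + 1
        else:
            # Unterminated: flag the opening quote and the characters up to
            # the first whitespace, then resume scanning after that point.
            k = start + 1
            while k < n and not text[k].isspace():
                k += 1
            spans.append((start, k, "error"))
            i = k
    return spans
```

Limiting the error span to the first whitespace keeps the red region small, so the rest of the buffer can still be scanned and highlighted as if the literal had ended there.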
Very excited to see parsing for ill-defined states! I like your naming scheme of using animal sounds, but just wanted to bring to your attention that Emacs already has a popular package named meow (for modal editing).
I just updated my page to acknowledge that there's a different project with that name, and I will rename my unreleased project.
(I'd mentioned Meow online several times, years ago, but understandable that they wouldn't have been aware of it, and I have no claim to the name, anyway. Not only was my project never released, but the community where I mostly mentioned it had/has a problem with many posts from our Google Group no longer showing up in Google search hits.)
> I like your naming scheme of using animal sounds,
It originally wasn't. :) The developers of the Scheme implementation family that's now called Racket developed a bespoke IDE for students, called DrScheme (as in doctor), which did some fancy things. For my much less fancy Emacs kludges, I named it "Quack", as in a fake doctor. The animal sounds only came when I needed a name for the successor to Quack.