Some 🌳💞 for RPM Spec

Two years ago, I started to look into writing a tree-sitter parser as part of the “Day of Learning” at my employer. As I write and edit many RPM Spec files (also at work), I wanted better highlighting in my text editor, which is Neovim.

I had attempted to start the project the year before but got lost. The next year, the tree-sitter documentation had improved, and I finally understood the basics and began developing tree-sitter-rpmspec.

RPM Spec is challenging (horrible) to parse. RPM spec files are parsed in multiple stages, roughly following these phases:

Phase 1. Macro expansion pass

  • Macros (%{name}, %{version}, %define, %global, etc.) are expanded
  • This happens before the spec is interpreted as instructions
  • Macros can be nested and are expanded recursively
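To make the recursive expansion above concrete, here is a minimal sketch in Python. It is a toy, not how RPM implements it: the `expand` function, the macro table, and the depth limit are all made up for illustration, and it only handles the plain `%{name}` form (no `%define`, no parametric macros, no conditionals).

```python
import re

# Matches only the simple %{name} form; real RPM macro syntax is much richer.
MACRO_REF = re.compile(r"%\{(\w+)\}")


def expand(text, macros, depth=0, max_depth=16):
    """Recursively expand %{name} references using a dict of macro bodies."""
    if depth > max_depth:
        # Arbitrary guard for this sketch so self-referencing macros terminate.
        raise RecursionError("macro recursion too deep")

    def replace(match):
        name = match.group(1)
        if name not in macros:
            return match.group(0)  # unknown macros are left untouched here
        # The macro body may itself contain macros, so expand it again.
        return expand(macros[name], macros, depth + 1, max_depth)

    return MACRO_REF.sub(replace, text)


# Hypothetical macro table: %{nv} is defined in terms of two other macros.
macros = {"name": "samba", "version": "1.0", "nv": "%{name}-%{version}"}
print(expand("Source0: %{nv}.tar.xz", macros))
```

The point is only that a macro body can reference further macros, so a single substitution pass is not enough.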

Phase 2. Conditional evaluation

  • %if, %ifarch, %ifos, %endif blocks are evaluated
  • This determines which sections of the spec are active
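As a rough illustration of this phase, the following Python sketch filters lines through `%ifarch` / `%else` / `%endif` blocks. The `active_lines` function and the sample lines are hypothetical, and the sketch ignores `%if` expressions, `%ifos`, and macro expansion entirely.

```python
def active_lines(lines, arch):
    """Return the lines that stay active for `arch` (toy %ifarch handling only)."""
    stack = []  # one boolean per currently open %ifarch block
    out = []
    for line in lines:
        words = line.split()
        keyword = words[0] if words else ""
        if keyword == "%ifarch":
            stack.append(arch in words[1:])
        elif keyword == "%else":
            stack[-1] = not stack[-1]
        elif keyword == "%endif":
            stack.pop()
        elif all(stack):  # active only if every enclosing block is true
            out.append(line)
    return out


# Made-up snippet for illustration.
spec = [
    "BuildRequires: gcc",
    "%ifarch x86_64",
    "BuildRequires: nasm",
    "%else",
    "BuildRequires: yasm",
    "%endif",
]
print(active_lines(spec, "x86_64"))
print(active_lines(spec, "aarch64"))
```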

Phase 3. Section parsing and execution

  • The preamble (Name, Version, Release, etc.) is parsed first
  • Then each section (%description, %prep, %build, %install, %files, etc.) is parsed
  • Some sections like %files have their own sub-parsing for file attributes
  • Once parsing is done, RPM starts executing the different scriptlets to build and install the package

As soon as BuildArch is involved, the RPM parser needs to be able to re-read the spec file. This is one reason why it can’t read spec files from stdin, see e.g. here.

Writing a tree-sitter parser for spec files is not straightforward. There are many pitfalls and edge cases. The two most difficult challenges are figuring out where a section ends and deciding what a trailing %if belongs to. A section has no end marker and no indentation. It ends where the next section starts. However, right before the next section there could be an %if. Does that %if belong to the section before it, or is it a top-level conditional like #ifdef in C?
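The section-boundary problem shows up quickly with a deliberately naive splitter. The Python sketch below is not how tree-sitter-rpmspec works. `split_sections`, the partial `SECTION_NAMES` list, and the `%{with_docs}` macro are assumptions made up for the example. It ends a section whenever the next section keyword appears, and the `%if` ends up glued to the wrong section.

```python
# Partial list of section keywords, enough for the example.
SECTION_NAMES = {"%description", "%prep", "%build", "%install",
                 "%check", "%files", "%changelog", "%package"}


def split_sections(lines):
    """Naively start a new section at every line that begins with a section keyword."""
    sections = [("preamble", [])]
    for line in lines:
        words = line.split()
        keyword = words[0] if words else ""
        if keyword in SECTION_NAMES:
            sections.append((keyword, []))
        else:
            sections[-1][1].append(line)
    return sections


spec = """\
Name: demo
%build
make
%if %{with_docs}
%files docs
%doc README
%endif
""".splitlines()

for name, body in split_sections(spec):
    print(name, body)
```

Here the `%if` line lands at the end of `%build` while its `%endif` lands inside `%files`, although the author of the spec clearly meant the conditional to wrap the whole `%files docs` section.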

If you’re interested in the details, there is a DESIGN.md explaining some design decisions. In short, the parser.c was reaching 64MB and -Woverflow was triggered. This led to implementing an external scanner, which reduced the size to ~20MB. I rewrote the scanner.c at least 5 times from scratch. In the last rewrite, I started with the simplest approach and built up from there, focusing on balanced parenthesis parsing for %{expand: string}.
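The idea of balanced parsing for `%{expand: string}` can be sketched with a simple depth counter. The actual scanner is written in C and is surely more involved. The Python below, including the `find_macro_end` name, is only an assumption-laden illustration that counts braces and ignores escapes and quoting.

```python
def find_macro_end(text, start):
    """Return the index just past the '}' closing the '%{' at `start`, or -1 if unbalanced."""
    if not text.startswith("%{", start):
        raise ValueError("expected '%{' at start")
    depth = 0
    i = start + 1  # index of the opening '{'
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return i + 1
        i += 1
    return -1  # ran out of input before the braces balanced


line = "%{expand: echo %{name} ${HOME}} trailing"
end = find_macro_end(line, 0)
print(line[:end])
```

Counting depth rather than stopping at the first `}` is what lets nested macros and shell variables like `${HOME}` stay inside the macro body.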

Yesterday, I had a breakthrough: parametric macros finally parsed correctly. With this milestone, I was able to write injection queries to run tree-sitter-bash on the scriptlets' shell code! This means we can highlight the bash parts now!

The image shows the Samba spec file with syntax highlighting.
tree-sitter-rpmspec syntax highlighting including tree-sitter-bash injections.
