Some 🌳💞 for RPM Spec
Two years ago, I started to look into writing a tree-sitter parser as part of the “Day of Learning” at my employer. As I write and edit many RPM Spec files (also at work), I wanted better highlighting in my text editor, which is Neovim.
I had attempted to start the project the year before but was lost. The next year, the tree-sitter documentation improved, and I finally understood the basics and began developing tree-sitter-rpmspec.
RPM Spec is challenging (horrible) to parse. RPM spec files are parsed in multiple stages, roughly following these phases:
Phase 1. Macro expansion pass
- Macros (%{name}, %{version}, %define, %global, etc.) are expanded
- This happens before the spec is interpreted as instructions
- Macros can be nested and are expanded recursively
Phase 2. Conditional evaluation
- %if, %ifarch, %ifos, %endif blocks are evaluated
- This determines which sections of the spec are active
Phase 3. Section parsing and execution
- The preamble (Name, Version, Release, etc.) is parsed first
- Then each section (%description, %prep, %build, %install, %files, etc.) is parsed
- Some sections like %files have their own sub-parsing for file attributes
- Once parsing is done, RPM starts executing the scriptlets to build and install the package
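To make the first two phases more concrete, here is a minimal Python sketch of recursive macro expansion followed by conditional evaluation. It is not how RPM implements it: the function names (`expand`, `expand_spec`, `active_lines`) are made up for illustration, only the `%{name}` form and plain `%if`/`%endif` are handled, and both `%define` and `%global` are treated as lazily expanded, which is a simplification.

```python
import re

# Only the %{name} form is recognised here; real RPM macro syntax is much richer.
MACRO_REF = re.compile(r"%\{(\w+)\}")
MACRO_DEF = re.compile(r"%(?:global|define)\s+(\w+)\s+(.*)")


def expand(text, macros, depth=0, max_depth=16):
    """Recursively replace %{name} references using the `macros` table."""
    if depth > max_depth:
        raise RecursionError("macro expansion nested too deeply")

    def substitute(match):
        name = match.group(1)
        if name not in macros:
            return match.group(0)  # leave unknown macros untouched
        # A macro body may itself contain macros, so expand it again.
        return expand(macros[name], macros, depth + 1, max_depth)

    return MACRO_REF.sub(substitute, text)


def expand_spec(lines):
    """Phase 1 (simplified): collect definitions, expand everything else."""
    macros = {}
    out = []
    for line in lines:
        definition = MACRO_DEF.match(line)
        if definition:
            macros[definition.group(1)] = definition.group(2)
            continue
        out.append(expand(line, macros))
    return out


def active_lines(lines):
    """Phase 2 (simplified): keep only lines inside true %if blocks.

    Assumes the condition is already expanded to a literal; "0" means false.
    %ifarch, %ifos and %else are not handled in this sketch.
    """
    stack = []  # one boolean per currently open %if
    out = []
    for line in lines:
        stripped = line.strip()
        if stripped.startswith("%if "):
            stack.append(stripped.split(None, 1)[1] != "0")
        elif stripped == "%endif":
            stack.pop()
        elif all(stack):
            out.append(line)
    return out


spec = [
    "%global base foo",
    "%global full %{base}-tools",
    "%global with_docs 0",
    "Name: %{full}",
    "%if %{with_docs}",
    "BuildRequires: doxygen",
    "%endif",
    "Version: 1.0",
]
result = active_lines(expand_spec(spec))
```

The point of the sketch is the ordering: the `%if` line can only be evaluated after `%{with_docs}` has been expanded, which is part of why a purely syntactic parser has a hard time with spec files.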
As soon as BuildArch is involved, the RPM parser needs to be able to re-read the spec file. This is one reason why it can’t read spec files from stdin, see e.g. here.
Writing a tree-sitter parser for spec files is not straightforward. There are many pitfalls and edge cases. The two most difficult challenges are related: figuring out where a section ends, and deciding which scope a conditional belongs to. A section has no end marker and no indentation. It ends when the next section starts. However, right before the next section there could be an %if. Does that %if belong to the preceding section, or is it a top-level conditional, like #ifdef in C?
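The ambiguity is easy to reproduce with a naive, line-based splitter. The following Python sketch is only an illustration: `split_sections`, the short `SECTION_NAMES` set, and the sample spec are made up for this example and have nothing to do with how tree-sitter-rpmspec works internally.

```python
# A small, incomplete set of section headers, enough for the example.
SECTION_NAMES = {"%description", "%prep", "%build", "%check", "%install", "%files"}


def split_sections(lines):
    """Naively start a new section whenever a line begins with a section name."""
    sections = [("preamble", [])]
    for line in lines:
        parts = line.split()
        first_word = parts[0] if parts else ""
        if first_word in SECTION_NAMES:
            sections.append((first_word, []))
        else:
            # Everything else, including %if and %endif, sticks to the
            # section that is currently open.
            sections[-1][1].append(line)
    return sections


spec = [
    "Name: demo",
    "%build",
    "make",
    "%if %{with_tests}",
    "%check",
    "make test",
    "%endif",
    "%install",
    "make install",
]

sections = split_sections(spec)
for name, body in sections:
    print(name, body)
```

In this sample the `%if` is meant to wrap the whole `%check` section, but the naive splitter files the `%if` line under `%build` and the `%endif` under `%check`. A grammar has to decide, without any end marker, which of the two readings is intended.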
If you’re interested in the details, there is a DESIGN.md explaining some design decisions. In short, the parser.c was reaching 64MB and -Woverflow was triggered. This led to implementing an external scanner, which reduced the size to ~20MB. I rewrote the scanner.c at least 5 times from scratch. In the last rewrite, I started with the simplest approach and built up from there, focusing on balanced parenthesis parsing for %{expand: string}.
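The idea behind balanced-delimiter scanning can be shown in a few lines. The real scanner.c is written in C and does considerably more. This Python sketch, with the hypothetical helper `find_macro_end`, only counts braces and ignores escapes, `%(...)` shell expansions and `%[...]` expressions.

```python
def find_macro_end(text, start):
    """Return the index just past the '}' closing the '%{' at `start`.

    Returns -1 if the macro is never closed. Nested '{' ... '}' pairs are
    tracked with a simple depth counter.
    """
    if not text.startswith("%{", start):
        raise ValueError("expected '%{' at position %d" % start)
    depth = 0
    i = start + 1  # index of the opening '{'
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return i + 1  # one past the matching close brace
        i += 1
    return -1  # ran out of input while still inside the macro


text = "%{expand:%{?with_docs:%{name}-docs}} tail"
end = find_macro_end(text, 0)
macro = text[:end]
```

Because the content of `%{expand: ...}` is an arbitrary string, a depth counter like this is about the only reliable way to know where the macro stops. Only after that can the inner text be looked at.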
Yesterday, I had the breakthrough: parametric macros finally parse correctly. With this milestone, I was able to write injection queries that run tree-sitter-bash on the shell code in the scriptlets. This means we can highlight the bash parts now!
