XML parsing in the Roslyn era
Roslyn was a revolution in the .NET world when it came out. For the first time, Microsoft released its C# compiler implementation under an open-source license. Not only that, the way this compiler had been constructed opened the door to numerous improvements in the ecosystem in the form of:
- A complete API to manipulate code as data
- A compiler platform that is easy to hack on
- A backend that is adapted for editor/language service usage
This enabled faster evolution of the C# language, a plethora of tools and analyzers that can fully understand your code, and IntelliSense that’s better than ever.
In my day-to-day job I work on visual tools at Microsoft for our Xamarin platform. As part of this we integrate with both Visual Studio for Mac and Visual Studio, where we, of course, use Roslyn for most of our user code interactions.
However, those visual tools all have another thing in common: they work on some dialect of XML (be it XAML, Android layout XML, XML resources or others). In a lot of ways, working with XML in the context of an editor presents the same challenges that Roslyn solved for code:
- You need high-fidelity parsing of your document preserving user modifications
- You need to react to and create changes applied to specific text subsets
- You need a representation that is easy to manipulate and pass around
Existing XML solutions (System.Xml, System.Xml.Linq, …) tend to fall short because they weren’t created with editor usage in mind. Their use case is more about exposing a document’s content than caring about the way it’s written and how it evolves.
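To make that limitation concrete, here is a minimal sketch. It uses Python's standard xml.etree.ElementTree purely as a stand-in for a document-oriented parser such as System.Xml (the substitution and the sample markup are mine, for illustration only). Parsing and re-serializing keeps the content but not the way the user wrote it:

```python
import xml.etree.ElementTree as ET

# What the user typed: single quotes, extra spaces, a comment before the root.
source = "<!-- layout -->\n<Button   Text='OK'\n          Width = \"80\"/>"

root = ET.fromstring(source)
round_tripped = ET.tostring(root, encoding="unicode")

# The content survives the round trip...
content_preserved = root.attrib == {"Text": "OK", "Width": "80"}
# ...but the original text (quoting, spacing, comment) does not.
text_preserved = round_tripped == source

print(content_preserved, text_preserved)  # prints: True False
```

For an editor this is a problem: writing the document back would silently reformat everything the user typed.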
So basically what we need is Roslyn but for XML. Fortunately Kirill thought the same, and a few years ago he started taking an existing morsel of Roslyn, namely the VB XML literal parser, and transforming it into a fully fledged XML parser and syntax representation.
That project is (originally) named XmlParser.
In a nutshell, XmlParser brings the Roslyn SyntaxTree AST model (immutable, full-fidelity, error-tolerant) and the parsing infrastructure that creates it, to the point that using it should feel familiar to anyone who is already used to working with Roslyn.
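As a rough illustration of what "full-fidelity" and "error-tolerant" mean in practice, here is a toy tokenizer. It is not XmlParser's lexer; the names Token, TOKEN_RE and tokenize are hypothetical and the grammar is heavily simplified. The point is only that every character lands in some token, and that unexpected input becomes an error token rather than an exception:

```python
import re
from typing import List, NamedTuple

class Token(NamedTuple):
    kind: str
    text: str

# Toy lexer: every character of the input ends up in exactly one token,
# and anything unrecognised becomes an "error" token instead of an exception.
TOKEN_RE = re.compile(r"""
    (?P<whitespace>\s+)
  | (?P<punctuation></|/>|[<>=])
  | (?P<string>"[^"<]*"|'[^'<]*')
  | (?P<name>[A-Za-z_][\w.:-]*)
  | (?P<error>.)
""", re.VERBOSE | re.DOTALL)

def tokenize(text: str) -> List[Token]:
    return [Token(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(text)]

# Malformed on purpose: a stray '@' and an element that is never closed.
broken = "<Button  Text='OK' Width=@>\n  <Label"
tokens = tokenize(broken)

# Full fidelity: concatenating the tokens gives back the exact input.
assert "".join(t.text for t in tokens) == broken
# Error tolerance: the bad character is recorded, not thrown.
assert Token("error", "@") in tokens
```

A real syntax tree adds structure on top of such tokens, but the same invariant holds: the tree can always reproduce the text it was built from.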
As part of ongoing work, I spent a bit of time myself during the last few weeks bringing some improvements to the project to make it even better for an editor use case:
- Proper red-green node separation, which is key to Roslyn’s performance while retaining immutability
- Parsing incrementality: the ability to reuse existing syntax tree nodes when only a portion of the text buffer has changed
- The start of proper syntax factories and modification APIs
- More Roslyn code import where it made sense (common utilities, caching, …)
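The red-green idea and the node reuse it enables can be sketched in a few lines. This is a simplified model with hypothetical names (Green, Red, with_child), not the actual XmlParser or Roslyn types: green nodes are immutable and only know their width, while red nodes are cheap wrappers created on demand that add a parent and an absolute position.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class Green:
    """Immutable, position-free node: kind, own text and children only."""
    kind: str
    text: str = ""
    children: Tuple["Green", ...] = ()

    @property
    def width(self) -> int:
        return len(self.text) + sum(c.width for c in self.children)

    def with_child(self, index: int, new_child: "Green") -> "Green":
        # Build a new node; untouched siblings are shared, not copied.
        kids = self.children[:index] + (new_child,) + self.children[index + 1:]
        return Green(self.kind, self.text, kids)

class Red:
    """Throwaway wrapper created on demand: adds parent and absolute position."""
    def __init__(self, green: Green, parent: Optional["Red"] = None, start: int = 0):
        self.green, self.parent, self.start = green, parent, start

    def children(self):
        offset = self.start + len(self.green.text)
        for g in self.green.children:
            yield Red(g, self, offset)
            offset += g.width

# <Button Text='OK'/> as four leaf nodes under one element.
old_root = Green("element", children=(
    Green("punct", "<"),
    Green("name", "Button"),
    Green("attribute", " Text='OK'"),
    Green("punct", "/>"),
))

# Simulate an edit that only touches the attribute.
new_root = old_root.with_child(2, Green("attribute", " Text='Cancel'"))

old_close = list(Red(old_root).children())[3]
new_close = list(Red(new_root).children())[3]

# The closing "/>" moved in the text, yet its green node is the very same object.
assert (old_close.start, new_close.start) == (17, 21)
assert old_close.green is new_close.green
```

Because green nodes carry no position or parent, an incremental reparse can keep every node outside the edited range and only rebuild the spine above the change; positions are recomputed lazily by the red layer.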
Although part of this is still a work in progress, we have published updated NuGet packages (now under the GuiLabs.Language.Xml NuGet ID) with the latest code so that you can try it for yourself.