Carolyn vs. the Data Serialization Formats

My current C# project involves a great deal of data and some fairly complicated object structures, which I save to file and reload every time I start the program.

[Image: two console windows showing the serialization of a data file containing cereal names.]

This is not the project in question. I just think cerealization is funny.

I’ve been saving and loading with .NET binary serialization, which works wonderfully right up until I want to make changes to the file format. Then my system rejects the old file and I have to start over. This is getting old – and it will get even older for beta users down the line. I need version tolerance when I’m serializing and deserializing my files.

It is possible to do custom serialization and version-tolerant serialization with the binary serializer. But over time, version-tagging all my fields will make my code look very ugly, as you can admire here – and if I ever clean those tags up, old documents will stop working. This isn’t an ideal solution for me.
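For illustration, version tagging with the binary serializer looks roughly like the sketch below. It uses the standard `[OptionalField]` and `[OnDeserializing]` attributes from System.Runtime.Serialization; the class and field names (`SaveFile`, `Title`, `WordCount`, `Author`) are invented for the example and are not from the project described here.

```csharp
using System;
using System.Linq;
using System.Runtime.Serialization;

// Hypothetical save-file class; all names are invented for this example.
[Serializable]
public class SaveFile
{
    // Present since the first version of the file format.
    public string Title;

    // Added in a later version. Without the tag, files written by the
    // first version fail to load because this field is missing from them.
    [OptionalField(VersionAdded = 2)]
    public int WordCount;

    [OptionalField(VersionAdded = 3)]
    public string Author;

    // Runs before the fields are populated, so an optional field that is
    // missing from an old file keeps this default instead of null.
    [OnDeserializing]
    private void SetDefaults(StreamingContext context)
    {
        Author = "Unknown";
    }
}

public static class Program
{
    // Returns the names of the public fields that carry a version tag.
    public static string[] TaggedFields()
    {
        return typeof(SaveFile).GetFields()
            .Where(f => f.IsDefined(typeof(OptionalFieldAttribute), false))
            .Select(f => f.Name)
            .ToArray();
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", TaggedFields()));
    }
}
```

Every field added after the first release needs its own tag, and the tags have to stay in place for as long as old files are supposed to load, which is where the clutter comes from.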

I decided to keep the binary serializer, but to augment it with a second serialization system that will be more resistant to variable changes. Also, I wanted something that I could open, read, and edit by hand at need. This brought me to XML, JSON, and YAML.
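As a rough sketch of what "resistant to variable changes" can mean with a text format: many text-based serializers match data to members by name, so a missing or extra entry is not fatal. The example below assumes System.Text.Json (part of recent .NET versions, and not a library this post uses) purely as a stand-in; the `Document` class and its properties are hypothetical.

```csharp
using System;
using System.Text.Json;

// Hypothetical document class as it looks in "version 2" of a program.
public class Document
{
    public string Title { get; set; } = "Untitled";

    // New in version 2; files saved by version 1 do not contain it.
    public int WordCount { get; set; } = 0;
}

public static class Program
{
    // Loads a document from JSON text. Missing properties keep their
    // defaults, and unknown properties are ignored by default.
    public static Document Load(string json)
    {
        return JsonSerializer.Deserialize<Document>(json);
    }

    public static void Main()
    {
        // A "version 1" file: no WordCount, plus a Subtitle entry that
        // version 2 has since dropped.
        string oldFile = "{ \"Title\": \"Cerealization\", \"Subtitle\": \"A pun\" }";

        Document doc = Load(oldFile);
        Console.WriteLine(doc.Title);      // Cerealization
        Console.WriteLine(doc.WordCount);  // 0
    }
}
```

The old file still loads after the class has changed, and because it is plain text it can also be opened and patched by hand.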

Many helpful people sent advice by tweets and email!

One friend commented:

“Just wanted to say, as someone who worked professionally with XML for roughly a decade, I’d lean heavily towards YAML or JSON.”

This was a pretty common sentiment.

Here’s a rough summary of my four-day-old understanding. It may be wildly off base. If you trip over this page while doing research, please reach for the saltshaker. However, if you know this topic better than I do and want to correct some grievous misconception, I would love to hear from you in the comments!

XML

XML came first. It’s very aggressive about ensuring that its syntax rules are obeyed, and if they aren’t, it will reject the entire file with prejudice. While it’s technically human readable, it’s not very human readable.

JSON

JSON is more forgiving than XML, and much easier to work with. It’s based on JavaScript, and most JSON-formatted text can be processed as JavaScript code, but it’s no longer JavaScript-specific. It’s possibly the most heavily ported data format, in large part because it’s commonly used to transmit data between servers and web applications.

YAML

YAML is a superset of JSON: as of YAML 1.2, it is designed so that a valid JSON file is also valid YAML. Where JSON is good at transmitting data, YAML is good at transmitting object structures. One of the most important features of YAML is its system of node anchors and aliases, which lets one node in the file refer to another by ID instead of reprinting everything, as shown here on Wikipedia.

Using YamlDotNet

My system has a lot of circular references, and I want them handled smoothly. JSON serializers are capable of handling circular references (as John Bubriski explains here), but the node anchors make YAML a lot easier to read.
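To make the comparison concrete, here is a minimal sketch of a circular reference being written as JSON. It assumes System.Text.Json with `ReferenceHandler.Preserve` (available from .NET 5 onward) as a stand-in and is not the approach from the linked article; the `Room` class is hypothetical. The serializer adds `$id` and `$ref` metadata entries to the output, where YAML would mark the first node with an anchor (`&`) and point back to it with an alias (`*`).

```csharp
using System;
using System.Text.Json;
using System.Text.Json.Serialization;

// Hypothetical class whose instances can point at each other.
public class Room
{
    public string Name { get; set; }
    public Room Exit { get; set; }
}

public static class Program
{
    // Builds two rooms that refer to each other and serializes them.
    public static string SaveRooms()
    {
        var kitchen = new Room { Name = "Kitchen" };
        var hall = new Room { Name = "Hall", Exit = kitchen };
        kitchen.Exit = hall; // kitchen -> hall -> kitchen: a cycle

        var options = new JsonSerializerOptions
        {
            // Writes "$id" on first sight of an object and "$ref" on
            // later sightings instead of recursing forever.
            ReferenceHandler = ReferenceHandler.Preserve,
            WriteIndented = true
        };
        return JsonSerializer.Serialize(kitchen, options);
    }

    public static void Main()
    {
        Console.WriteLine(SaveRooms());
    }
}
```

Without the `ReferenceHandler` setting, the serializer reports the cycle as an error rather than writing the file.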

The catch with YAML: there are solid implementations for C and C++, but the most widely adopted C# implementation (YamlDotNet) is … lacking.

Specifically, it’s lacking documentation. It says so on the YamlDotNet site: “The documentation is lacking.” Instead of documentation, it has a list of samples and a link to StackOverflow.

A quick terminology digression: in C#, a field is a variable, and a property is a way to access a field via get and set methods. For example:

int distance = 0;
int velocity = 0;
string color = null;

public string Color {
    get { return color; }
    set { color = value; }
}

In the example above, the fields are distance, velocity, and lower-case color. Upper-case Color is a public property, and you can set the private field color by setting the public property Color.

The most important thing to know about YamlDotNet is that it only serializes public properties. All fields will be ignored, no matter what their access levels may be.
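The distinction can be seen with plain reflection. The sketch below is not YamlDotNet's actual code; it only illustrates, under that assumption, what a serializer that looks at public properties would and would not find on a class. The `Car` class is hypothetical and mirrors the snippet above.

```csharp
using System;
using System.Linq;
using System.Reflection;

// Hypothetical class mixing fields and a property.
public class Car
{
    public int distance = 0;        // public field
    private string color = "red";   // private field

    public string Color             // public property wrapping color
    {
        get { return color; }
        set { color = value; }
    }
}

public static class Program
{
    // Names of the members a property-only serializer would see.
    public static string[] PublicProperties(Type type)
    {
        return type.GetProperties(BindingFlags.Public | BindingFlags.Instance)
                   .Select(p => p.Name)
                   .ToArray();
    }

    // Names of all instance fields, whatever their access level.
    public static string[] AllFields(Type type)
    {
        return type.GetFields(BindingFlags.Public | BindingFlags.NonPublic | BindingFlags.Instance)
                   .Select(f => f.Name)
                   .ToArray();
    }

    public static void Main()
    {
        Console.WriteLine("Properties: " + string.Join(", ", PublicProperties(typeof(Car))));
        Console.WriteLine("Fields:     " + string.Join(", ", AllFields(typeof(Car))));
    }
}
```

Only `Color` shows up in the property list. Both `distance` and `color` are fields, so a property-based serializer skips them even though one of them is public.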

This initially threw me. I wanted to serialize my fields. On StackOverflow, Antoine Aubry (the creator and maintainer of YamlDotNet) explained: “You could alter the behavior of YamlDotNet to serialize private fields relatively easy, but I do not recommend that.”

That sounded like what I wanted, so I rolled up my sleeves and prepared to do the Not Recommended Thing. Some investigation in YamlDotNet located the interface IYamlSerializable, with the summary note “Allows a class to customize how it is serialized and deserialized.” The interface involved two methods, ReadYaml and WriteYaml, which sounded like exactly what I wanted.

Sadly, it wasn’t. As of version 3.5.1, there are no references to ReadYaml in YamlDotNet, and the WriteYaml reference isn’t correctly hooked up.

What about a different .NET implementation of YAML? The two suggested at the Official Yaml Page (yaml-net and yatools.net) were created in 2006 and 2009 respectively, and it doesn’t look like either one is being maintained. There are several other YAML packages on NuGet, but I don’t have any confidence that they’re heavily used or reliably maintained, since they have 80 downloads apiece or so.

But – controlling field access with properties is a best practice in C# programming. (Jon Skeet goes into more detail about why this is in Why Properties Matter.) I’ve been using properties erratically rather than reliably. If YamlDotNet forces me to clean up my code… well, that’s not a bad thing, actually.
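One possible shape for that cleanup, sketched with auto-implemented properties (available since C# 3.0): the compiler generates the backing fields, so each value is declared once and is visible to a property-based serializer. The `Car` class and its members are hypothetical.

```csharp
using System;

// Hypothetical class after the cleanup: no hand-written fields remain.
public class Car
{
    // Auto-implemented properties; the backing fields are generated
    // by the compiler and never appear in the source.
    public int Distance { get; set; }
    public string Color { get; set; }

    public Car()
    {
        // Before C# 6, defaults for auto-properties are set in a constructor.
        Color = "red";
    }
}

public static class Program
{
    public static void Main()
    {
        var car = new Car { Distance = 12 };
        Console.WriteLine(car.Color + " " + car.Distance); // red 12
    }
}
```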

So I’m going to use YamlDotNet (and clean up my code so it will work correctly), and then go from there.


4 Comments

  1. I used to kvetch about data encapsulation with properties too. It felt silly to have a list of private fields in lower case and a list of identical properties to encapsulate them in CamelCase. But auto-implemented properties in C# 3.0 made it pretty trivial to convert fields to properties, except that you have to do initialization in a constructor (though C# 6, which will ship with VS2015, includes auto-implemented property initialization: public string MyProperty { get; } = "Hello World!";).

  2. No problem :) It seems the general trend of the language has been to make coding easier and more intuitive. I’m only an occasional programmer, but C# is my favorite language.
