Some of you may know I've been spending whatever time I can scrounge together grinding away at a new serialization library for .NET. Serializers can be complicated beasts. They have to be reliable, flexible, and fast beyond reproach. I won't convince you that serialization libraries have to be quick — in this post, that's a given. These are some tips from my experience in optimizing Hagar's performance. Most of this advice is applicable to other types of libraries or applications.

A post on performance should have minimal overhead and get straight to the point, so this post focuses on tips to help you and things to look out for. Message me on Twitter if something is unclear or you have something to add.

Maximize profitable inlining

Inlining is the technique where a method body is copied to the call site so that we can avoid the cost of jumping, argument passing, and register saving/restoring. In addition to saving those costs, inlining is a requirement for other optimizations. Roslyn (C#'s compiler) does not inline code. Instead, it is the responsibility of the JIT, as are most optimizations.

Use static throw helpers

A recent change which involved a significant refactor added around 20ns to the call duration for the serialization benchmark, increasing times from ~130ns to ~150ns (which is significant).

The culprit was the throw statement added in this helper method:

public static Writer<TBufferWriter> CreateWriter<TBufferWriter>(
    this TBufferWriter buffer,
    SerializerSession session) where TBufferWriter : IBufferWriter<byte>
{
    if (session == null) throw new ArgumentNullException(nameof(session));
    return new Writer<TBufferWriter>(buffer, session);
}

When a method contains a throw statement, the JIT will not inline it. The common trick to solve this is to add a static "throw helper" method to do the dirty work for you, so the end result looks like this:

public static Writer<TBufferWriter> CreateWriter<TBufferWriter>(
    this TBufferWriter buffer,
    SerializerSession session) where TBufferWriter : IBufferWriter<byte>
{
    if (session == null) ThrowSessionNull();
    return new Writer<TBufferWriter>(buffer, session);

    void ThrowSessionNull() => throw new ArgumentNullException(nameof(session));
}

Crisis averted. The codebase uses this trick in many places. Having the throw statement in a separate method may have other benefits such as improving the locality of your commonly used code paths, but I'm unsure and haven't measured the impact.

Minimize virtual/interface calls

Virtual calls are slower than direct calls. If you're writing a performance critical system then there's a good chance you'll see virtual call overhead show up in the profiler. For one, virtual calls require indirection.

Devirtualization is a feature of many JIT Compilers, and RyuJIT is no exception. It's a complicated feature, though, and there are not many cases where RyuJIT can currently prove (to itself) that a method can be devirtualized and therefore become a candidate for inlining. Here are a couple of general tips for taking advantage of devirtualization, but I'm sure there are more (so let me know if you have any).

Mark classes as sealed by default. When a class/method is marked as sealed, RyuJIT can take that into account and is likely able to inline a method call.
Mark override methods as sealed if possible.
Use concrete types instead of interfaces. Concrete types give the JIT more information, so it has a better chance of being able to inline your call.
Instantiate and use non-sealed objects in the same method (rather than having a 'create' method). RyuJIT can devirtualize non-sealed method calls when the type is definitely known, such as immediately after construction.
Use generic type constraints for polymorphic types so that they can be specialized using a concrete type and interface calls can be devirtualized. In Hagar, our core writer type is defined as follows:

public ref struct Writer<TBufferWriter> where TBufferWriter : IBufferWriter<byte>
{
    private TBufferWriter output;
    // --- etc ---

All calls to methods on output in the CIL which Roslyn emits will be preceded by a constrained instruction which tells the JIT that instead of making a virtual/interface call, the call can be made to the precise method defined on TBufferWriter. This helps with devirtualization. All calls to methods defined on output are successfully devirtualized as a result. Here's a CoreCLR thread by Andy Ayers on the JIT team which details current and future work for devirtualization.

Reduce allocations

.NET's garbage collector is a remarkable piece of engineering. GC allows for algorithmic optimizations for some lock-free data structures and also removes whole classes of bugs and lightens the developer's cognitive load. All things considered, garbage collection is a tremendously successful technique for memory management.

However, while the GC is a powerful work horse, it helps to lighten its load not only because it means your application will pause for collection less often (and more generally, less CPU time will be devoted to GC work), but also because lightening working set is beneficial for cache locality.

The rule-of-thumb for allocations is that they should either die in the first generation (Gen0) or live forever in the last (Gen2).

.NET uses a bump allocator where each thread allocates objects from its per-thread context by 'bumping' a pointer. For this reason, better cache locality can be achieved for short-lived allocations when they are allocated and used on the same thread.

For more info on .NET's GC, see Matt Warren's blog post series, Learning How Garbage Collectors Work here and pre-order Konrad Kokosa's book, Pro .NET Memory Management here. Also check out his fantastic free .NET memory management poster here, it's a great reference.

Pool buffers/objects

Hagar itself doesn't manage buffers but instead defers the responsibility to the user. This might sound onerous but it's not, since it's compatible with System.IO.Pipelines. Therefore, we can take advantage of the high performance buffer pooling which the default Pipe provides by means of System.Buffers.ArrayPool<T>.

Generally speaking, reusing buffers lets you put much less pressure on the GC - your users will be thankful. Don't write your own buffer pool, unless you truly need to, though - those times have passed.

Avoid boxing

Wherever possible, do not box value types by casting them to a reference type. This is common advice, but it requires some consideration in your API design. In Hagar, interface and method definitions which might accept value types are made generic so that they can be specialized to the precise type and avoid boxing/unboxing costs. As a result, there is no hot-path boxing. Boxing is still present in some cases, such as string formatting for exception methods. Those particular boxing allocations can be removed by explicit .ToString() calls on the arguments.

Reduce closure allocations

Allocate closures only once and store the result for repeated use. For example, it's common to pass a delegate to ConcurrentDictionary<K, V>.GetOrAdd. Instead of writing the delegate as an inline lambda, define is as a private field on the class. Here an example from the optional ISerializable support package in Hagar:

private readonly Func<Type, Action<object, SerializationInfo, StreamingContext>> createConstructorDelegate;

public ObjectSerializer(SerializationConstructorFactory constructorFactory)
{
    // Other parameters/statements omitted.
    this.createConstructorDelegate = constructorFactory.GetSerializationConstructorDelegate;
}

// Later, on a hot code path:
var constructor = this.constructors.GetOrAdd(info.ObjectType, this.createConstructorDelegate);

Minimize copying

.NET Core 2.0 and 2.1 and recent C# versions have made considerable strides in allowing library developers to eliminate data copying. The most notable addition is Span<T>, but it's also worth mentioning in parameter modifiers and readonly struct.

Use `Span<T>` to avoid array allocations and avoid data copying

Span<T> and friends are a gigantic performance win for .NET, particularly .NET Core where they use an optimized representation to reduce their size, which required adding GC support for interior pointers. Interior pointers are managed references which point to within the bounds of an array, as opposed to only being able to point to the first element and therefore requiring an additional field containing an offset into the array. For more info on Span<T> and friends, read Stephen Toub's article, All About Span: Exploring a New .NET Mainstay here.

Hagar makes extensive use of Span<T> because it allows us to cheaply create views over small sections of larger buffers to work with. Enough has been written on the subject that there's no use me writing more here.

Pass structs by `ref` to minimize on-stack copies

Hagar uses two main structs, Reader & Writer<TOutputBuffer>. These structs each contain several fields and are passed to almost every call along the serialization/deserialization call path.

Without intervention, each method call made with these structs would carry significant weight since the entire struct would need to be copied onto the stack for every call, not to mention any mutations would need to be copied back to the caller.

We can avoid that cost by passing these structs as ref parameters. C# also supports using ref this as the target for an extension method, which is very convenient. As far as I know, there's no way to ensure that a particular struct type is always passed by ref and this can lead to subtle bugs if you accidentally omit ref in the parameter list of a call, since the struct will be silently copied and modifications made by a method (eg, advancing a write pointer) will be lost.

Avoid defensive copies

Roslyn has to do some work to guarantee some language invariants sometimes. When a struct is stored in a readonly field, the compiler will insert instructions to defensively copy that field before involving it in any operation which isn't guaranteed to not mutate it. Typically this means calls to method defined on the struct type itself because passing a struct as argument to a method defined on another type already requires copying the struct onto the stack (unless it's passed by ref or in).

This defensive copy can be avoided if the struct is defined as a readonly struct, which is a C# 7.2 language feature, enabled by adding <LangVersion>7.2</LangVersion> to your csproj file.

Sometimes it's better to omit the readonly modifier on an otherwise immutable struct field if you are unable to define it as a readonly struct.

See Jon Skeet's NodaTime library as an example. In this PR, Jon made most structs readonly and was therefore able to add the readonly modifier to fields holding those structs without negatively impacting performance.

Reduce branching & branch misprediction

Modern CPUs rely on having long pipelines of instructions which are processed with some concurrency. This involves the CPU analyzing instructions to determine which ones aren't reliant on previous instructions and also involves guessing which conditional jump statements are going to be taken. In order to do this, the CPU uses a component called the branch predictor which is responsible for guessing which branch will be taken. It typically does this by reading & writing entries in a table, revising its prediction based upon what happened last time the conditional jump was executed.

When it guesses correctly, this prediction process provides a substantial speedup. When it mispredicts the branch (jump target), however, it needs to throw out all of the work performed in processing instructions after the branch and re-fill the pipeline with instructions from the correct branch before continuing execution.

The fastest branch is no branch. First try to minimize the number of branches, always measuring whether or not your alternative is faster. When you cannot eliminate a branch, try to minimize misprediction rates. This may involve using sorted data or restructuring your code.

One strategy for eliminating a branch is to replace it with a lookup. Sometimes an algorithm can be made branch-free instead of using conditionals. Sometimes hardware intrinsics can be used to eliminate branching.

Other miscellaneous tips

Avoid LINQ. LINQ is great in application code, but rarely belongs on a hot path in library/framework code. LINQ is difficult for the JIT to optimize (IEnumerable<T>...) and tends to be allocation-happy.
Use concrete types instead of interfaces or abstract types. This was mentioned above in the context of inlining, but this has other benefits. Perhaps the most common being that if you are iterating over a List<T>, it's best to not cast that list to IEnumerable<T> first (eg, by using LINQ or passing it to a method as an IEnumerable<T> parameter). The reason for this is that enumerating over a list using foreach uses a non-allocating List<T>.Enumerator struct, but when it's cast to IEnumerable<T>, that struct must be boxed to IEnumerator<T> for foreach.
Reflection is exceptionally useful in library code, but it will kill you if you give it the chance. Cache the results of reflection, consider generating delegates for accessors using IL or Roslyn, or better yet, use an existing library such as Microsoft.Extensions.ObjectMethodExecutor.Sources, Microsoft.Extensions.PropertyHelper.Sources, or FastMember.

Library-specific optimizations

Optimize generated code

Hagar uses Roslyn to generate C# code for the POCOs you want to serialize, and this C# code is included in your project at compile time. There are some optimizations which we can perform on the generated code to make things faster.

Avoid virtual calls by skipping codec lookup for well-known types

When complex objects contain well known fields such as int, Guid, string, the code generator will directly insert calls to the hand-coded codecs for those types instead of calling into the CodecProvider to retrieve an IFieldCodec<T> instance for that type. This lets the JIT inline those calls and avoids virtual/interface indirection.

(Unimplemented) Specialize generic types at runtime

Similar to above, the code generator could generate code which uses specialization at runtime.

Pre-compute constant values to eliminate some branching

During serialization, each field is prefixed with a header – usually a single byte – which tells the deserializer which field was encoded. This field header contains 3 pieces of info: the wire type of the field (fixed-width, length-prefixed, tag-delimited, referenced, etc), the schema type of the field (expected, well-known, previously-defined, encoded) which is used for polymorphism, and dedicates the last 3 bits to encoding the field id (if it's less than 7). In many cases, it's possible to know exactly what this header byte will be at compile time. If a field has a value type, then we know that the runtime type can never differ from the field type and we always know the field id.

Therefore, we can often save all of the work required to compute the header value and can directly embed it into code as a constant. This saves branching and generally eliminates a lot of IL code.

Choose appropriate data structures

One of the big performance disadvantages Hagar has when compared to other serializers such as protobuf-net (in its default configuration?) and MessagePack-CSharp is that it supports cyclic graphs and therefore must track objects as they're serialized so that object cycles are not lost during deserialization. When this was first implemented, the core data structure was a Dictionary<object, int>. It was clear in initial benchmarking that reference tracking was a dominating cost. In particular, clearing the dictionary between messages was expensive. By switching to an array of structs instead, the cost of indexing and maintaining the collection is largely eliminated and reference tracking no longer appears in the benchmarks. There is a downside to this: for large object graphs it's likely that this new approach is slower. If that becomes an issue, we can decide to dynamically switch between implementations.

Choose appropriate algorithms

Hagar spends a lot of time encoding/decoding variable-length integers, often referred to as varints, in order to reduce the size of the payload (which can be more compact for storage/transport). Many binary serializers use this technique, including Protocol Buffers. Even .NET's BinaryWriter uses this encoding. Here's a snippet from the reference source:

protected void Write7BitEncodedInt(int value) {
    // Write out an int 7 bits at a time.  The high bit of the byte,
    // when on, tells reader to continue reading more bytes.
    uint v = (uint) value;   // support negative numbers
    while (v >= 0x80) {
        Write((byte) (v | 0x80));
        v >>= 7;
    }
    Write((byte)v);
}

Looking at this source, I want to point out that ZigZag encoding may be more efficient for signed integers which contain negative values, rather than casting to uint.

VarInts in these serializers use an algorithm called Little Endian Base-128 or LEB128, which encodes up to 7 bits per byte. It uses the most significant bit of each byte to indicate whether or not another byte follows (1 = yes, 0 = no). This is a simple format but it may not be the fastest. It might turn out that PrefixVarint is faster. With PrefixVarint, all of those 1s from LEB128 are written in one shot, at the beginning of the payload. This may let us use hardware intrinsics to improve the speed of this encoding & decoding. By moving the size information to the front, we may also be able to read more bytes at a time from the payload, reducing internal bookkeeping and improving performance. If someone wants to implement this in C#, I will happily take a PR if it turns out to be faster.

Hopefully you've found something useful in this post. Let me know if something is unclear or you have something to add. Since I started writing this, I've moved to Redmond and officially joined Microsoft on the Orleans team, working on some very exciting things.