My day job is writing a database engine named RavenDB. My preferred language to code in is C#. Those two facts tend to cause glitches in some people's minds. How can you write a database engine, which is all about low-level system programming, in a high-level language such as C#? How much performance are you leaving on the table by not choosing a proper system-level language, such as C++ or Rust?

It turns out that there isn't a conflict between the two. You can write high-performance code in C# that matches or beats the performance of native code. The best part is that you can achieve that while retaining many of the advantages that C# offers in the first place.

C# has a lot of expressiveness and high-level concepts, but from the get-go, it has some critical features for low-level systems. Value types, pointers, and unsafe code have all been part of C# from its initial creation. Recent additions to the .NET platform, such as spans, vector instructions, and hardware intrinsics make it much simpler to write high-performance and low-level code.
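To make that a bit more concrete, here's a minimal sketch (not RavenDB code, just an illustration of those building blocks): a stack-allocated buffer accessed through a Span<byte>, and a SIMD-style sum using Vector<int>, which maps to the widest vector registers the CPU offers.

using System;
using System.Numerics;

// Stack allocation, no GC involvement: the buffer lives on the stack
// and is accessed through a bounds-checked Span<byte>.
Span<byte> buffer = stackalloc byte[64];
buffer.Fill(0xFF);

// Vectorized sum without dropping to assembly: Vector<int> uses
// SSE/AVX on x64 and NEON on ARM, whatever the hardware provides.
int[] values = new int[1024];
for (int i = 0; i < values.Length; i++) values[i] = i;

var acc = Vector<int>.Zero;
int j = 0;
for (; j <= values.Length - Vector<int>.Count; j += Vector<int>.Count)
    acc += new Vector<int>(values, j);    // adds Vector<int>.Count ints at once

int sum = Vector.Dot(acc, Vector<int>.One); // horizontal add of the lanes
for (; j < values.Length; j++) sum += values[j]; // scalar tail, if any

Console.WriteLine(sum); // 523776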

I want to be able to talk in concrete terms, so for the rest of this article, I'm going to focus on a relatively straightforward task. The code accepts a CSV file (gzipped) and computes aggregated statistics based on the data. You can view a sample of the data in the file in Table 1.

I generated a large file to test how well this code performs (the script to generate the data is available online at: https://gist.github.com/ayende/e84dac76e5dc023f6a80367f6c01ac13), and obtained a file with 250 million records and 9.78 million unique users. The compressed file size is about 2.4 GB. I intentionally chose a large size, which should clearly demonstrate the difference in performance between approaches.

The task at hand is to compute the total quantity and price for each user in the file. This is a simple task and the code to write it is almost painfully obvious. The input is a GZipped CSV file, so the first action is to decompress the data and iterate through the file one line at a time, as shown in Listing 1.

Listing 1: Decompressing and yielding the lines from the file.

async IAsyncEnumerable<string> GzipReadlAllLinesAsync(Stream input) {
    using var gzipStream = new GZipStream(input, CompressionMode.Decompress);
    using var reader = new StreamReader(gzipStream);
 
    while (true) {
        string? line = await reader.ReadLineAsync();
        if (line == null)
            break;
        yield return line;
    }
}

For reference, if I were to execute GzipReadlAllLinesAsync(input).CountAsync() alone, it would iterate over the entire file and count the number of lines in it. Executing this code on my machine yields the following results:

  • Total allocations: 53 GB
  • Execution time: 47 seconds

This is without doing any actual work, mind you; it's just reading through the dataset.

The next task is to actually do something with the data and generate the aggregated results. I'll start from high-level code and write the aggregation using LINQ. Using the System.Linq.Async package, you can apply a LINQ expression to the results of GzipReadlAllLinesAsync(), which will do all the computation in a very clear fashion. You can see how this looks in Listing 2.

Listing 2: Computing total and quantity per user using LINQ

ValueTask<Dictionary<long, UserSales>> Linq(Stream input) {
    return (from l in GzipReadlAllLinesAsync(input)
      .Skip(1) // skip header line
        let flds = l.Split(',')
        let item = new {
            UserId = long.Parse(flds[0]),
            Quantity = int.Parse(flds[2]),
            Price = decimal.Parse(flds[3])
        }
        group item by item.UserId into g
        select new {
            UserId = g.Key,
            Quantity = g.SumAsync(x => x.Quantity),
            Total = g.SumAsync(x => x.Price)
        }).ToDictionaryAsync(x => x.UserId, x => new UserSales {
            Quantity = x.Quantity.Result,
            Total = x.Total.Result
        });
}

Before I discuss performance, I'd like you to take a look at Listing 2 and see how I was able to write precisely what I wanted in an extremely succinct and expressive manner. The fact that I can do that so simply is one of the great strengths of C#, in my opinion. But good-looking code is just the first stage, let's see how well it performs at runtime.

When evaluating the performance of a piece of code, it's important not to look at just a single metric. It's easy to trade off execution time for memory usage, for example. I'm running all the code in this article using .NET 8.0 in release mode. Here are the execution results:

  • Peak memory: 18 GB
  • Total allocations: 163 GB
  • Execution time: 8 minutes, 19 seconds

Why am I mentioning memory and allocations in my metrics? The value of “total allocations” refers to the amount of memory the system has allocated over its lifetime, while “peak memory” represents the maximum amount of memory used at any given point in time. When talking about .NET programs, reducing allocations tends to have an outsized impact on performance, especially if you're running with multiple threads at once. That's because the more allocations you do, the more work the garbage collector has to handle. With multiple threads all allocating memory, it's easy to reach a point where you're mostly burning CPU cycles on garbage collection.
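As an aside, you don't need a full profiler to get rough numbers like these. Here's a minimal sketch of one way to capture them in-process; the RunScenario() placeholder stands in for whatever variant is being measured, and the figures in this article may well have come from a profiler instead.

using System;
using System.Diagnostics;

long allocatedBefore = GC.GetTotalAllocatedBytes(precise: true);
var stopwatch = Stopwatch.StartNew();

RunScenario(); // the code under test goes here

stopwatch.Stop();
long allocated = GC.GetTotalAllocatedBytes(precise: true) - allocatedBefore;
// Peak working set approximates "peak memory"; it includes more than
// just the managed heap.
long peakWorkingSet = Process.GetCurrentProcess().PeakWorkingSet64;

Console.WriteLine($"Time: {stopwatch.Elapsed}");
Console.WriteLine($"Allocated: {allocated / 1024 / 1024:N0} MB");
Console.WriteLine($"Peak memory: {peakWorkingSet / 1024 / 1024:N0} MB");

static void RunScenario() { /* placeholder for the scenario being measured */ }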

The compressed file size is 2.4 GB, but uncompressed, the data is about 3.7 GB. Given the amount of time this code took to run, it was processing under 8 MB a second. That's really bad. The reason is that the code in Listing 2 executes its operations in stages: first it runs through all the records in the file, grouping them by user ID, and only then does it run the aggregation for each user. That means it has to retain the entire dataset in memory until the process completes.

Instead of using LINQ, you can choose to write the aggregation code yourself. You can see the code for this in Listing 3 where you're basically doing the aggregation inline. For each record, you look up the relevant user statistics and update them. You only need to keep track of the users' statistics, not the entire data set.

Listing 3: Manually aggregating over the records

async Task<Dictionary<long, UserSales>> 
  StreamReaderAndDictionary(Stream input) {
   
    var sales = new Dictionary<long, UserSales>();
    await foreach (var line in GzipReadlAllLinesAsync(input).Skip(1)) {
        var fields = line.Split(',');
        var uid = long.Parse(fields[0]);
        int quantity = int.Parse(fields[2]);
        decimal price = decimal.Parse(fields[3]);

        if (!sales.TryGetValue(uid, out var stats))
            sales[uid] = stats = new UserSales();

        stats.Total += price * quantity;
        stats.Quantity += quantity;
    }
    return sales;
}

What's the result of executing this code, then? It's going to use less memory, but how much faster is it going to be?

  • Peak memory: 917 MB
  • Total allocations: 126,999 MB (127 GB)
  • Execution time: 1 minute, 51 seconds

The amount of code that you wrote isn't significantly different between Listing 2 and Listing 3. Both are pretty clear and obvious in how they do things. However, you saved almost four-fifths (!) of the runtime and are using about 5% (!) of the memory. That is an amazing difference by all accounts.

It's also interesting to realize that the total amount of allocation isn't that drastically different. It's 163 GB vs. 127 GB, even though all the other metrics are far better. Why is that? Most of the allocations here are fairly unavoidable because you read the lines from the file one line at a time and generate a lot of strings. Additional strings are also allocated during the parsing of each line because you use Split(',').
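To make that concrete, here's a rough sketch of what a single record costs in Listing 3. The sample line is adapted from Table 1; the exact byte counts don't matter, the allocation count does.

using System;

// One string per line from ReadLineAsync(), then Split(',') allocates a
// string[] plus one new string per field.
string line = "2227114809,25203189368466850000,9,17.03,24/12/2023 11:02:20";
string[] fields = line.Split(',');
Console.WriteLine(fields.Length); // 5 fields -> 1 line string + 1 array + 5 field strings

// Multiply roughly 7 short-lived objects per record by 250,000,000 records
// and the total allocation figure stops being surprising.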

The code in Listing 3 has far better performance than the initial version, but it can still be improved. A key issue is the manner in which you process the data from the file: you're creating a lot of strings in the process, which is a big waste; you can do better. Instead of allocating strings for each line of the file (remember, there are 250,000,000 of them), you can use a lower-level API.

System.IO.Pipelines is one such low-level API that's meant to provide an efficient way to read (and write) data in .NET. In particular, it uses a single reusable buffer to do the work and allows you to access it directly instead of performing allocations. The API is somewhat less convenient to use, but the performance difference more than makes up for this. Take a look at the code in Listing 4, showing the skeleton of using Pipelines for processing the file.

Listing 4: Reading from the file using PipeReader

async ValueTask<Dictionary<long, UserSalesStruct>>
  PipelineAsync(Stream input) {
    using var gzipStream = new GZipStream(input, CompressionMode.Decompress);
    var pipeReader = PipeReader.Create(gzipStream, 
      new StreamPipeReaderOptions(bufferSize: 64 * 1024));

    var header = false;
    var salesData = new Dictionary<long, UserSalesStruct>();
    while (true) {
        var valueTask = pipeReader.ReadAsync();
        var read = valueTask.IsCompleted ? valueTask.Result :
            await valueTask.AsTask();
        ProcessLines(pipeReader, read, salesData, ref header);
        if (read.IsCompleted)
            break;
    }

    return salesData;
}

Unlike previous code listings, the code in Listing 4 isn't actually doing much; most of the work is handled by the ProcessLines() method, which will be covered shortly. What's interesting here is that all the I/O is handled in the parent method. You ask the pipe to read a buffer from the stream and then process it. A key aspect is that the buffer isn't limited to a single line, nor is it guaranteed to end on a line boundary. Instead, you get some buffer from the file (64 KB in this case) and process everything in it.

The really nice thing about the System.IO.Pipelines approach is that it encourages a highly efficient way of processing incoming data. Let's take a look at Listing 5, where I actually do that, and then I'll discuss exactly what's going on.

Listing 5: Processing lines in the buffer

void ProcessLines(PipeReader pipeReader, ReadResult readResult,
  Dictionary<long, UserSalesStruct> salesData, ref bool header) {
   
    var sr = new SequenceReader<byte>(readResult.Buffer);
    while (true) {
        ReadOnlySequence<byte> line;
        if (sr.TryReadTo(out line, (byte)'\n') == false) {
            pipeReader.AdvanceTo(consumed: sr.Position,
                examined: readResult.Buffer.End);
            break;
        }
        if (header == false) {
            header = true;
            continue;
        }
        ProcessSingleLine(salesData, line);
    }
}

The code in Listing 5 isn't really doing much, but it's one of those cases where I removed everything non-essential to expose the core beauty. The PipeReader reads from the stream into a buffer, which is held by the ReadResult. You use a SequenceReader<byte> to find the next line in the buffer and get back a reference (into the same original buffer) to the range that makes up the line. If you can't find a line break, that means you need to read more from the stream, so you tell the pipe reader how far you consumed and how far you examined the buffer. The next call to ReadAsync() (shown in Listing 4) will then know what can be discarded and what more needs to be read from the stream.

There isn't a lot of code here, but from an architectural perspective, there's a lot going on. The end result is that you can use a single reusable buffer to process the data. The ProcessLines() method also contains a loop. It's going to be running over the entire buffer, which is very likely to contain many lines. Instead of operating on a per-line basis, it's now operating over a far bigger batch of items. That also has implications with regard to performance, because the hot loop needs to cover a lot less code to do its job. The actual processing of the lines is shown in Listing 6, which also contains some interesting code patterns to talk about.

Listing 6: Processing a single line

[MethodImpl(MethodImplOptions.AggressiveInlining)]
void ProcessSingleLine(
   Dictionary<long, UserSalesStruct> salesData,
   ReadOnlySequence<byte> line)
{
    var lr = new SequenceReader<byte>(line);
    ReadOnlySpan<byte> span;
    var readAll = lr.TryReadTo(out span, (byte)',')
        & Utf8Parser.TryParse(span, out long userId, out _)
        & lr.TryAdvanceTo((byte)',')
        & lr.TryReadTo(out span, (byte)',')
        & Utf8Parser.TryParse(span, out int quantity, out _)
        & lr.TryReadTo(out span, (byte)',')
        & Utf8Parser.TryParse(span, out decimal price, out _);

    if (readAll == false) throw new InvalidDataException(
      "Couldn't parse expected fields on: " + 
      Encoding.UTF8.GetString(line));

    ref var current = ref CollectionsMarshal
      .GetValueRefOrAddDefault(salesData, userId, out _);
    current.Total += price;
    current.Quantity += quantity;
}

I'm afraid the code is pretty dense, but the idea is simple. You use another SequenceReader<byte> to find the field terminators in the line buffer you're given, and then use Utf8Parser to parse the fields without allocating any strings. The code is complicated because it's a single expression and because I wanted to produce the correct error in the case of invalid input. Note that I'm using the non-short-circuiting AND (&) rather than the more common conditional AND (&&). This ensures that the entire expression runs and avoids branch instructions. This is the hotspot in the code, so anything that reduces costs is advisable, and branches can be costly (even if well predicted, as in this case).
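To see that difference in isolation, here's a tiny sketch with made-up input (not the article's parsing code): with &&, each step would short-circuit and become its own conditional branch; with &, both parses always execute and only the combined result is checked.

using System;
using System.Buffers.Text;
using System.Text;

// Made-up two-field input, just to illustrate & vs &&.
ReadOnlySpan<byte> input = Encoding.UTF8.GetBytes("9,17.03");
int comma = input.IndexOf((byte)',');

// Both TryParse calls run unconditionally; there is no short-circuit
// branch between them, only a single check of the combined result.
bool ok = Utf8Parser.TryParse(input[..comma], out int quantity, out _)
        & Utf8Parser.TryParse(input[(comma + 1)..], out decimal price, out _);

Console.WriteLine(ok ? $"{quantity} x {price}" : "parse failed");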

I'm also not using the Dictionary the way you'd normally expect: I'm calling CollectionsMarshal to get or add the value. You may have noticed that Listings 4, 5, and 6 all use UserSalesStruct as the value, instead of UserSales (a class). That change lets you avoid even more allocations and take advantage of the GetValueRefOrAddDefault() behavior.

A very typical access pattern when using dictionaries is to look up a value by key and create it if it isn't in the dictionary. That requires two dictionary lookups when the value has to be created; GetValueRefOrAddDefault() means you only need one. Because a struct in C# is guaranteed to be zero-initialized, you don't care whether the value already existed, and you can immediately add the current record's values to the user's sales.
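Here's a minimal sketch of the difference, assuming a UserSalesStruct with Total and Quantity fields (the article doesn't show its declaration, but that matches how Listing 6 uses it).

using System;
using System.Collections.Generic;
using System.Runtime.InteropServices;

var sales = new Dictionary<long, UserSalesStruct>();

// Classic pattern: up to two hash lookups per record (TryGetValue to read,
// then the indexer to write the modified struct back).
void AddClassic(long userId, int quantity, decimal price) {
    if (!sales.TryGetValue(userId, out var current))
        current = default;           // struct values start zero-initialized
    current.Quantity += quantity;
    current.Total += price;
    sales[userId] = current;         // second lookup to store the copy
}

// Single lookup: the ref points directly into the dictionary's storage,
// so there is nothing to write back, and a missing key is added as zeros.
void AddWithRef(long userId, int quantity, decimal price) {
    ref var current = ref CollectionsMarshal
        .GetValueRefOrAddDefault(sales, userId, out _);
    current.Quantity += quantity;
    current.Total += price;
}

AddClassic(2227114809, 9, 17.03m);
AddWithRef(2227114809, 2, 11.49m);
Console.WriteLine(sales[2227114809].Quantity); // 11

struct UserSalesStruct { public decimal Total; public int Quantity; }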

Finally, you might have noticed that the method in Listing 6 has an interesting attribute: [MethodImpl(MethodImplOptions.AggressiveInlining)]. This is an instruction to the Just-In-Time compiler to force inlining of this method into its caller. This allows you to benefit from clear code and separation of responsibilities while still getting dense (and efficient) machine code.

The Pipeline version is spread over three code listings, and I had to reach quite far off the beaten path. What did you get, then, from this journey? Take a look at these results:

  • Peak memory: 846 MB
  • Total allocations: 1069 MB
  • Execution time: 50.5 seconds

I triple checked those results because they are really good. You can see the full results in Table 2, but the Pipelines method runs in less than half the time of the StreamReader + Dictionary approach from Listing 3. It also allocates less than 1% of the memory.

The task of processing a file and aggregating the results is a simple one, the sort of thing that's commonly given to students as a course assignment. It's also a pretty good scenario for testing our capabilities because it's so limited in scope. The point of this article isn't to point you toward the System.IO.Pipelines API (although I'll admit that it's a favorite of mine); it's to demonstrate how you can entirely change the manner in which you perform a task, from an extremely high-level LINQ expression all the way down to a reusable buffer and almost no allocations.

C# and .NET allow you to create solutions for a very wide range of scenarios. From building a business application with a focus on correctness and time to market, all the way down to database engines such as RavenDB. In RavenDB, you'll find that there is no such thing as “fast enough.” You can spend literally months optimizing a single piece of code to be just a little bit faster.

RavenDB makes a lot of use of seemingly esoteric features such as hardware intrinsics, vector operations, and zero-allocation APIs. The nice thing is that you can meet your performance goals while still writing in C#.

That isn't just a personal preference of wanting to stay in a comfort zone. I started writing C# code because of the experience of getting a proper error message and a detailed stack trace, rather than some hex code and "you figure it out." That experience is still very much true today: using C# as the primary language for RavenDB means that you can use all of the infrastructure around .NET for profiling, debugging, and monitoring. The quality of the tooling around .NET also matters a lot.

Beyond exploring a small way in which you can improve performance by an order of magnitude, I want to emphasize that performance is a journey. You need to align your overall architecture and the algorithms you use with the actual quality of implementation and micro-optimizations. Using RavenDB as an example again, it's common, in certain hotspots, to modify (C#) code in order to get the right machine code for your needs. It's quite amazing that you can do that when you need to. Doubly so, as you can do that for both x64 and ARM machines.

Finally, I'll leave you with this to consider. It's often best to go the other way around. Look at the best practices for high-performance implementations and then walk backward to an architecture that would enable you to use that. With that done, you can then selectively implement the hotspots as needed, without having to struggle with past architectural decisions.

Table 1: Sample data from the orders CSV file.

UserID        ItemID                 Quantity   Price   Date
2227114809    25203189368466850000   9          17.03   24/12/2023 11:02:20
2227114809    21618666246057250000   2          11.49   24/12/2023 11:02:20
2339231339    57334974373793960000   1          40.69   24/12/2023 11:02:20
2314541222    32702876167686390000   2          15.56   24/12/2023 11:02:20
2314541222    35203189368466850000   2           7.03   24/12/2023 11:02:20

Table 2: Summary of performance and memory consumption across all scenarios

Scenario                      Time (sec)   Allocations (MB)   Peak memory (MB)
Linq                          499          166,660            18,286
StreamReader + Dictionary     111          117,272            917
Pipelines                     50.5         1,069              846