
How to 10x the performance of most software

I have a very large and excellent book called Systems Performance, written by Brendan Gregg, who, among other things, made Netflix faster. Opening it to a random page, I see an overview of what appears to be the TCP stack in the Linux kernel.

Near the beginning of the book there is a chapter describing various methodologies for performance tuning, starting with what Brendan calls anti-methods.

One method is the random-change anti-method, that is, looking at the code, trying stuff and manually running the program to see if the issue has gone away. Another is the ad hoc checklist method, which is not referred to as an anti-method, but is listed directly after them.

What I’m about to suggest is somewhat of a combination of these two methods and perhaps the slightest bit of more intelligent analysis. It’s essentially what most engineers who achieve a large performance gain do and it works because of the absolute state of software.

The natural dynamics of the market mean that it is usually safer to quickly write slow software than to slowly write fast software. Not always of course because sometimes the main selling point of an application or component is speed, but even then this is relative to what came before.

Note that this method does benefit from familiarity with the project in question and the components involved. The more unknowns there are the more a systematic approach will be superior to this.

10x is easier than 2x

There is a business book I haven’t read called “10x Is Easier than 2x”. The idea is that it is easier to aim for a 10x increase in revenue than a 2x one.

When optimizing software there is some truth to the idea that focusing on changes that will, in a single blow, increase performance by 10x is easier than trying to find multiple 1-2x performance increases and compounding them.

Measurement bias, noise and the general difficulty of setting up a benchmark environment that can accurately detect small increases in performance all mean that, if there is a radical thing you can do to increase performance by an order of magnitude or more, it can actually be the easier option.

A single sample can be statistically significant if the effect is large enough relative to the noise. Meaning that if a process that previously took hours is reduced to seconds, then you can be fairly sure your performance optimization worked without an advanced benchmarking setup.

You of course need to test for functionality regressions and edge cases which result in a timeout, but you don’t need to worry about whether a speed up is due to an environmental change or noise.

What’s more, the changes required are likely to be radical to the extent that pure logic or theory can be used to predict the performance outcome. You also won’t have as many options to choose from and will be forced to focus on things that are high impact.

Meanwhile, a radical change isn’t necessarily difficult. To some extent this is just a filtering exercise: you look for the low-hanging fruit that’s especially juicy, ignoring stuff that is too difficult or low impact.

Logically it is not easier to 10x than to 2x because there are multiple 2x’s in a 10x. The point, I guess, is to discard tasks that won’t result in a big improvement because there are better things to be doing.

So something to ask yourself before trying to improve performance is: will a 10x improvement in performance result in a similar sized cost saving or improvement in sales? The latter is often very hard to measure, but you can pick some other metric as a proxy.

Code optimization method

  1. Isolate the slow feature with a minimum of effort

This could be done with profiling or tracing tools, but also just using the program and seeing what is visibly slow, uses up system memory, spends credits on AWS or causes the CPU fan to spin up.

  2. Look at the code or a code trace

Open up the source code and look for slow things (see below) especially anything that happens in a loop. Stepping through the code using debugging tools or a profiler can help, but it is also good to read the code and think.

  3. Rewrite the code to do less work

Can the code be rewritten to not do something or only do it once?

  4. Try the slow feature again or run a simple benchmark

Does it now feel nice when using the program? Do a few runs of the benchmark show a radical improvement?

Because we are looking for very large changes we don’t need advanced benchmarking tools for performance; however, you may need to check that performance doesn’t violently degrade with certain inputs.
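As a sketch of what a “simple benchmark” can mean here, the following assumes a hypothetical slow_feature() function in your project; a few hand-timed runs are enough when you are only looking for order-of-magnitude differences.

```python
import time

def time_it(fn, runs=3):
    # Crude timing is fine here: we only care whether the change is
    # an order of magnitude, not a few percent.
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        print(f"{fn.__name__}: {time.perf_counter() - start:.3f}s")

# time_it(slow_feature)  # run before and after the change
```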

Things to look for when optimizing code

None of these things are necessarily slow and there is overlap between them. They are just places to start looking, assuming the problem isn’t so obvious that you can simply guess.

  1. Loops and recursion (i.e. the algorithm)

Unless a loop only ever iterates a small number of times, it should come under some suspicion. If the number of iterations depends on the size of the input and grows faster than the input does, then it should come under maximum suspicion.

Operations that would usually be considered fast suddenly become slow and dangerous if they happen inside a loop.

Using the wrong data structure or algorithm can result in unnecessary looping. For example, repeatedly searching an unsorted list or array instead of sorting it once and then using the faster search algorithm that sorting enables.

Searching an array instead of using a hash map is another classic example. Often, though, doing the search is not the problem; doing it repeatedly within another loop is.

Sometimes introducing more loops is faster than caching the results or using a fancy data structure or algorithm. Nevertheless when you see a loop at the top level that could iterate many times this is a place to start looking.
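To make the repeated-search problem concrete, here is a hypothetical sketch (the order and ID names are made up): the slow version scans a list for every order, while the fast one builds a set once and then does constant-time lookups.

```python
# Slow: for every order we scan the whole known_ids list, roughly O(n * m).
def invalid_orders_slow(orders, known_ids):
    return [o for o in orders if o["id"] not in known_ids]

# Fast: build a set once, then each membership check is O(1) on average.
def invalid_orders_fast(orders, known_ids):
    known = set(known_ids)
    return [o for o in orders if o["id"] not in known]
```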

  2. I/O

In particular database queries and creating files. Databases and file systems are typically very very fast, but if you, for example, repeatedly create a temporary file, copy data into it and copy it back out again before deleting it, then you could be thrashing your system.

A problem I’ve often seen with databases is not really with the database, but with ORMs (object-relational mapping) or even just the database connection library. In fact it’s not even with the ORM, it’s because the ORM is being used to extract an entire database table into memory.

Another problem I have seen a lot is with RPCs (remote procedure calls) and not even because of the network latency, but because the RPC transmits a huge amount of unfiltered JSON that an interpreted language struggles to parse and turn into native objects. The network may not even struggle to send that much JSON, but the memory and CPU requirements to turn it all into a language’s arrays and objects is very high.

Removing I/O altogether, for example by using an in-memory cache or recalculating values when needed, can be faster, but often it is not. The point is that the interface between an application and the outside world is a sensitive area for performance.
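A minimal sketch of the whole-table-into-memory problem, using Python’s built-in sqlite3 module; the invoices table and its columns are made up for the example.

```python
import sqlite3

conn = sqlite3.connect("app.db")  # hypothetical database

# Slow pattern: pull every row into memory and filter in the application.
def overdue_invoices_slow():
    rows = conn.execute("SELECT * FROM invoices").fetchall()
    return [r for r in rows if r[3] == "overdue"]  # column 3 assumed to be status

# Let the database do the filtering and only return the columns needed.
def overdue_invoices_fast():
    return conn.execute(
        "SELECT id, amount FROM invoices WHERE status = 'overdue'"
    ).fetchall()
```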

  3. Synchronisation and processes

In applications with multiple threads, processes or instances on different servers, you will have points where everything needs to agree.

For example, one process will need to take a lock and check a value set by another. In fact it’s not uncommon to see a process call sleep in a loop while polling something.

Problems with synchronisation can be more difficult to detect because the program may not be using a lot of CPU time; it may spend a lot of time sleeping while waiting for a lock.

Locks and calls to sleep can be hidden in library functions or appear in unexpected places. Problems with synchronisation may only appear in production when running at capacity.

Creating new threads or processes is relatively expensive compared to doing calculations. Something to look out for is external processes being created in a loop. Some languages have a thread pool built in, so seeing a thread being created is not necessarily a bad sign.
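To make the sleep-in-a-loop pattern concrete, here is a small hypothetical sketch using Python’s standard threading module: the polling consumer adds up to its whole sleep interval of latency on every hand-off, while the event-based one is woken as soon as the producer is done.

```python
import threading
import time

ready = threading.Event()
result = {}

def producer():
    result["value"] = expensive_calculation()  # hypothetical work
    ready.set()

# Polling: wakes up repeatedly just to check a flag.
def consumer_polling():
    while "value" not in result:
        time.sleep(0.1)
    return result["value"]

# Blocking on the event: no polling interval to tune.
def consumer_event():
    ready.wait()
    return result["value"]
```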

  4. Memory allocation and copying memory

The worst crimes involving memory allocation I have seen were done during I/O. However in dynamic languages it can creep in anywhere and cause huge slow downs.

During calculations or something like building a string, you may find that lots of intermediate results are created and they each require temporary allocations. Then the garbage collector has to free all of those objects.

It varies from language to language which objects cause an allocation and how good the runtime is at optimising out intermediate results. Sometimes the runtime can detect that an object can be modified in place instead of creating a copy.
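A common small-scale example of this is string building. The sketch below is hypothetical; note that CPython can sometimes optimise the += case in place, which is exactly the kind of runtime-dependent behaviour described above.

```python
parts = [str(i) for i in range(100_000)]  # stand-in for intermediate results

# Each += may create a new string and copy the old contents into it,
# leaving a pile of temporary objects for the garbage collector.
def build_slow():
    s = ""
    for p in parts:
        s += p + ","
    return s

# join computes the final size and allocates the result once.
def build_fast():
    return ",".join(parts)
```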

  5. System calls

A system call is like a function call, but it is used to communicate with the operating system kernel. Even the most trivial system call incurs a penalty because the CPU has to swap out the user land context for the kernel context and back again.

If you are using a high-level language then browsing the code for system calls is probably not going to be effective; this is where tooling comes in very handy. Even just using strace on Linux to see if a call is being made repeatedly can do the job.

A lot of the system calls you are likely to see are related to I/O, but possibly from library calls you didn’t think were doing I/O.
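As a hypothetical illustration, the sketch below writes the same data with and without buffering; running either version under strace -c (which prints a syscall summary) makes the difference in write() calls obvious.

```python
data = [b"line\n"] * 100_000

# Unbuffered: each f.write() becomes its own write() system call.
def write_unbuffered(path):
    with open(path, "wb", buffering=0) as f:
        for chunk in data:
            f.write(chunk)

# Default buffering batches the data into far fewer system calls.
def write_buffered(path):
    with open(path, "wb") as f:
        for chunk in data:
            f.write(chunk)
```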

  6. Slow CPU instructions and cache invalidation

Most of the time I have to go out of my way to get into this. Basically this is about optimising calculations you can’t avoid, for example my reversible automata renderer or this hash map.

Once you are at the point of trying to optimise your data structures to avoid loading a new cache line during a calculation, it’s probably time to consider investing in a real methodology, although I’m sure that engineers still make progress by just rewriting code they think will be slow.

The most common optimisation I have used at this level is avoiding modulo (%) and division (/) by making sizes a power of 2 and using bit shifts (>>) and bitwise AND (&) instead.

Most probably you are not going to see a 10 or 100x speed up in the overall program unless heavy computation is its main purpose.
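Here is a sketch of the power-of-two trick applied to a hypothetical hash table bucket lookup; in pure Python the interpreter overhead swamps the difference, but in lower-level languages replacing a division with a single AND is a real win.

```python
TABLE_SIZE = 1 << 16  # 65536 buckets, a power of two

def bucket_with_modulo(h):
    return h % TABLE_SIZE

def bucket_with_mask(h):
    # For power-of-two sizes, h % size == h & (size - 1),
    # turning a division into a single AND.
    return h & (TABLE_SIZE - 1)
```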

Examples

Adding an index to a CSV viewer

In this case an index was added which meant that navigating a very large CSV file went from linear time to constant. Potentially an infinite increase in performance, but practically it reduces lag time from seconds to milliseconds.
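A minimal sketch of the idea (not the viewer’s actual code, and ignoring CSV quoting for brevity): record the byte offset of every row once, then seek() straight to any row instead of re-scanning the file.

```python
def build_index(path):
    offsets = []
    with open(path, "rb") as f:
        while True:
            offsets.append(f.tell())
            if not f.readline():
                offsets.pop()  # drop the offset recorded at end of file
                break
    return offsets

def read_row(path, offsets, n):
    # Jump directly to row n instead of reading n lines to reach it.
    with open(path, "rb") as f:
        f.seek(offsets[n])
        return f.readline().decode().rstrip("\n").split(",")
```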

Generating deserialiser code for BSON in Julia

To some extent this is an example of what not to do as it only resulted in a 3-4x speed up. On long-running tasks the cost saving was clear, but it’s questionable whether I should have been (de)serialising so much data in the first place.

When this optimization method goes wrong

This method works because often once you achieve some familiarity with a program and its code base it becomes glaringly obvious where a performance issue is. Fixing it is a case of doing the same thing you have done a bunch of times before.

The problem comes when you mistake what is causing the performance issue or dramatically overestimate how much impact a change will have. Let’s say you make a change removing that “unnecessary” work and find that nothing happens or things get worse.

It could be possible that your change improved some metrics, but there was no perceptible overall change. You’re then left with a decision to reverse the change or keep going with it.

The risk is going in circles without making progress or even causing regressions.

If you are in a situation where a single radical change appears to be impossible or very costly, then you need to compound multiple small changes and use a process which systematically validates each change in a statistically sound way.

Modifying the code to test a hypothesis may not be feasible: it could be too time consuming, or you may need to optimize for one metric at a time. That means you need to instrument the system and identify bottlenecks. For example, if you think of a change that would improve instruction cache usage, you need evidence that it is a limiting factor and that your change did in fact improve it.

An even simpler way

Finally, instead of trying to locate a particular problem and fix it, just rewrite the whole application in a language where everyone is obsessed with performance.

If all of the library authors and runtime contributors think performance is important, then every interface and bit of documentation will drive you towards performance.

This of course only works if the initial language was of the slow variety and I am also joking, but at the same time if the project is very small then it is a workable solution.