How to 10x the performance of most software

I have a very large and excellent book called Systems Performance, written by Brendan Gregg, who, among other things, made Netflix faster. Opening it to a random page, I see an overview of what appears to be the TCP stack in the Linux kernel.

Near the beginning of the book there is a chapter describing various methodologies for performance tuning, starting with what Brendan calls anti-methods.

One is the random-change anti-method: looking at the code, trying stuff and manually running the program to see if the issue has gone away. Another is the ad hoc checklist method, which is not referred to as an anti-method, but is listed directly after them.

What I’m about to suggest is somewhat of a combination of these two methods and perhaps the slightest bit of more intelligent analysis. It’s essentially what most engineers who achieve a large performance gain do and it works because of the absolute state of software.

The natural dynamics of the market mean that it is usually safer to quickly write slow software than to slowly write fast software. Not always of course because sometimes the main selling point of an application or component is speed, but even then this is relative to what came before.

Note that this method does benefit massively from familiarity with the project being optimized and the components involved. The more unknowns there are the more a systematic approach will be superior to this.

Method

  1. Isolate the slow feature with a minimum of effort

This could be done with profiling or tracing, but also just using the program and seeing what goes visibly slow, uses up system memory or causes the CPU fan to spin up.

  2. Look at the code or a code trace

Open up the source code and look for slow things (see below) especially anything that happens in a loop.

  3. Rewrite the code to do less work

Can the code be rewritten to not do something or only do it once?

  4. Try the slow feature again

Does it now feel nice?
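The steps above can be sketched with a tiny example of step 3, rewriting code to do something only once. The scenario is invented for illustration: a filter that rebuilds its regex on every iteration versus one that builds it once outside the loop. (Python's `re` module actually caches compiled patterns, so the gain here is modest; in languages without such a cache the difference is much larger.)

```python
import re

lines = ["error: disk full", "ok", "error: timeout"] * 1000

# Before: the pattern is compiled (or at least looked up) on every iteration.
def count_errors_slow(lines):
    n = 0
    for line in lines:
        if re.compile(r"^error:").search(line):  # repeated work inside the loop
            n += 1
    return n

# After: the invariant work is done once, outside the loop.
ERROR_RE = re.compile(r"^error:")

def count_errors_fast(lines):
    return sum(1 for line in lines if ERROR_RE.search(line))

assert count_errors_slow(lines) == count_errors_fast(lines) == 2000
```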

Things to look for

None of these things are necessarily slow, and there is overlap between them. They are just places to start looking, assuming the cause is not so obvious that you can simply guess it.

  1. Loops and recursion

Unless a loop only ever iterates a small number of times then it should come under some suspicion. If the number of times a loop happens depends on the size of the input and the loop count grows faster than the input size grows then it should come under maximum suspicion.

Operations that would usually be considered fast suddenly become slow and dangerous if they happen inside a loop.

Using the wrong data structure or algorithm can result in unnecessary looping. For example, repeatedly searching an unsorted list or array, instead of sorting it once and then using a faster search algorithm, such as binary search, that sorted data enables.

Searching an array instead of using a hash map is another classic example. Often, though, the search itself is not the problem; doing it repeatedly within another loop is.

Sometimes introducing more loops is faster than caching the results or using a fancy data structure or algorithm. Nevertheless when you see a loop at the top level that could iterate many times this is a place to start looking.
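A minimal sketch of the hash map example, with made-up data: repeated membership tests against a list scan every element each time, while building a set once turns each lookup into a single hash probe.

```python
import random

random.seed(1)
needles = [random.randrange(100_000) for _ in range(1_000)]
haystack = list(range(10_000))

# Linear scan: each `in` test walks the list, O(n) per lookup.
hits_list = sum(1 for n in needles if n in haystack)

# Build the set once; after that every lookup is roughly O(1).
haystack_set = set(haystack)
hits_set = sum(1 for n in needles if n in haystack_set)

assert hits_list == hits_set
```

The set version pays a one-off cost to build the set, which is why it only wins when the lookup is repeated, exactly the loop-within-a-loop case described above.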

  2. I/O

In particular database queries and creating files. Databases and file systems are typically very fast, but if you, for example, repeatedly create a temporary file, copy data into it and copy it back out again before deleting it, then you could be thrashing your system.

A problem I’ve often seen with databases is not really with the database, but with ORMs (object-relational mapping) or even just the database connection library. In fact it’s not even with the ORM, it’s because the ORM is being used to extract an entire database table into memory.
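A sketch of that anti-pattern using Python's built-in sqlite3 module (the table and column names are invented for the example): fetching the whole table and filtering in application code, versus letting the database do the filtering and only sending back the rows you need.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER, active INTEGER)")
con.executemany("INSERT INTO users VALUES (?, ?)",
                [(i, int(i % 10 == 0)) for i in range(10_000)])

# Anti-pattern: pull the entire table into memory, filter in the app.
rows = con.execute("SELECT id, active FROM users").fetchall()
active_slow = [r[0] for r in rows if r[1]]

# Better: the database filters and transfers only the matching rows.
active_fast = [r[0] for r in
               con.execute("SELECT id FROM users WHERE active = 1")]

assert active_slow == active_fast
assert len(active_fast) == 1_000
```

With a real ORM the same mistake usually looks like iterating over every object in a table and checking a field in application code, when a `filter`/`where` clause would push the work to the database.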

Another problem I have seen a lot is with RPCs (remote procedure calls) and not even because of the network latency, but because the RPC transmits a huge amount of unfiltered JSON that an interpreted language struggles to parse and turn into native objects. The network may not even struggle to send that much JSON, but the memory and CPU requirements to turn it all into a language’s arrays and objects is very high.

Removing I/O altogether, for example by using an in-memory cache or by recalculating values when needed, can be faster, but often it is not. The point is that the interface between an application and the outside world is a sensitive area for performance.

  3. Synchronisation and processes

In applications with multiple threads, processes or instances on different servers, you will have points where everything needs to agree.

For example, one process will need to take a lock and check a value set by another. In fact it’s not uncommon to see a process call sleep in a loop while polling something.
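A small sketch of the alternative to sleep-in-a-loop polling, using Python's threading primitives (the worker and its result are invented for the example): blocking on an event wakes the waiter the moment the value is ready, instead of burning wakeups and adding up to a full sleep interval of latency.

```python
import threading

result = {}
ready = threading.Event()

def worker():
    result["value"] = 42   # produce the value
    ready.set()            # wake any waiters immediately

# Anti-pattern, for contrast:
#   while "value" not in result:
#       time.sleep(0.1)    # repeated wakeups, up to 100 ms extra latency

t = threading.Thread(target=worker)
t.start()
ready.wait(timeout=5)      # blocks until set() is called, no polling
t.join()

assert result["value"] == 42
```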

Problems with synchronisation can be more difficult to detect because the program may not be using a lot of CPU time. It may spend a lot of time sleeping waiting for a lock.

Locks and calls to sleep can be hidden in library functions or appear in unexpected places. Problems with synchronisation may only appear in production when running at capacity.

Creating new threads or processes is relatively expensive compared to doing calculations. Something to look out for are external processes being created in a loop. Some languages have a thread pool built in, so seeing a thread being created is not necessarily a bad sign.

  4. Memory allocation and copying memory

The worst crimes involving memory allocation I have seen were done during I/O. However in dynamic languages it can creep in anywhere and cause huge slow downs.

During calculations or something like building a string, you may find that lots of intermediate results are created and they each require temporary allocations. Then the garbage collector has to free all of those objects.
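The string-building case can be sketched like this (note that CPython sometimes optimises `+=` on strings in place, so the gap varies; in many runtimes each `+=` really does allocate and copy a new string):

```python
parts = [str(i) for i in range(1_000)]

# Each += may allocate a brand-new string and copy everything so far.
slow = ""
for p in parts:
    slow += p

# join computes the total length first and allocates the result once.
fast = "".join(parts)

assert slow == fast
```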

It varies from language to language which objects cause an allocation and how good the runtime is at optimising out intermediate results. Sometimes it can detect that an object can be modified in place instead of creating a copy.

  5. System calls

A system call is like a function call, but it is used to communicate with the operating system kernel. Even the most trivial system call incurs a penalty because the CPU has to swap out the user land context for the kernel context and back again.

If you are using a high-level language then browsing the code for system calls is probably not going to be effective; this is where tooling comes in very handy. Even just using strace on Linux to see if a call is being made repeatedly can do the job.

A lot of the system calls you are likely to see are related to I/O, but possibly from library calls you didn’t think are doing I/O.
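A contrived sketch of how I/O library calls map onto system calls: writing one byte at a time with the unbuffered `os` interface makes one write(2) call per byte, while a buffered file object batches the same data into a handful of calls. Running either variant under `strace -c` would show the difference in the write count.

```python
import os
import tempfile

data = b"x" * 10_000

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "out")

    # One write(2) system call per byte: 10,000 kernel round trips.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    for i in range(len(data)):
        os.write(fd, data[i:i + 1])
    os.close(fd)

    # Buffered file object: Python batches this into very few system calls.
    with open(path, "wb") as f:
        f.write(data)

    assert os.path.getsize(path) == len(data)
```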

  6. Slow CPU instructions and cache invalidation

Most of the time I have to go out of my way to get into this. Basically this is about optimising calculations you can’t avoid. For example, my reversible automata renderer or this hash map.

Once you are at the point of trying to optimise your data structures to avoid loading a new cache line during a calculation, then it’s probably time to consider investing in a real methodology, although I’m sure that engineers still make progress by just rewriting code they think will be slow.

The most common optimisation I have used on this level is avoiding modulo (%) and division (/) by making everything a power of 2 and using bit shifts (>>) and bitwise AND (&) instead.
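The power-of-two trick looks like this (shown in Python for clarity, though it matters most in compiled code, and many compilers already apply it when the divisor is a constant):

```python
N = 8           # bucket count, must be a power of two
SHIFT = 3       # log2(N)
MASK = N - 1    # 0b111

for x in (0, 5, 8, 13, 129, 4096):
    assert x % N == x & MASK     # remainder via bitwise AND
    assert x // N == x >> SHIFT  # quotient via right shift
```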

Most probably you are not going to see a 10 or 100x speed-up in the overall program unless heavy computation is the program’s main purpose.

When this method goes wrong

This method works because often once you achieve some familiarity with a program and its code base it becomes glaringly obvious where a performance issue is. Fixing it is a case of doing the same thing you have done a bunch of times before.

The problem comes when you mistake what is causing the performance issue. You make a change removing that “unnecessary” work and find that nothing happens, or things get worse, or at least you can’t tell whether anything got better.

It is possible that your change improved some metrics, but there was no perceptible overall change. You’re then left with a decision: reverse the change or keep going with it.

The risk is going in circles without making progress or even causing regressions. If the performance issue is not something you have seen before then methodically eliminating possibilities may be the only way to discover what it is.

An even simpler method

Finally, instead of trying to locate a particular problem and fix it, just rewrite the whole application in a language where everyone is obsessed with performance.

If all of the library authors and runtime contributors think performance is important, then every interface and bit of documentation will drive you towards performance.

This of course only works if the initial language was of the slow variety, and I am also joking, but at the same time, if the project is very small then it is a workable solution.