How I may help you
In addition to the below I’m happy to discuss any of the topics on my website and what brought you here!
Please drop me an e-mail at io@richiejp.com to set up a call.
Systems
Something I have learned in a very painful way is that systems skill is so rare that few even recognise it. It’s not well defined and it spans multiple categories. It’s not embedded development, but can include it. It’s not kernel, database, cryptographic or web server development, but can include any of them.
It encompasses all the code that is needed so that the application code can run. Where is the line between system and application? There isn’t one, it is a continuum and when you zoom in on the boundary zone it disappears.
Systems developers do every type of coding there is. The exception is domain-specific code that requires some scientific or artistic ability from a deep field. Systems developers can write assembly, they can implement algorithms and they can write UI code (albeit perhaps at gunpoint).
A systems developer can see the entire software stack from hardware to user interface code. They can architect and debug any part of the stack. Special emphasis on “debug”, this is perhaps the rarest skill of all amongst software developers.
A systems developer cannot necessarily do all of this instantly. They can, however, bring it all together in a reasonable time. An expert systems developer knows when to make decisions liberally and when to slow down and be careful.
An expert systems developer dynamically adjusts their approach based on the security and performance requirements of the area of code in question. They are aware of how their own and others’ knowledge varies across the code.
I am an expert systems developer.
Technologies and experience
Below are some technologies I have worked with.
OS Kernels
I know Linux, having written an exploit for it and spent 6+ years writing kernel tests for SUSE. I also have knowledge of more niche systems such as FreeBSD and the Nanos unikernel, which I contributed to.
A.I. and Kubernetes
I wrote a large part of an operator (in Go) which serves AI models (mainly LLMs) using software such as LocalAI, vLLM, DeepSpeed-MII and Triton.
A.I. and databases
I added an in-memory vector store to LocalAI which can be used to perform vector search on embeddings.
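To illustrate what vector search on embeddings involves, here is a minimal sketch of a brute-force in-memory store ranked by cosine similarity. It is not the LocalAI implementation; the `Store`, `Match`, `Add` and `Find` names are hypothetical and chosen only for this example.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// Store is a hypothetical in-memory vector store (not LocalAI's API).
// keys[i] is the embedding under which values[i] was stored.
type Store struct {
	keys   [][]float32
	values []string
}

// Match is one search result.
type Match struct {
	Value string
	Score float64
}

// Add stores a value under its embedding.
func (s *Store) Add(embedding []float32, value string) {
	s.keys = append(s.keys, embedding)
	s.values = append(s.values, value)
}

// cosine returns the cosine similarity of a and b. It returns 0 when the
// lengths differ or when either vector has zero magnitude.
func cosine(a, b []float32) float64 {
	if len(a) != len(b) {
		return 0
	}
	var dot, na, nb float64
	for i := range a {
		x, y := float64(a[i]), float64(b[i])
		dot += x * y
		na += x * x
		nb += y * y
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// Find compares the query against every stored key (brute force) and
// returns at most topK matches, best first.
func (s *Store) Find(query []float32, topK int) []Match {
	if topK < 0 {
		topK = 0
	}
	matches := make([]Match, 0, len(s.keys))
	for i, key := range s.keys {
		matches = append(matches, Match{Value: s.values[i], Score: cosine(query, key)})
	}
	sort.SliceStable(matches, func(i, j int) bool {
		return matches[i].Score > matches[j].Score
	})
	if topK < len(matches) {
		matches = matches[:topK]
	}
	return matches
}

func main() {
	var s Store
	s.Add([]float32{1, 0}, "cats")
	s.Add([]float32{0, 1}, "kernels")
	s.Add([]float32{0.9, 0.1}, "kittens")
	for _, m := range s.Find([]float32{1, 0}, 2) {
		fmt.Printf("%s %.3f\n", m.Value, m.Score)
	}
}
```

A linear scan like this is usually adequate for small collections; larger ones tend to need an approximate nearest neighbour index instead.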
Web
Having developed Gripe.sh and DoBu.uk I am familiar with the following.
- Svelte with TypeScript and NodeJS
- TailwindCSS
- Redis/KeyDB
- Fly.io
- Containers
Protocols
I partially implemented HTTP/2 in Zig including HPACK and I wrote a minimal MsgPack implementation for the Linux Test Executor prototype. In addition I have some knowledge of Bluetooth, WiFi, Ethernet, TCP/IP, UDP, TLS, ASN.1, NFS etc.
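As a small taste of what HPACK involves, below is a sketch of its prefixed integer representation as described in RFC 7541, section 5.1. It is written in Go purely for illustration and is not taken from the Zig implementation mentioned above; the `encodeInt` and `decodeInt` names are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	errTruncated = errors.New("hpack: truncated integer")
	errOverflow  = errors.New("hpack: integer too large")
)

// encodeInt appends value to dst as an HPACK integer with an N-bit prefix
// (prefixBits, 1..8). The caller passes the flag bits of the first byte in
// first, with its low prefixBits bits set to zero.
func encodeInt(dst []byte, first byte, prefixBits uint, value uint64) []byte {
	limit := uint64(1)<<prefixBits - 1
	if value < limit {
		// Small values fit entirely in the prefix.
		return append(dst, first|byte(value))
	}
	// Otherwise fill the prefix with ones and emit the remainder in
	// 7-bit groups, least significant first, with a continuation bit.
	dst = append(dst, first|byte(limit))
	value -= limit
	for value >= 128 {
		dst = append(dst, byte(value&0x7f)|0x80)
		value >>= 7
	}
	return append(dst, byte(value))
}

// decodeInt reads an HPACK integer with an N-bit prefix from src and
// returns the value and the number of bytes consumed.
func decodeInt(src []byte, prefixBits uint) (uint64, int, error) {
	if len(src) == 0 {
		return 0, 0, errTruncated
	}
	limit := uint64(1)<<prefixBits - 1
	value := uint64(src[0]) & limit
	if value < limit {
		return value, 1, nil
	}
	var shift uint
	for i := 1; i < len(src); i++ {
		if shift > 56 {
			// Reject encodings that would not fit in a uint64.
			return 0, 0, errOverflow
		}
		b := src[i]
		value += uint64(b&0x7f) << shift
		shift += 7
		if b&0x80 == 0 {
			return value, i + 1, nil
		}
	}
	return 0, 0, errTruncated
}

func main() {
	// 1337 with a 5-bit prefix is one of the worked examples in the RFC.
	enc := encodeInt(nil, 0, 5, 1337)
	val, n, err := decodeInt(enc, 5)
	fmt.Printf("% x -> %d (%d bytes, err=%v)\n", enc, val, n, err)
}
```

The same routine is reused throughout HPACK for table indices and string lengths, which is why it takes the prefix width as a parameter.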
Languages
This is not the most important thing if you are looking for someone like me. Learning a new language is relatively easy compared to other challenges in systems development.
Having said that, the three major languages I am most familiar with are
- Go
- C
- JavaScript/TypeScript
Less well known are
- Julia
- Zig
To name a few; in the past I have used
- Go
- Perl
- C++ (Qt)
- Rust
- C# (8+ years ago)
- Python
- Lisp
Computer Science
I am aware of cache efficient data structures and big O notation. I know the basics of compiler and operating system theory. There are problems which require theory (e.g. random number generation) and I can find the relevant material.
Projects
Here are some brief case studies of projects I have worked on and particular problems that I solved.
Systems
Kubernetes A.I operator
A Kubernetes operator which manages A.I. deployments. Not an A.I. which manages Kubernetes, at least for now anyway.
Linux Test Project
Mostly a collection of Linux kernel tests. I have made many contributions to this, including a lot of code review, and I leveraged it within SUSE to test the Linux kernel.
Vulnerability (CVE) testing
- Problem: Fixes for bugs which cause vulnerabilities are sometimes broken
- Solution: Create reliable reproducers for those bugs which validate the fixes
- Benefit: Some bad fixes were detected, which had the second-order effect of highlighting broken procedures in how fixes are backported.
- Skills: Linux Kernel, tracing, debugging, testing, C
Creating reproducers is not a new concept. However I rebooted efforts to get more reproducers into the LTP. This encouraged others to contribute as well.
FuzzSync race exposition library
This leads on from the previous one.
- Problem: Reproducing some bugs requires reliably reproducing a data race. The usual methods of doing this are either resource intensive or require a particular kernel.
- Solution: Create a library which makes that easier
- Benefit: We can easily reproduce most bugs involving a data race without resorting to tricks that require a particular kernel config.
CGroup API
Control groups have emerged as a critical kernel interface. They are used by container, VM and system managers. As part of a larger effort to increase test coverage of them I increased LTP’s support.
- Problem: It’s difficult to write tests which interact with both V1 and V2 of the kernel CGroup API. It’s also difficult to discover the existing CGroup setup created by, for example, systemd.
- Solution: Create a compatibility layer which abstracts controller discovery, CGroup creation and interactions.
- Benefit: It is far easier now to write tests which interact with CGroups, for example cfs_bandwidth01 which I wrote. More importantly it encourages others to write tests interacting with CGroups.
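To give a feel for the discovery half of that problem, here is a rough sketch that finds existing CGroup mounts by parsing `/proc/self/mountinfo`. It is written in Go for illustration and is not the LTP code, which is C. The `Mount` type, the `parseMountinfo` function and the fixed list of V1 controller names are assumptions made for this example.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// Mount describes one CGroup hierarchy found in mountinfo (hypothetical type).
type Mount struct {
	Version     int      // 1 or 2
	Path        string   // mount point
	Controllers []string // V1 only, taken from the super options
}

// v1Controllers is a simplification: a fixed set of well-known V1
// controller names used to pick controllers out of the mount options.
var v1Controllers = map[string]bool{
	"cpu": true, "cpuacct": true, "cpuset": true, "memory": true,
	"devices": true, "freezer": true, "net_cls": true, "net_prio": true,
	"blkio": true, "perf_event": true, "hugetlb": true, "pids": true,
	"rdma": true,
}

// parseMountinfo extracts CGroup mounts from the text of a mountinfo file.
// Each line has six fixed fields, zero or more optional fields, a "-"
// separator, then the filesystem type, mount source and super options.
func parseMountinfo(data string) []Mount {
	var mounts []Mount
	for _, line := range strings.Split(data, "\n") {
		fields := strings.Fields(line)
		sep := -1
		for i := 6; i < len(fields); i++ {
			if fields[i] == "-" {
				sep = i
				break
			}
		}
		if sep < 0 || sep+3 >= len(fields) {
			continue // blank or malformed line
		}
		fsType, superOpts := fields[sep+1], fields[sep+3]
		switch fsType {
		case "cgroup2":
			// On V2 the enabled controllers are listed in the
			// cgroup.controllers file below the mount point instead.
			mounts = append(mounts, Mount{Version: 2, Path: fields[4]})
		case "cgroup":
			m := Mount{Version: 1, Path: fields[4]}
			for _, opt := range strings.Split(superOpts, ",") {
				if v1Controllers[opt] {
					m.Controllers = append(m.Controllers, opt)
				}
			}
			mounts = append(mounts, m)
		}
	}
	return mounts
}

func main() {
	data, err := os.ReadFile("/proc/self/mountinfo")
	if err != nil {
		fmt.Println("cannot read mountinfo:", err)
		return
	}
	for _, m := range parseMountinfo(string(data)) {
		fmt.Printf("v%d %s %v\n", m.Version, m.Path, m.Controllers)
	}
}
```

A real compatibility layer also has to handle hybrid setups where V1 and V2 hierarchies are mounted side by side, and it has to create and clean up its own groups; this sketch covers none of that.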
Sparse static analysis
- Problem: We encounter repetitive mistakes during review especially around LTP library usage
- Solution: Implement our own C static analysis tool based on Sparse. So far only three checks have been implemented.
- Benefit: Three fewer problems around improper usage of the API. A better experience for contributors and maintainers.
Improving the new user/contributor experience
- Problem: It wasn’t immediately clear how to run the tests or write a new one
- Solution: Add a quick start and other documentation
- Benefit: More new users and contributors assuming it wasn’t a coincidence
eBPF testing and more
I have introduced other areas of testing which follow the same pattern as above.
- Problem: It’s difficult to test some kernel feature consistently across different systems.
- Solution: Introduce some supporting code into the LTP test framework
- Benefit: Increased test coverage of the kernel and better feedback for kernel developers
Linux Kernel
I have found and fixed a number of bugs in the Linux kernel. The reasons for doing this are rather roundabout.
- Problem: I/we don’t fully understand the challenges in fixing a kernel bug
- Solution: Personally fix some kernel bugs
- Benefit: First-hand experience of what is required or nice to have when fixing a bug, and of the challenges of testing during development. This leads to better testing.
CAN and SLIP
Found and fixed some issues in CAN and SLIP which are potentially exploitable:
- b9258a2cece4 slcan: Don’t transmit uninitialized stack data in padding
- 0ace17d56824 can, slip: Protect tty->disc_data in write_wakeup and close with RCU
vsock
- 4c1e34c0dbff vsock: Enable y2038 safe timeval for timeout
- 685c3f2fba29 vsock: Refactor vsock_*_getsockopt to resemble sock_getsockopt
memcg/slab
I found and suggested a fix for a bug in the memory CGroup. It was refused in favor of a more general fix.
mm: memcg/slab: Stop reparented obj_cgroups from charging root
OpenQA
Full operating system testing framework.
New QEMU backend
The QEMU backend manages VMs used for testing. OpenQA makes use of snapshotting to revert a VM to a good state (anchor point) when something fails and continue testing.
- Problem: Snapshotting was slow and would fail under many circumstances. Additionally there were many smaller issues, such as exporting UEFI firmware variables. This made some testing impossible.
- Solution: Rewrite the QEMU backend
- Benefits: Our test matrix could be expanded by a considerable amount. Increasing coverage and finding more problems.
Real serial console
- Problem: Originally being purely a GUI testing framework, serial consoles were supported in a very roundabout way
- Solution: Directly support serial consoles (primarily virtio in QEMU)
- Benefits: Dramatic decrease in test runtime which means developers get test results faster and have more time to react to failures. It also decreased resource usage allowing more testing.
LTP test runner
- Problem: Only some LTP tests were being run, many were failing and determining whether it was due to a product bug or test bug was difficult
- Solution: Write a test runner for LTP within OpenQA that displayed test results well and provided debugging aids.
- Benefits: A big expansion in test coverage, kernel bugs were reported with useful info and LTP tests could be flagged for fixing.
JDP
Now defunct data analysis framework in Julia which I planned to use to create an “ontology” (to steal Palantir’s wording) that would allow bug reports, bug fixes, test failures and logs from various sources to be automatically cross referenced.
- Problem: An LTP test has failed; it’s not immediately obvious what all the relevant context may be.
- Solution: Use whatever algorithms to scour all our data and find any information that may be relevant.
- Benefits: It automatically matched test failures with existing bug reports saving a lot of review time
Web
Gripe.sh
A way to efficiently record time wasting events. It’s firstly an experiment to learn about semantic search and Go. Secondly I wanted to create a web app rapidly and refine my stack for doing so. Please see Gripe.sh for a video.
DoBu.uk
A SaaS which I have written about extensively and candidly. Both here and on IndieHackers.
To summarise:
- Problem: christineharp.co.uk needed to display her availability without taking bookings.
- Solution: I worked with her to create a product which does that.
- Benefit: She now has a beautiful calendar
I’m obsessed with latency, download size and minimising attack surface. These are not things that are very important to the app except for the core calendar component. If I started again I would indulge these obsessions by writing the core service within strict constraints. Then create all of the CRUD in nocode or similar.
While creating this I got into some interesting problems. For example:
- Problem: I wanted to run my NodeJS app on Nanos Unikernel because it results in a very small and efficient VM with limited attack surface. However the version of NodeJS I wanted to use was trying to use the clone3 system call.
- Solution: Implement clone3 in the Nanos kernel
- Benefit: I now have the option of using Nanos with NodeJS
OK, I hold my hands up, I didn’t need to do that. However if you are operating at scale, these things can bring a lot of benefit because you are not deploying and running unnecessary code.