What I can help with
Here are a few examples of problems I have solved and their results.
Preventing bugs in the Linux kernel that could affect billions of devices. I implemented a race condition reproducer in the Linux Test Project, cutting the detection of critical data race bugs from an hours-long process to seconds.
Reducing the time to execute SUSE’s kernel test suite from over an hour to minutes, by implementing a real serial console and rewriting the QEMU virtualization backend for OpenQA.
Saving performers and their customers several minutes each time they want to check availability (it adds up). I created a calendar SaaS that automatically calculates and displays an entertainer’s availability from their Google calendar.
How I may help you
Please drop me an e-mail at io@richiejp.com so that we can discuss options and set up a call.
You’ll be talking directly to an engineer who can do the work, not an agent or other intermediary.
The process I’d suggest is:
Synchronise: Discuss your problem or ambition and make sure we are speaking the same language.
Plan: Create a brief outline of the project. If it is a longer project then set some initial milestones.
Agreement: Create a contract with the initial milestones or the full work to be completed.
Technologies and experience
Below are some technologies I have worked with. This is not an exhaustive list.
Operating Systems
I know Linux, having written an exploit for it and spent 6+ years writing kernel tests for SUSE. I also have knowledge of more niche systems such as FreeBSD and the Nanos unikernel, which I contributed to.
A.I. and Kubernetes
I wrote a large part of an operator (in Go) which serves AI models (mainly LLMs) using software such as LocalAI, vLLM, DeepSpeed-MII and Triton.
Containers
I did extensive testing of the technologies underlying containers in the Linux kernel, such as namespaces, virtual networking and CGroups. I also created Ayup, which builds containers on the fly using Buildkit.
Web
Having developed Gripe.sh and DoBu.uk I am familiar with the following.
- Svelte with TypeScript and NodeJS
- TailwindCSS
- Redis/KeyDB
- Fly.io
- Containers
Protocols
I partially implemented HTTP/2 in Zig including HPACK and I wrote a minimal MsgPack implementation for the Linux Test Executor prototype. In addition I have some knowledge of Bluetooth, WiFi, Ethernet, TCP/IP, UDP, TLS, ASN.1, NFS etc.
I used gRPC extensively in Ayup.
Languages
This is not the most important thing if you are looking for someone like me. Learning a new language is relatively easy compared to other challenges in systems development.
Having said that, the three major languages I am most familiar with are
- Go
- C
- JavaScript/TypeScript
Less well known are
- Julia
- Zig
In the past I have also used, to name a few
- Perl
- C++ (Qt)
- Rust
- C# (8+ years ago)
- Python
- Lisp
Computer Science
I am aware of cache efficient data structures and big O notation. I know the basics of compiler and operating system theory. There are problems which require theory (e.g. random number generation) and I can find the relevant material.
Projects
Here are some brief case studies of projects I have worked on and particular problems that I solved.
Systems
Ayup: Automatic build and deployment
Allows you to push code to a remote server where it is automatically built and deployed using Buildkit. Written in Go with some Python for generating code using an LLM.
Kubernetes A.I. operator
A Kubernetes operator which manages A.I. deployments. Not an A.I. which manages Kubernetes, at least for now anyway.
Linux Test Project
Mostly a collection of Linux kernel tests. I have made many contributions to this, including a lot of code review, and leveraged it within SUSE to test the Linux kernel.
Vulnerability (CVE) testing
- Problem: Fixes for bugs which cause vulnerabilities are sometimes broken
- Solution: Create reliable reproducers for those bugs which validate the fixes
- Benefit: Some bad fixes were detected, which had the second-order effect of highlighting broken procedures in how fixes are backported.
- Skills: Linux Kernel, tracing, debugging, testing, C
Creating reproducers is not a new concept. However, I rebooted efforts to get more reproducers into the LTP, which encouraged others to contribute as well.
FuzzSync race exposition library
This leads on from the previous one.
- Problem: Reproducing some bugs requires reliably reproducing a data race. The usual methods of doing this are either resource intensive or require a particular kernel.
- Solution: Create a library which makes that easier
- Benefit: We can easily reproduce most bugs involving a data race without resorting to tricks that require a particular kernel config.
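The core idea can be sketched in a few lines. This is illustrative Python, not the C library in the LTP: the names run_race, side_a and side_b are made up for the example, and it only shows the basic step of releasing both sides of a race at the same moment, over and over, so that their critical sections have a chance to overlap.

```python
import threading


def run_race(side_a, side_b, attempts):
    """Run side_a(i) and side_b(i) in lockstep for i in range(attempts).

    Hypothetical helper: a barrier releases both threads at the start of
    every attempt so their critical sections begin close together.
    """
    start = threading.Barrier(2)
    finish = threading.Barrier(2)

    def worker(fn):
        for i in range(attempts):
            start.wait()   # both sides enter the race window together
            fn(i)
            finish.wait()  # neither side starts the next attempt early

    thread_b = threading.Thread(target=worker, args=(side_b,))
    thread_b.start()
    worker(side_a)         # side A runs in the calling thread
    thread_b.join()


# Usage: record which attempts each side took part in.
calls = {"a": [], "b": []}
run_race(calls["a"].append, calls["b"].append, attempts=100)
```

In a real reproducer the two sides would be the system calls that race with each other, and the loop would stop as soon as the bug is observed.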
CGroup API
Control groups have emerged as a critical kernel interface. They are used by container, VM and system managers. As part of a larger effort to increase test coverage of them, I extended LTP’s support.
- Problem: It’s difficult to write tests which interact with both Kernel CGroup APIs V1 and V2. It is also difficult to discover the existing CGroup setup created by, for example, systemd.
- Solution: Create a compatibility layer which abstracts controller discovery, CGroup creation and interactions.
- Benefit: It is far easier now to write tests which interact with CGroups, for example cfs_bandwidth01 which I wrote. More importantly it encourages others to write tests interacting with CGroups.
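As a rough illustration of the discovery step only, the sketch below parses text in the format of /proc/self/mountinfo and reports where the V2 hierarchy and any V1 controllers are mounted. It is illustrative Python rather than the LTP's C code; the function name, the controller list and the sample mount table are assumptions made for the example.

```python
# Assumed, non-exhaustive list of V1 controller names to look for.
KNOWN_CONTROLLERS = {"cpu", "cpuacct", "cpuset", "memory", "pids", "blkio"}


def find_cgroup_mounts(mountinfo_text):
    """Return (v2_mount_point, {controller: v1_mount_point})."""
    v2_mount = None
    v1_mounts = {}
    for line in mountinfo_text.splitlines():
        # Fields before " - " describe the mount, fields after describe
        # the filesystem: type, source and super block options.
        left, sep, right = line.partition(" - ")
        left_fields, right_fields = left.split(), right.split()
        if not sep or len(left_fields) < 5 or len(right_fields) < 3:
            continue
        mount_point = left_fields[4]
        fs_type, super_opts = right_fields[0], right_fields[2]
        if fs_type == "cgroup2":
            v2_mount = mount_point
        elif fs_type == "cgroup":
            # V1 lists the attached controllers in the super options.
            for opt in super_opts.split(","):
                if opt in KNOWN_CONTROLLERS:
                    v1_mounts[opt] = mount_point
    return v2_mount, v1_mounts


# Made-up sample of a hybrid V1/V2 setup.
SAMPLE = """\
30 23 0:26 / /sys/fs/cgroup/unified rw,nosuid shared:9 - cgroup2 cgroup2 rw
34 23 0:30 / /sys/fs/cgroup/cpu,cpuacct rw,nosuid shared:15 - cgroup cgroup rw,cpu,cpuacct
35 23 0:31 / /sys/fs/cgroup/memory rw,nosuid shared:16 - cgroup cgroup rw,memory
22 1 8:1 / / rw,relatime shared:1 - ext4 /dev/sda1 rw
"""

v2_mount, v1_mounts = find_cgroup_mounts(SAMPLE)
```

A test library can then use the result to decide which hierarchy to create its test group in, so individual tests need not care whether a controller lives on V1 or V2.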
Sparse static analysis
- Problem: We encounter repetitive mistakes during review especially around LTP library usage
- Solution: Implement our own C static analysis tool based on Sparse. So far only 3 checks have been implemented
- Benefit: 3 fewer problems around improper usage of the API. A better experience for contributors and maintainers.
Improving the new user/contributor experience
- Problem: It wasn’t immediately clear how to run the tests or write a new one
- Solution: Add a quick start and other documentation
- Benefit: More new users and contributors, assuming it wasn’t a coincidence
eBPF testing and more
I have introduced other areas of testing which follow the same pattern as above.
- Problem: It’s difficult to test some kernel feature consistently across different systems.
- Solution: Introduce some supporting code into the LTP test framework
- Benefit: Increased test coverage of the kernel and better feedback for kernel developers
Linux Kernel
I have found and fixed a number of bugs in the Linux kernel. The reasons for doing this are rather roundabout.
- Problem: I/we don’t fully understand the challenges in fixing a kernel bug
- Solution: Personally fix some kernel bugs
- Benefit: First-hand experience of what is required or nice to have when fixing a bug, and of the challenges of testing during development, leading to better testing.
CAN and SLIP
Found and fixed some issues in CAN and SLIP which are potentially exploitable:
- b9258a2cece4 slcan: Don’t transmit uninitialized stack data in padding
- 0ace17d56824 can, slip: Protect tty->disc_data in write_wakeup and close with RCU
vsock
- 4c1e34c0dbff vsock: Enable y2038 safe timeval for timeout
- 685c3f2fba29 vsock: Refactor vsock_*_getsockopt to resemble sock_getsockopt
memcg/slab
I found and suggested a fix for a bug in the memory CGroup. It was refused in favor of a more general fix.
mm: memcg/slab: Stop reparented obj_cgroups from charging root
OpenQA
Full operating system testing framework.
New QEMU backend
The QEMU backend manages VMs used for testing. OpenQA makes use of snapshotting to revert a VM to a good state (anchor point) when something fails and continue testing.
- Problem: Snapshotting was slow and would fail under many circumstances. Additionally there were many smaller issues, such as exporting UEFI firmware variables. This made some testing impossible.
- Solution: Rewrite the QEMU backend
- Benefits: Our test matrix could be expanded by a considerable amount, increasing coverage and finding more problems.
Real serial console
- Problem: Originally being purely a GUI testing framework, serial consoles were supported in a very roundabout way
- Solution: Directly support serial consoles (primarily virtio in QEMU)
- Benefits: Dramatic decrease in test runtime which means developers get test results faster and have more time to react to failures. It also decreased resource usage allowing more testing.
LTP test runner
- Problem: Only some LTP tests were being run, many were failing and determining whether it was due to a product bug or test bug was difficult
- Solution: Write a test runner for LTP within OpenQA that displayed test results well and provided debugging aids.
- Benefits: A big expansion in test coverage, kernel bugs were reported with useful info and LTP tests could be flagged for fixing.
JDP
A now-defunct data analysis framework in Julia which I planned to use to create an “ontology” (to steal Palantir’s wording) that would allow bug reports, bug fixes, test failures and logs from various sources to be automatically cross-referenced.
- Problem: An LTP test has failed; it’s not immediately obvious what all the relevant context may be.
- Solution: Use whatever algorithms work to scour all our data and find any information that may be relevant.
- Benefits: It automatically matched test failures with existing bug reports, saving a lot of review time
Web
Gripe.sh
A way to efficiently record time-wasting events. It’s firstly an experiment to learn about semantic search and Go. Secondly, I wanted to create a web app rapidly and refine my stack for doing so. Please see Gripe.sh for a video.
DoBu.uk
A SaaS which I have written about extensively and candidly, both here and on IndieHackers.
To summarise:
- Problem: christineharp.co.uk needed to display her availability without taking bookings.
- Solution: I worked with her to create a product which does that.
- Benefit: She now has a beautiful calendar
I’m obsessed with latency, download size and minimising attack surface. These are not things that are very important to the app except for the core calendar component. If I started again I would indulge these obsessions by writing the core service within strict constraints, then create all of the CRUD in nocode or similar.
While creating this I got into some interesting problems. For example:
- Problem: I wanted to run my NodeJS app on Nanos Unikernel because it results in a very small and efficient VM with limited attack surface. However the version of NodeJS I wanted to use was trying to use the clone3 system call.
- Solution: Implement clone3 in the Nanos kernel
- Benefit: I now have the option of using Nanos with NodeJS
OK, I hold my hands up, I didn’t need to do that. However, if you are operating at scale, these things can bring a lot of benefit because you are not deploying and running unnecessary code.