Software engineering services

What I can help with

Here are a few examples of problems I have solved and their results.

How I may help you

Please drop me an e-mail at io@richiejp.com so that we can discuss options and set up a call.

You’ll be talking directly to an engineer who can do the work, not an agent or other intermediary.

The process I’d suggest is:

  1. Synchronise: Discuss your problem or ambition and make sure we are speaking the same language.

  2. Plan: Create a brief outline of the project. If it is a longer project then set some initial milestones.

  3. Agreement: Create a contract with the initial milestones or the full work to be completed.

Technologies and experience

Below are some technologies I have worked with. This is not an exhaustive list.

Operating Systems

I know Linux, having written an exploit for it and spent 6+ years writing kernel tests for SUSE. I also have knowledge of more niche systems such as FreeBSD and the Nanos unikernel, which I contributed to.

A.I. and Kubernetes

I wrote a large part of an operator (in Go) which serves AI models (mainly LLMs) using software such as LocalAI, vLLM, DeepSpeed-MII and Triton.

Containers

I did extensive testing of the technologies underlying containers in the Linux kernel, such as namespaces, virtual networking and CGroups. I also created Ayup, which builds containers on the fly using Buildkit.

Web

Having developed Gripe.sh and DoBu.uk I am familiar with the following.

  • Svelte with TypeScript and NodeJS
  • TailwindCSS
  • Redis/KeyDB
  • Fly.io
  • Containers

Protocols

I partially implemented HTTP/2 in Zig, including HPACK, and I wrote a minimal MsgPack implementation for the Linux Test Executor prototype. In addition, I have some knowledge of Bluetooth, WiFi, Ethernet, TCP/IP, UDP, TLS, ASN.1, NFS, etc.

I used gRPC extensively in Ayup.

Languages

This is not the most important thing if you are looking for someone like me. Learning a new language is relatively easy compared to other challenges in systems development.

Having said that, the three major languages I am most familiar with are

  • Go
  • C
  • JavaScript/TypeScript

Less well known are

  • Julia
  • Zig

In the past I have used, to name a few,

  • Perl
  • C++ (Qt)
  • Rust
  • C# (8+ years ago)
  • Python
  • Lisp

Computer Science

I am aware of cache efficient data structures and big O notation. I know the basics of compiler and operating system theory. There are problems which require theory (e.g. random number generation) and I can find the relevant material.

Projects

Here are some brief case studies of projects I have worked on and particular problems that I solved.

Systems

Ayup: Automatic build and deployment

Allows you to push code to a remote server where it is automatically built and deployed using Buildkit. Written in Go with some Python for generating code using an LLM.

Kubernetes A.I. operator

A Kubernetes operator which manages A.I. deployments. Not an A.I. which manages Kubernetes, at least for now anyway.

Linux Test Project

Mostly a collection of Linux kernel tests. I have made many contributions to this, including a lot of code review, and leveraged it within SUSE to test the Linux kernel.

Vulnerability (CVE) testing

  • Problem: Fixes for bugs which cause vulnerabilities are sometimes broken
  • Solution: Create reliable reproducers for those bugs which validate the fixes
  • Benefit: Some bad fixes were detected, which had the second-order effect of highlighting broken procedures in how fixes are backported.
  • Skills: Linux Kernel, tracing, debugging, testing, C

Creating reproducers is not a new concept. However, I rebooted efforts to get more reproducers into the LTP. This encouraged others to contribute as well.

FuzzSync race exposition library

This leads on from the previous one.

  • Problem: Reproducing some bugs requires reliably reproducing a data race. The usual methods of doing this are either resource intensive or require a particular kernel.
  • Solution: Create a library which makes that easier
  • Benefit: We can easily reproduce most bugs involving a data race without resorting to tricks that require a particular kernel config.

CGroup API

Control groups have emerged as a critical kernel interface. They are used by container, VM and system managers. As part of a larger effort to increase test coverage of them I increased LTP’s support.

  • Problem: It’s difficult to write tests which interact with both V1 and V2 of the kernel’s CGroup API. It’s also difficult to discover the existing CGroup setup created by, e.g., systemd.
  • Solution: Create a compatibility layer which abstracts controller discovery, CGroup creation and interactions.
  • Benefit: It is far easier now to write tests which interact with CGroups, for example cfs_bandwidth01 which I wrote. More importantly it encourages others to write tests interacting with CGroups.

Sparse static analysis

  • Problem: We encounter repetitive mistakes during review, especially around LTP library usage.
  • Solution: Implement our own C static analysis tool based on Sparse. So far only 3 checks have been implemented.
  • Benefit: 3 fewer problems around improper usage of the API. A better experience for contributors and maintainers.

Improving the new user/contributor experience

  • Problem: It wasn’t immediately clear how to run the tests or write a new one
  • Solution: Add a quick start and other documentation
  • Benefit: More new users and contributors, assuming it wasn’t a coincidence

eBPF testing and more

I have introduced other areas of testing which follow the same pattern as above.

  • Problem: It’s difficult to test some kernel feature consistently across different systems.
  • Solution: Introduce some supporting code into the LTP test framework
  • Benefit: Increased test coverage of the kernel and better feedback for kernel developers

Linux Kernel

I have found and fixed a number of bugs in the Linux kernel. The reasons for doing this are rather roundabout.

  • Problem: I/we don’t fully understand the challenges in fixing a kernel bug
  • Solution: Personally fix some kernel bugs
  • Benefit: First-hand experience of what is required or nice to have when fixing a bug, and of the challenges of testing during development, leading to better testing.

CAN and SLIP

Found and fixed some issues in CAN and SLIP which are potentially exploitable:

  • b9258a2cece4 slcan: Don’t transmit uninitialized stack data in padding
  • 0ace17d56824 can, slip: Protect tty->disc_data in write_wakeup and close with RCU

vsock

  • 4c1e34c0dbff vsock: Enable y2038 safe timeval for timeout
  • 685c3f2fba29 vsock: Refactor vsock_*_getsockopt to resemble sock_getsockopt

memcg/slab

I found and suggested a fix for a bug in the memory CGroup. It was refused in favor of a more general fix.

mm: memcg/slab: Stop reparented obj_cgroups from charging root

OpenQA

Full operating system testing framework.

New QEMU backend

The QEMU backend manages VMs used for testing. OpenQA makes use of snapshotting to revert a VM to a good state (anchor point) when something fails and continue testing.

  • Problem: Snapshotting was slow and would fail under many circumstances. Additionally there were many smaller issues, such as exporting UEFI firmware variables. This made some testing impossible.
  • Solution: Rewrite the QEMU backend
  • Benefits: Our test matrix could be expanded by a considerable amount, increasing coverage and finding more problems.

Real serial console

  • Problem: As OpenQA was originally purely a GUI testing framework, serial consoles were supported in a very roundabout way
  • Solution: Directly support serial consoles (primarily virtio in QEMU)
  • Benefits: A dramatic decrease in test runtime, which means developers get test results faster and have more time to react to failures. It also decreased resource usage, allowing more testing.

LTP test runner

  • Problem: Only some LTP tests were being run, many were failing, and determining whether a failure was due to a product bug or a test bug was difficult
  • Solution: Write a test runner for LTP within OpenQA that displayed test results well and provided debugging aids.
  • Benefits: A big expansion in test coverage, kernel bugs were reported with useful info and LTP tests could be flagged for fixing.

JDP

A now-defunct data analysis framework in Julia, which I planned to use to create an “ontology” (to steal Palantir’s wording) that would allow bug reports, bug fixes, test failures and logs from various sources to be automatically cross-referenced.

  • Problem: An LTP test has failed; it’s not immediately obvious what all the relevant context may be.
  • Solution: Use whatever algorithms are suitable to scour all our data and find any information that may be relevant.
  • Benefits: It automatically matched test failures with existing bug reports, saving a lot of review time.

Web

Gripe.sh

A way to efficiently record time-wasting events. Firstly, it’s an experiment to learn about semantic search and Go. Secondly, I wanted to create a web app rapidly and refine my stack for doing so. Please see Gripe.sh for a video.

DoBu.uk

A SaaS which I have written about extensively and candidly, both here and on IndieHackers.

To summarise:

  • Problem: christineharp.co.uk needed to display her availability without taking bookings.
  • Solution: I worked with her to create a product which does that.
  • Benefit: She now has a beautiful calendar

I’m obsessed with latency, download size and minimising attack surface. These are not things that are very important to the app except for the core calendar component. If I started again I would indulge these obsessions by writing the core service within strict constraints, and then create all of the CRUD in no-code or similar.

While creating this I got into some interesting problems. For example:

  • Problem: I wanted to run my NodeJS app on the Nanos unikernel because it results in a very small and efficient VM with a limited attack surface. However, the version of NodeJS I wanted to use was trying to use the clone3 system call.
  • Solution: Implement clone3 in the Nanos kernel
  • Benefit: I now have the option of using Nanos with NodeJS

OK, I hold my hands up, I didn’t need to do that. However, if you are operating at scale, these things can bring a lot of benefit because you are not deploying and running unnecessary code.