How I may help you
In addition to the below I’m happy to discuss any of the topics on my website and what brought you here!
Please drop me an e-mail at io@richiejp.com to set up a call.
Systems
The below are some problems I could help with.
- Develop tests for kernel and system software
- Bug review
- Reproducer development
- Test harness development
- Track down bugs throughout your software stack
- Code review
- Kernel and userland tracing/debugging
- Niche, bespoke or legacy software maintenance
- Solve problems in software there is no expert for
- Explore and document mysterious code
- Introduce testing
- Solve performance and security issues at the source
- Make better use of lower-level system components, removing intermediaries
- Select better data structures and algorithms
- Reduce dependencies
- Add sandboxing and containerisation
How does this create value?
- Find problems before they happen in production
- Prioritise problem areas with the highest payoff for introducing testing
- Enable legacy systems to be updated or replaced
- Produce a better user experience by reducing latency
- Improve security by reducing the attack surface
Technologies and experience
Below are some technologies I have worked with.
Systems
I know Linux having spent the last 6+ years writing kernel tests for SUSE. I also have knowledge of more niche systems such as FreeBSD and the Nanos unikernel.
Web
Having developed Gripe.sh and DoBu.uk I am intimately familiar with the following.
- Svelte with TypeScript and NodeJS
- TailwindCSS
- Redis/KeyDB
- Fly.io
- Containers
Protocols
I partially implemented HTTP/2 in Zig including HPACK and I wrote a minimal MsgPack implementation for the Linux Test Executor prototype. In addition I have some knowledge of Bluetooth, WiFi, Ethernet, TCP/IP, UDP, TLS, ASN.1, NFS etc.
Languages
The two major languages I am most familiar with are
- C
- JavaScript/TypeScript
Less well known are
- Julia
- Zig
To name a few, in the past I have used
- Go
- Perl
- C++ (QT)
- Rust
- C# (8+ years ago)
- Python
Computer Science
I am aware of cache efficient data structures and big O notation. I know the basics of compiler and operating system theory. There are problems which require theory (e.g. random number generation) and I can find the relevant material.
Projects
Here are some brief case studies of projects I have worked on and particular problems that I solved.
Systems
Linux Test Project
Mostly a collection of Linux kernel tests. I have made many contributions to it, including a lot of code review, and leveraged it within SUSE to test the Linux kernel.
Vulnerability (CVE) testing
- Problem: Fixes for bugs which cause vulnerabilities sometimes do not work.
- Solution: Create reproducers for those bugs which test the fix.
- Benefit: Some bad fixes were detected, which had the second-order effect of highlighting broken procedures in how fixes are backported.
Creating reproducers is not a new concept. However I rebooted efforts to get more reproducers into the LTP. This encouraged others to contribute as well.
FuzzSync race exposition library
This leads on from the previous one.
- Problem: Reproducing some bugs requires reliably reproducing a data race. The usual methods of doing this are either resource intensive or require a particular kernel.
- Solution: Create a library which makes that easier
- Benefit: We can easily reproduce most bugs involving a data race without resorting to tricks that require a particular kernel config.
CGroup API
Control groups have emerged as a critical kernel interface. They are used by container, VM and system managers. As part of a larger effort to increase test coverage of them I increased LTP’s support.
- Problem: It’s difficult to write tests which interact with both V1 and V2 of the kernel’s CGroup API, and to discover the existing CGroup setup created by, e.g., systemd.
- Solution: Create a compatibility layer which abstracts controller discovery, CGroup creation and interactions.
- Benefit: It is far easier now to write tests which interact with CGroups, for example cfs_bandwidth01 which I wrote. More importantly it encourages others to write tests interacting with CGroups.
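As a rough illustration of the discovery part only, here is a minimal sketch in C. It is not the LTP API: the names cg_version, cg_classify_mount and cg_scan_mounts are hypothetical, and it assumes the standard mounts file format, in which the third field is "cgroup" for a V1 hierarchy and "cgroup2" for V2.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical names for illustration; this is not the LTP API. */
enum cg_version { CG_NONE = 0, CG_V1 = 1, CG_V2 = 2 };

/*
 * Classify one line in /proc/self/mounts format:
 *   <device> <mount point> <fs type> <options> <dump> <pass>
 * The third field is "cgroup" for a V1 hierarchy and "cgroup2" for V2.
 */
enum cg_version cg_classify_mount(const char *line)
{
	char fstype[32];

	assert(line != NULL);

	/* Skip the first two fields and read the filesystem type. */
	if (sscanf(line, "%*s %*s %31s", fstype) != 1)
		return CG_NONE;
	if (strcmp(fstype, "cgroup2") == 0)
		return CG_V2;
	if (strcmp(fstype, "cgroup") == 0)
		return CG_V1;
	return CG_NONE;
}

/*
 * Return a bitmask of the versions found in a mounts-format file,
 * or 0 if the file cannot be opened. A result of CG_V1 | CG_V2
 * indicates a hybrid setup with both hierarchies mounted.
 */
unsigned int cg_scan_mounts(const char *path)
{
	char line[4096];
	unsigned int found = 0;
	FILE *f = fopen(path, "r");

	if (!f)
		return 0;
	while (fgets(line, sizeof(line), f))
		found |= cg_classify_mount(line);
	fclose(f);
	return found;
}
```

Calling cg_scan_mounts("/proc/self/mounts") would tell a test which hierarchies are present before it decides how to create its own group. A real compatibility layer also has to work out which controllers are attached where, which this sketch leaves out.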
Sparse static analysis
- Problem: We encounter repetitive mistakes during review especially around LTP library usage
- Solution: Implement our own C static analysis tool based on Sparse. So far only 3 checks have been implemented.
- Benefit: 3 fewer problems around improper usage of the API. A better experience for contributors and maintainers.
Improving the new user/contributor experience
- Problem: It wasn’t immediately clear how to run the tests or write a new one
- Solution: Add a quick start and other documentation
- Benefit: More new users and contributors, assuming it wasn’t a coincidence
eBPF testing and more
I have introduced other areas of testing which follow the same pattern as above.
- Problem: It’s difficult to test some kernel feature consistently across different systems.
- Solution: Introduce some supporting code into the LTP test framework
- Benefit: Increased test coverage of the kernel and better feedback for kernel developers
Linux Kernel
I have found and fixed a number of bugs in the Linux kernel. The reasons for doing this are rather roundabout.
- Problem: I/we don’t fully understand the challenges in fixing a kernel bug
- Solution: Personally fix some kernel bugs
- Benefit: First-hand experience of what is required or nice to have when fixing a bug, and of the challenges of testing during development, leading to better testing.
CAN and SLIP
Found and fixed some issues in CAN and SLIP which are potentially exploitable:
- b9258a2cece4 slcan: Don’t transmit uninitialized stack data in padding
- 0ace17d56824 can, slip: Protect tty->disc_data in write_wakeup and close with RCU
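To illustrate the class of bug behind the slcan commit, here is a minimal userspace sketch in C. It is not the kernel patch: struct demo_frame and both function names are hypothetical, and the layout assumes a common ABI where the compiler inserts padding after a one-byte member. The point is that assigning members one by one is not guaranteed to initialise padding, so stale stack bytes can leak when the whole struct is copied out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical frame layout for illustration. On common ABIs the
 * compiler inserts 3 padding bytes between len and payload.
 */
struct demo_frame {
	uint32_t id;
	uint8_t len;
	uint32_t payload;
};

/*
 * Fill a frame so that no stale bytes, padding included, remain
 * when the struct is later copied out as a whole.
 */
void demo_frame_fill(struct demo_frame *f, uint32_t id, uint8_t len,
		     uint32_t payload)
{
	assert(f != NULL);

	/* Zero the whole object first, then set the members. */
	memset(f, 0, sizeof(*f));
	f->id = id;
	f->len = len;
	f->payload = payload;
}

/* Count non-zero bytes between the end of len and the start of payload. */
size_t demo_frame_dirty_padding(const struct demo_frame *f)
{
	const unsigned char *raw = (const unsigned char *)f;
	size_t start = offsetof(struct demo_frame, len) + sizeof(f->len);
	size_t end = offsetof(struct demo_frame, payload);
	size_t i, dirty = 0;

	for (i = start; i < end; i++)
		dirty += raw[i] != 0;
	return dirty;
}
```

A reproducer for this kind of issue can poison the memory first (for example with memset to a non-zero byte), fill the frame, and then check that demo_frame_dirty_padding reports nothing left over.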
vsock
- 4c1e34c0dbff vsock: Enable y2038 safe timeval for timeout
- 685c3f2fba29 vsock: Refactor vsock_*_getsockopt to resemble sock_getsockopt
memcg/slab
I found and suggested a fix for a bug in the memory CGroup. It was refused in favor of a more general fix.
mm: memcg/slab: Stop reparented obj_cgroups from charging root
OpenQA
Full operating system testing framework.
New QEMU backend
The QEMU backend manages VMs used for testing. OpenQA makes use of snapshotting to revert a VM to a good state (anchor point) when something fails and continue testing.
- Problem: Snapshotting was slow and would fail under many circumstances. Additionally there were many smaller issues, such as exporting UEFI firmware variables. This made some testing impossible.
- Solution: Rewrite the QEMU backend
- Benefits: Our test matrix could be expanded by a considerable amount, increasing coverage and finding more problems.
Real serial console
- Problem: Originally being purely a GUI testing framework, serial consoles were supported in a very roundabout way
- Solution: Directly support serial consoles (primarily virtio in QEMU)
- Benefits: Dramatic decrease in test runtime which means developers get test results faster and have more time to react to failures. It also decreased resource usage allowing more testing.
LTP test runner
- Problem: Only some LTP tests were being run, many were failing and determining whether it was due to a product bug or test bug was difficult
- Solution: Write a test runner for LTP within OpenQA that displayed test results well and provided debugging aids.
- Benefits: A big expansion in test coverage, kernel bugs were reported with useful info and LTP tests could be flagged for fixing.
JDP
Now defunct data analysis framework in Julia which I planned to use to create an “ontology” (to steal Palantir’s wording) that would allow bug reports, bug fixes, test failures and logs from various sources to be automatically cross referenced.
- Problem: An LTP test has failed; it’s not immediately obvious what all the relevant context may be.
- Solution: Use whatever algorithms to scour all our data and find any information that may be relevant.
- Benefits: It automatically matched test failures with existing bug reports saving a lot of review time
Web
Gripe.sh
A way to efficiently record time wasting events. It’s firstly an experiment to learn about semantic search and Go. Secondly I wanted to create a web app rapidly and refine my stack for doing so. Please see Gripe.sh for a video.
DoBu.uk
A SaaS which I have written about extensively and candidly, both here and on IndieHackers.
To summarise:
- Problem: christineharp.co.uk needed to display her availability without taking bookings.
- Solution: I worked with her to create a product which does that.
- Benefit: She now has a beautiful calendar
I’m obsessed with latency, download size and minimising attack surface. These are not things that are very important to the app except for the core calendar component. If I started again I would indulge these obsessions by writing the core service within strict constraints. Then create all of the CRUD in nocode or similar.
While creating this I got into some interesting problems. For example:
- Problem: I wanted to run my NodeJS app on Nanos Unikernel because it results in a very small and efficient VM with limited attack surface. However the version of NodeJS I wanted to use was trying to use the clone3 system call.
- Solution: Implement clone3 in the Nanos kernel
- Benefit: I now have the option of using Nanos with NodeJS
OK, I hold my hands up, I didn’t need to do that. However, if you are operating at scale, these things can bring a lot of benefit because you are not deploying and running unnecessary code.