I’ve been working on a project called testtrim, which selects automated tests for execution based upon previous code-coverage data and git changes. It’s in early development, but it’s looking quite promising: evaluations show that, on average, 90% of tests can be safely skipped with this strategy.

If you’re more inclined to follow this kind of content in a video format, I’ve also published introductory videos in two lengths:

Short Introduction Video, 10 minutes

Deep-dive Introduction Video, 37 minutes

Problem

The longer a good software project lives, the longer its automated testing solution will take to run. This is especially true if the engineers responsible aren’t able to dedicate a lot of time to carefully managing their test suite, pruning it like a Japanese garden into a state of zen minimalism. In my experience, nobody has time for that.

Long automated testing cycles can be managed pretty effectively: they can be parallelized, they can be run on the fastest hardware, changes can be batched together, they can be moved around to different stages of the release cycle to get early likely-failure feedback, and of course, they can be [Ignored] (or .skip, or #[ignore], or whatever).

But can they be made more efficient by only running the things that are relevant to test the change being made?

testtrim’s Equation

  1. Just like you would to report test coverage (eg. “75% of our code is tested!”), run tests with a coverage tool. But rather than running the entire test suite and reporting aggregate coverage, run each test individually to get the coverage for each test (a rough sketch of this step follows the list).

  2. Invert the data: change “test case touches code” into “code was touched by test case”, and then store it in a database.

  3. Take a source-control diff since the last time you did step 2 to find out what changed in the code, then look those changes up in the database to see which test cases need to be run.
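
To make step 1 concrete, here’s a rough sketch of one way per-test coverage could be collected for a Rust project: build the tests with LLVM source-based coverage instrumentation (eg. RUSTFLAGS="-C instrument-coverage" cargo test --no-run), then run each test on its own with a unique LLVM_PROFILE_FILE so that every test writes its own coverage profile. The function name and paths below are hypothetical, for illustration only; this isn’t necessarily how testtrim itself does it.

use std::process::Command;

// Assumes `test_binary` was built with -C instrument-coverage, and that
// `test_names` came from listing the suite (eg. the test binary's --list flag).
// The function name and the coverage/ directory are made up for this sketch.
fn collect_per_test_coverage(test_binary: &str, test_names: &[&str]) -> std::io::Result<()> {
    for &test_name in test_names {
        // Run exactly one test, and have the instrumented binary write its
        // coverage profile to a file dedicated to that test.
        let profile = format!("coverage/{}.profraw", test_name.replace("::", "_"));
        let status = Command::new(test_binary)
            .args([test_name, "--exact"])
            .env("LLVM_PROFILE_FILE", &profile)
            .status()?;
        if !status.success() {
            eprintln!("test {test_name} failed");
        }
        // Each per-test .profraw can then be turned into "which source files did
        // this one test touch" data (llvm-profdata / llvm-cov) for step 2.
    }
    Ok(())
}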

This is the core concept behind testtrim.
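
Steps 2 and 3 then boil down to an inverted coverage map and a lookup against it. Here’s a minimal sketch with an in-memory map standing in for testtrim’s database and a plain list of file paths standing in for a real git diff; the names are hypothetical and only illustrate the shape of the data, not testtrim’s actual schema.

use std::collections::{HashMap, HashSet};

// Step 2: invert "test case -> files it touched" into "file -> test cases touching it".
// (testtrim stores this in a database; a HashMap stands in for it here.)
fn invert_coverage(
    coverage: &HashMap<String, HashSet<String>>, // test name -> files it touched
) -> HashMap<String, HashSet<String>> {          // file -> tests that touch it
    let mut by_file: HashMap<String, HashSet<String>> = HashMap::new();
    for (test, files) in coverage {
        for file in files {
            by_file.entry(file.clone()).or_default().insert(test.clone());
        }
    }
    by_file
}

// Step 3: given the files changed since the last coverage run (eg. from
// `git diff --name-only <last-covered-commit>`), look up which tests to re-run.
fn tests_to_run(
    by_file: &HashMap<String, HashSet<String>>,
    changed_files: &[String],
) -> HashSet<String> {
    changed_files
        .iter()
        .filter_map(|path| by_file.get(path))
        .flatten()
        .cloned()
        .collect()
}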

How well does it work?

It’s early days for testtrim.

To evaluate how well it works, I took an Open Source project (alacritty) and ran its last 100 commits through testtrim.

Commits: 100
Commits with Successful Build & Test: 66
Commits Appropriate for Analysis: 53
Average Tests to Run per Commit: 8.4%
Median Tests to Run per Commit: 0.0%
P90 Tests to Run per Commit: 38.9%

Across the commits appropriate for analysis, testtrim identified that, on average, only 8.4% of tests needed to be executed to fully test the code change being made.

I could list a dozen reasons why this analysis isn’t generalizable… and so I will:

  • This analysis didn’t include changes to Cargo.lock files (ie. changes to external dependencies). I’ve added that capability to testtrim, but I haven’t redone this analysis yet.
  • alacritty is a pretty mature project that isn’t undergoing rapid development.
  • This specific measurement didn’t take into account new tests being added to the repo during this period.
  • alacritty has some tests that work by reading static files; commits touching those files were removed from the analysis, possibly lowering the numbers.
  • There’s no evidence that results from a single project will generalize to lots of other projects.
  • The only guarantee of correctness in this analysis is my own eyeballing of the changes and proposed tests.

But on the other hand, this was just based upon the simple heuristic of “this test touched a file, and that file changed, therefore rerun this test.” I think that could become a lot more sophisticated with more work in the future as well.

I think it’s promising, but not promised.

What’s the future?

  • Small-scale technical items – these prevent testtrim from working well, even for local Rust apps, which are presumably going to be its jam since that’s the only thing it supports today:
    • Test execution performance / parallelism – testtrim doesn’t run tests in parallel, so it’s slower than using cargo test even when it runs fewer tests
    • Doesn’t work on all Rust codebases – possibly indicative of environmental differences between cargo test and testtrim, but could be other issues
  • Larger requirements – things that would really unlock value
    • Multiple platforms – eg. JavaScript, Java, C#, etc.
    • Distributed testing – eg. web application testing – I have a theory on how to do this with OpenTelemetry tracing, but no implementation today
  • Project risks – aside from technical work
    • How effective can it really be? – a few data points aren’t quite enough to be confident
    • How effective does it have to be for people to want to pick it up and use it? (Cost/Benefit)
      • Even as an Open Source project (which is the plan, today!), it still has costs to implement: complexity, engineering time, operational costs, and maintenance costs

Follow along, share your thoughts!

testtrim is under heavy development, and I expect it to move quickly through the end of 2024 and early 2025 into… well, we’ll see. It’s an Open Source project, and I’d love to get feedback from people on the concept and to share in the development as it goes.