I’ve been working on a project called testtrim, which selects and runs automated software tests based upon previous code-coverage data and git changes. The concept was introduced in October 2024, and over the past two months it has made great progress towards its most interesting goal. It is getting dangerously close to a major milestone of being “self-hosted” – in this context, that means being used as the engine to run its own tests in its own continuous integration system. Today I’ll review the goals and challenges currently on the table, and then take an inventory of all the improvements in the tool since the concept was introduced.

Current Project Goals

The core goal of the project is: determine whether this will work for large-scale projects.

In short, the concept is:

  1. Run tests with a coverage tool, just like you would to report test coverage (eg. “75% of our code is tested!”). But rather than running the entire test suite and reporting generalized coverage, run each individual test separately to get the coverage for each test.
  2. Invert the data; change “test case touches code” into “code was touched by test case”, and then store it in a database (a minimal sketch of this inversion follows this list).
  3. Look at a source control diff since the last time you did #2 to find out what changed in the code, then look those changes up in the database to see which test cases need to be run.
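
To make steps 2 and 3 a little more concrete, here is a minimal sketch of the inversion and lookup, assuming the simplest possible file-level granularity. The type and function names are illustrative only – they aren’t testtrim’s actual internals.

```rust
use std::collections::{HashMap, HashSet};

/// Step 1 output: each test maps to the set of source files it touched.
/// (Illustrative types, not testtrim's actual schema.)
type PerTestCoverage = HashMap<String, HashSet<String>>;

/// Step 2: invert "test case touches code" into "code was touched by test
/// case", so a changed file can be looked up directly.
fn invert_coverage(per_test: &PerTestCoverage) -> HashMap<String, HashSet<String>> {
    let mut touched_by: HashMap<String, HashSet<String>> = HashMap::new();
    for (test, files) in per_test {
        for file in files {
            touched_by
                .entry(file.clone())
                .or_default()
                .insert(test.clone());
        }
    }
    touched_by
}

/// Step 3: given the files changed since the last stored coverage map,
/// collect the tests that need to be rerun.
fn tests_to_run<'a>(
    touched_by: &'a HashMap<String, HashSet<String>>,
    changed_files: &[String],
) -> HashSet<&'a String> {
    changed_files
        .iter()
        .filter_map(|file| touched_by.get(file))
        .flatten()
        .collect()
}
```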

It has been trivial to prove that this will work on microscopic test projects, but real-world projects come with a lot of complexity that muddies the issue:

  • Tests which read data files, for example for test fixture data
  • Tests which access network resources, which you might want to run always, or you might want to run never, or somewhere in-between
  • Programming environments which allow you to embed data files into code (eg. in Rust, include!() and similar; in Go //go:embed; etc.)
  • Upgrades of third-party dependencies

These are just some of the most obvious problems. In order to uncover more problems and develop solutions to them, testtrim has two medium-term goals:

  1. Run testtrim’s own test automation suite, in its own CI system, under testtrim.
  2. Incorporate testtrim into an Open Source product’s CI system.

As these goals are tackled I expect to get some clarity on that core determination, and find out whether this concept will work, won’t work, or most likely, that it will work but has some limited applicability which I can begin to define.

New Features Since Last Update

syscall Tracing

On Linux, testtrim is capable of tracing a test with strace in order to identify all the system calls (syscalls) that the test makes. After analyzing the output, testtrim identifies two things that add onto the code-coverage logic and can create additional reasons for a test to be rerun in the future (a rough sketch of this classification follows the list):

  • Opening local files within the repo
    • If test_a reads a local file tests/fixture/data.txt when the test is run, then in the future if the git data shows that tests/fixture/data.txt is modified, testtrim will rerun test_a.
  • Network access
    • If test_a opens a network port to 10.1.1.1:5432 when the test is run, then in the future testtrim will always rerun test_a with the assumption that the network access might “be different now”.
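
As a rough illustration of what that analysis produces – this isn’t testtrim’s actual parser, and the strace line handling is heavily simplified – the two bullet points above boil down to classifying syscalls into two dependency kinds:

```rust
use std::net::SocketAddr;
use std::path::{Path, PathBuf};

/// The two extra dependency kinds described above (illustrative names).
#[derive(Debug)]
enum TestDependency {
    /// A file inside the repository that the test opened.
    LocalFile(PathBuf),
    /// A socket address the test connected to.
    Network(SocketAddr),
}

/// Classify one line of strace output; sockaddr parsing is elided.
fn classify(line: &str, repo_root: &Path) -> Option<TestDependency> {
    if line.starts_with("openat(") || line.starts_with("open(") {
        // e.g. openat(AT_FDCWD, "tests/fixture/data.txt", O_RDONLY) = 3
        let raw = line.split('"').nth(1)?;
        let path = repo_root.join(raw);
        // Only paths inside the repository become dependencies.
        return path
            .starts_with(repo_root)
            .then(|| TestDependency::LocalFile(path));
    }
    if line.starts_with("connect(") {
        // e.g. connect(5, {sa_family=AF_INET, sin_port=htons(5432), ...}, 16) = 0
        return parse_sockaddr(line).map(TestDependency::Network);
    }
    None
}

/// Placeholder for extracting "10.1.1.1:5432"-style addresses from the sockaddr text.
fn parse_sockaddr(_line: &str) -> Option<SocketAddr> {
    None // elided for brevity
}
```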

However, the network access logic of “rerun every network test” is extremely conservative. So, testtrim now supports network configuration to customize the behavior for network access; it is now possible to do either of the following (a hypothetical sketch of the resulting policy decision appears after this list):

  • Ignore test network access – good for something like an internal test server.
  • Rerun tests that access the network only when other files change – good for something like “rerun all database tests whenever the database schema in these files changes”.
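
I won’t reproduce the real configuration format here; instead, here’s a hypothetical sketch of the decision the configuration drives, with made-up addresses and file names:

```rust
use std::net::SocketAddr;

/// Hypothetical policy outcomes matching the two options above, plus the
/// conservative default. This is not testtrim's real configuration format.
enum NetworkPolicy {
    /// Pretend the access didn't happen (eg. an internal test server).
    Ignore,
    /// Don't rerun for the access itself, but do rerun when any of these
    /// files change (eg. database schema files for database tests).
    RerunOnFileChange(Vec<String>),
    /// The conservative default: always rerun the test.
    AlwaysRerun,
}

/// Hypothetical matcher mapping an observed socket address to a policy.
fn policy_for(addr: &SocketAddr) -> NetworkPolicy {
    match addr.port() {
        // PostgreSQL: rerun these tests only when the schema files change.
        5432 => NetworkPolicy::RerunOnFileChange(vec!["db/schema.sql".to_string()]),
        // Internal mock API server used purely as a test fixture.
        8443 => NetworkPolicy::Ignore,
        _ => NetworkPolicy::AlwaysRerun,
    }
}
```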

External Dependency Tracing

When you upgrade an external dependency, testtrim will identify which tests used that dependency and rerun those tests. The modification to Cargo.lock or go.mod is used to identify the changed module, and coverage tracking enabled across the dependencies is used to find the relevant tests.
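
For Rust, one way to picture the first half of that – identifying changed packages from the lock file – is a diff of name/version pairs between two Cargo.lock snapshots. This is only a sketch with a deliberately naive parser, not testtrim’s implementation:

```rust
use std::collections::HashMap;

/// Extremely rough Cargo.lock reader: collect `name = "..."` / `version = "..."`
/// pairs from each [[package]] block. A real implementation would use a proper
/// TOML parser; this is only an illustration.
fn lock_packages(lock_contents: &str) -> HashMap<String, String> {
    let mut packages = HashMap::new();
    let (mut name, mut version) = (None, None);
    for line in lock_contents.lines() {
        if let Some(v) = line.strip_prefix("name = ") {
            name = Some(v.trim_matches('"').to_string());
        } else if let Some(v) = line.strip_prefix("version = ") {
            version = Some(v.trim_matches('"').to_string());
        }
        if let (Some(n), Some(v)) = (&name, &version) {
            packages.insert(n.clone(), v.clone());
            name = None;
            version = None;
        }
    }
    packages
}

/// Dependencies whose version changed (or that were added) between two commits;
/// these are then looked up in the coverage map like any other changed input.
fn changed_dependencies(old_lock: &str, new_lock: &str) -> Vec<String> {
    let (old, new) = (lock_packages(old_lock), lock_packages(new_lock));
    new.into_iter()
        .filter(|(name, version)| old.get(name) != Some(version))
        .map(|(name, _)| name)
        .collect()
}
```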

Go & .NET Platforms

testtrim is a project written in Rust, and so it targeted Rust as a first platform to operate on. I added some abstractions early in the development to support multiple platforms in the future, but I didn’t have any idea whether those abstractions would allow other platforms to actually be implemented.

The feature comparison between the Rust, Go, and .NET (C#, etc.) platforms covers:

  • File-based coverage tracking (ie. changes that will affect tests are tracked on a file-by-file basis; the least granular but simplest approach)
  • Function-based coverage tracking (only theorized, not implemented at all yet)
  • External dependency change tracking
  • syscall tracking for file & network tracking
  • Embedded file tracking (ie. if a file embeds another file, changes to either will trigger related tests)
  • Performance (Rust: 👍, Go: OK, .NET: Mega-👎)
  • Test self-tracing (ie. modifications to tests are traced by coverage)

The Go platform is basically feature-for-feature comparable with Rust. There is one major exception: Go doesn’t perform coverage tracing on test files (*_test.go). testtrim tries to cover this up with some file heuristics and regexes, but it does leave a gap: functions in *_test.go files that are reused by other tests won’t correctly trigger those tests to rerun.
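
One plausible shape for such a heuristic – not necessarily what testtrim actually does – is to map a changed *_test.go file to the test functions it declares and mark those for rerun:

```rust
/// Given the contents of a changed *_test.go file, list the test functions it
/// declares (func TestXxx(...)). A real implementation would use the Go AST or
/// at least a regex; this is only an illustration of the heuristic.
fn tests_declared_in(go_test_source: &str) -> Vec<String> {
    go_test_source
        .lines()
        .filter_map(|line| line.strip_prefix("func Test"))
        .filter_map(|rest| rest.split('(').next())
        .map(|suffix| format!("Test{suffix}"))
        .collect()
}
```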

The .NET platform ran into some very large limitations with external dependency tracking and has been frozen half-implemented for the time being. Further developing it doesn’t get me any closer to the main “determine whether this will work for large-scale projects” goal, but I think it will be possible to improve it in the future at the right time.

Remote API Server & Client

In order to use testtrim in a CI environment, it will be necessary to have a persistent database that is accessible between CI invocations. To support this, testtrim has an API server and client that exposes its coverage database.

The server is run with the testtrim run-server subcommand.

The client is used by setting TESTTRIM_DATABASE_URL to the http or https address of the server.

History Simulation

I’ve been exploring how effective testtrim is by taking Open Source projects and running testtrim against their recent commits – in other words, going back 100 commits and then running testtrim on each subsequent commit to see how many tests would be run.

The testtrim simulate-history subcommand automates this process and outputs a CSV file with statistical information about the simulation.

For example, as part of developing the Go platform I completed a simulation of an Open Source project called zap. The simulation results are in line with other Rust projects, suggesting that about 90% of the tests are skipped on average.

# of Commits: 100
# of Commits w/ Successful Build & Test: 100
# of Commits with Ancestor; testtrim Worked: 99
Average Tests to Run per Commit: 10.1%
Median Tests to Run per Commit: 0.4%
P90 Tests to Run per Commit: 36.6%
Average # Files Changed per Commit in Range: 3.97
Average Ext. Dep. Changed per Commit in Range: 1.25

Minor Features

  • Embedded files:
    • When files are embedded in the source code (Rust: include!(), include_str!(), include_bytes!(), Go: //go:embed), any modifications to those files will be treated as a modification to the embedding file. This means that related tests should be rerun automatically.
  • Test coverage tags:
    • When running tests, the --tags command-line option can be used to differentiate coverage, allowing the same project to be tested in different configurations. For example, you could run tests with a tag database=postgresql, and later run tests with a tag database=mysql, and the two tags would have their coverage maps tracked separately. This allows a change that only affects one codepath to trigger only the tests related to that codepath. (A hypothetical sketch of tag-separated coverage storage appears after this list.)
    • An automatic platform tag is enabled by default, eg. x86_64-unknown-linux-gnu. If a project is multiplatform but uses conditional compilation, this prevents coverage data from one platform leaking into, and corrupting, another platform’s coverage map.
  • Transparency into test targeting:
    • When get-test-identifiers is run, testtrim outputs each test that it plans to run for the current repository, as well as why it will run. It might run because: there’s no coverage map to calculate results on, the test is new, a file has changed, a network dependency is present, an external dependency of the project has changed, etc.
  • Parallel test execution:
    • For the Go and Rust projects, test execution is parallelized to improve performance.
  • Release tooling:
    • As part of running testtrim in a CI environment, and hosting a testtrim API server, I need to have a released version of testtrim. testtrim now has a fully automated release process with automatic changelogs, and publishes an OCI container for running the API server.
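
On the coverage tags point above, the sketch below shows one hypothetical way tag-separated coverage storage could be shaped – each distinct tag combination owns an independent coverage map. These are illustrative types, not testtrim’s actual database schema.

```rust
use std::collections::{BTreeMap, HashMap, HashSet};

/// A tag combination, eg. {"database": "postgresql",
/// "platform": "x86_64-unknown-linux-gnu"}. BTreeMap gives a stable,
/// hashable key regardless of the order tags were supplied in.
type Tags = BTreeMap<String, String>;

/// file -> tests that touch it (same shape as the earlier inversion sketch).
type CoverageMap = HashMap<String, HashSet<String>>;

/// Hypothetical storage: every tag combination gets its own coverage map, so
/// a database=postgresql run never mixes data with a database=mysql run.
#[derive(Default)]
struct CoverageDatabase {
    by_tags: HashMap<Tags, CoverageMap>,
}

impl CoverageDatabase {
    /// Look up (or create) the coverage map for one tag combination.
    fn coverage_for(&mut self, tags: Tags) -> &mut CoverageMap {
        self.by_tags.entry(tags).or_default()
    }
}
```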

testtrim on testtrim (“Self Hosting”)

A lot of issues have been resolved on the pathway to getting testtrim to run on itself, but a couple of complex issues remain:

Real-world Network Tracing

testtrim runs its tests under strace and captures the network access that each test performs. It’s very common for an automated test to reach out to a database, or a locally hosted version of the application, for example.

flowchart LR
    localhost:5432[(PostgreSQL on localhost:5432)]
    localhost:8443[(API on localhost:8443)]
    testtrim-->test_coverage_db
    test_coverage_db-->localhost:8443
    test_coverage_db-->localhost:5432

testtrim has new capabilities to match against these network patterns and apply rules to them. FIXME: add a link here to the network policy

When running testtrim-on-testtrim, most of the tests had network access patterns that were easy to match and apply the right logic to. But one end-to-end integration test started to expose some complications. This is trimmed output from the get-test-identifiers command, which in testtrim shows you which tests will be run and why; all the easy-to-deal-with problems have been removed:

RustTestIdentifier { test_src_path: "tests/linearcommits_filecoverage_tests.rs", test_name: "linearcommits_filecoverage::dotnet_test::dotnet_linearcommits_filecoverage" }
  ...
	NetworkDependency(Inet([::ffff:152.199.4.184]:443))
	NetworkDependency(Inet([::ffff:192.229.211.108]:80))
	NetworkDependency(Inet([::ffff:23.217.131.226]:80))
	NetworkDependency(Inet([2001:67c:1401:20f0::1]:443))
	NetworkDependency(Inet(10.89.0.1:53))
	NetworkDependency(Inet(217.197.91.145:443))
  ...

The remaining network dependencies fall into three categories:

  • Downloading from nuget.org during the .NET build process. This specific test is running a .NET build with an external dependency that needs to be downloaded – but even though there’s network access on every test run, it’s downloading a file that doesn’t change between runs.
  • Downloading a test repository (dotnet-test-specimen) from codeberg.org to run the .NET project against. This is somewhat the “subject under test”, which might be fair enough – but again, I don’t want this test to rerun just because it performs this download.
  • Accessing a CRL/CA endpoint to check whether any https access is subject to a certificate revocation list. There is extremely little reason to rerun tests because of this dependency – but I need to identify how to ignore it accurately.

Boiling down these examples a little gives two problems:

  • Sometimes a test might download “static content”. You don’t want the test to rerun just because it does this download; you want it to rerun if the content changes.
    • nixpkgs has an approach to external dependencies that might be inspiration for a solution here: download the dependencies outside of the test, hash them, use the hashes as an input to the test that is traced by testtrim, and then update the tests to use the downloaded local copies. It’s a great outcome for test quality – they get the right external content, they rerun when it changes, the test accomplishes the same test goal – but has the downside of drastically changing how a test is run. Maybe with tooling that does the heavy lifting it would be feasible?
  • When a test accesses a network resource, the syscall tracing doesn’t capture the DNS resolution involved, just the raw socket connect syscalls with their IP addresses. As a result it’s hard to create useful long-term network policies.
    • I’d love to be able to intercept the DNS requests a process makes and match the network policies on DNS names. With modern Linux systems that use caching daemons (nscd), I’m not sure whether I can capture these requests through strace.

Recursive strace Failure

When testtrim runs on a project, it runs each test independently to collect its coverage data. Each of those tests is also run under strace in order to capture the network and filesystem access that the test performs.

Imagine that you’re running testtrim on a project called “Test Project”, which has a test case “ABC”. After discovering all the tests in the project and evaluating what needs to be run, testtrim will eventually run the test “ABC”, wrapped in strace:

sequenceDiagram
    testtrim->>strace: Run Test ABC
    strace->>Test Project: Run Test ABC
    Test Project->>strace: Test Results w/ Coverage
    strace->>testtrim: Test Results w/ Coverage & syscalls

However, when I’m attempting to run testtrim on testtrim, the target project has a small number of test cases which themselves validate the strace functionality. That means that after discovering all the tests in the target project (testtrim), testtrim will try to run a subprocess to execute each of those tests under strace. (testtrim(cli) here indicates the testtrim command-line that is being executed, and testtrim(sut) indicates testtrim in the context of being a subject-under-test):

sequenceDiagram
    testtrim(cli)->>strace(1): Run Test 'e2e-test'
    strace(1)->>testtrim(sut): Run Test 'e2e-test'
    testtrim(sut)->>strace(2): Test uses `strace`...
    strace(2)--xtesttrim(sut): PTRACE_TRACEME: Operation not permitted

This failure occurs because strace(1) is running the test process with the --follow-forks option to capture information from all subprocesses. The subprocess is therefore already being traced, and the second invocation of strace fails.

This is an outstanding issue that prevents testtrim-on-testtrim, but I think it’s a pretty narrow problem that doesn’t generalize very well to other projects. That said, I still want to get it resolved. There are a few approaches that I have in mind:

  • It would be possible to mark some tests as exempt from syscall tracing. The affected tests wouldn’t be rerun when they should be, but it’s an option.
  • It might be possible for the subprocess to recognize that it is already being traced by testtrim, and then work with the tracing data generated by the outer trace (see the sketch after this list for one standard way a Linux process can detect an existing tracer). This would look something like testtrim(cli) setting an environment variable TESTTRIM_STRACE_OUTPUT, and then when testtrim(sut) starts its test it wouldn’t start a new strace, but instead would read the data file at TESTTRIM_STRACE_OUTPUT for its syscalls. It’s a pretty ugly solution.
  • testtrim could have a new capability added to skip tests. Although this doesn’t truly solve the problem, it might be a reasonable workaround – I want testtrim to run in its own CI, but I also think I can’t trust testtrim to test testtrim, and it might make sense for testtrim to rerun its own test suite unconditionally.
  • testtrim could use some compile-time flags to remove the tests, rather than skip tests, during the CI.
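
For the second option in that list, a Linux process can at least detect that it is already being traced by reading the TracerPid field of /proc/self/status – that part is a standard Linux mechanism, though wiring it into testtrim’s tracing-data handoff is purely speculative on my part:

```rust
use std::fs;

/// Returns true if the current process already has a tracer attached (eg. an
/// outer strace started by testtrim(cli)). Reads the TracerPid field from
/// /proc/self/status, which is 0 when no tracer is present.
fn already_being_traced() -> bool {
    fs::read_to_string("/proc/self/status")
        .ok()
        .and_then(|status| {
            status
                .lines()
                .find(|line| line.starts_with("TracerPid:"))
                .and_then(|line| line.split_whitespace().nth(1))
                .and_then(|pid| pid.parse::<u32>().ok())
        })
        .map(|tracer_pid| tracer_pid != 0)
        .unwrap_or(false)
}
```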

An interesting problem that I’m a little stumped on, other than “don’t run these tests”.

What’s the future?

Short term:

  • Finish having testtrim integrated into its own CI.
    • Discover new problems from this. Fix them.
  • Choose a target Open Source project to integrate testtrim into.
    • Discover new problems from this. Fix them.

Medium-term:

  • testtrim should be usable on developer workstations to reduce test cycle time. But there are some obvious blockers:
    • syscall tracing is only supported on Linux.
    • Other capabilities that might be platform-dependent (eg. file pathing related work) haven’t been tested extensively outside of Linux.
  • Evaluate alternatives to having testtrim be a test executor.
    • Every platform has basic or advanced tools for test execution; for example, Rust has a great cargo test capability, but also an advanced alternative, cargo-nextest, with great value-add. I want testtrim to be able to focus on the challenging problem of fast, accurate test impact analysis and test selection; I don’t want to have to rebuild all the other capabilities of a test executor. Reconciling these two paths will be important to making a tool that adds value to an ecosystem.

Longer-term:

  • Distributed testing – eg. web application testing. I have a theory on how to do this with Open Telemetry tracing, but no design and no implementation today.

Follow along, share your thoughts!

testtrim is under heavy development, and work will continue in 2025… until we see whether it will really be practical. It’s an Open Source project, and I’d love to get feedback from people on the concept, and share in the development as it goes.