Testing software is one of the most complex tasks in software engineering. Traditional software engineering has principles that define unambiguously how software should be tested, but the same does not hold for machine learning, where testing strategies are not always defined. In this post, I describe a testing approach that is strongly influenced by one of the most recognized testing strategies in software engineering: test-driven development. The approach also appears to be agnostic to the family of machine learning models under test, and it adapts well to the production environments behind today’s large-scale AI/ML services.

After reading this post, you will know how to set up a testing strategy for machine learning models with production in mind. “With production in mind” means that the team you work in is heterogeneous: the project under test is developed together with other data scientists, data engineers, business customers, developers, and testers. The goals of a good testing strategy are to achieve production readiness and improve code maintainability.

An appropriate name for the approach is Test-First machine learning, TFML for short, because everything starts from writing tests rather than models.

Steps of TFML

TFML starts from writing tests instead of machine learning models. The approach relies on mocking whatever is not yet available, so that the different actors involved in the project can proceed with their tasks anyway. Data scientists and data engineers often work at a different pace. Mocking a part of the system that is not yet available mitigates that difference and reduces blockers within larger teams, which in turn increases efficiency. Below are the five essential steps of a TFML approach.

1. Write a test

As the name suggests, Test-First in TFML means that everything starts with writing a test, even for a feature that does not yet exist. Such a test is usually very short and should stay so. Larger and more complex tests should be broken down into their essential, testable components. A test can be written once the feature’s specs and requirements are understood; these are usually discussed earlier, during requirement analysis (e.g. use cases and user stories).

2. Validate a test

A working test will fail or pass for the right reasons. This is the step in which such reasons are defined. Defining the happy path is essential to defining what should be observed and considered a success.

3. Write the code

In this step, the code that leads to the happy path is actually written. This code will cause the test to pass. No other code, beyond the test’s happy path, should be provided. For example, if a machine learning model is expected to return 42, one can just return 42 and force the test to succeed here. If time constraints matter, adding sleep(milliseconds) to imitate the expected latency is also acceptable. Such mocked values give engineers visible constraints, so that they can proceed with their tasks as if the model were complete and working.
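A mocked model of this kind could look like the sketch below. The class name `AnswerModel` and the `latency_seconds` parameter are assumptions made for the example. Note that Python’s `time.sleep` takes seconds, not milliseconds.

```python
import time


class AnswerModel:
    """Mocked model: just enough code to make the happy-path test pass."""

    def __init__(self, latency_seconds=0.0):
        # Hypothetical knob imitating how long a real prediction might take.
        self.latency_seconds = latency_seconds

    def predict(self, features):
        time.sleep(self.latency_seconds)  # time.sleep expects seconds
        return 42  # hard-coded on purpose; there is no real inference yet


def test_predict_returns_expected_answer():
    model = AnswerModel(latency_seconds=0.01)
    assert model.predict([1.0, 2.0, 3.0]) == 42
```

Anyone who consumes the model, for instance an API or a data pipeline, can already be built and tested against this mock while the real model is still being developed.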

4. Run tests

Adding new tests should never break the previous ones. Having tests that depend on each other is considered an anti-pattern in software engineering.

5. Add functionality (+ cleanup + refactor)

Once values are mocked, success conditions are defined, and tests are running, it’s time to show that the ML model under test actually trains and makes predictions. Continuing the example above, some questions that should be answered in this step are:

  • Is the test breaking the constraints we set previously?
  • Is our ML model returning 84 rather than 42?
  • How about time constraints?

Traditionally, in this step developers perform code cleanup, deduplication, and refactoring (whenever it applies) to improve both readability and maintainability. ML developers should do the same.
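To illustrate this step, the sketch below replaces a hard-coded answer with a model that is actually trained, while the tests keep asking the same questions: does it still return 42, and does it answer in time? The one-feature least-squares regression, the training data following y = 2x, and the 0.5 second budget are all assumptions chosen for the example.

```python
import math
import time


class AnswerModel:
    """One-feature least-squares regression, standing in for a real model."""

    def fit(self, xs, ys):
        n = len(xs)
        mean_x = sum(xs) / n
        mean_y = sum(ys) / n
        covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        variance = sum((x - mean_x) ** 2 for x in xs)
        self.slope = covariance / variance
        self.intercept = mean_y - self.slope * mean_x
        return self

    def predict(self, x):
        return self.slope * x + self.intercept


def make_trained_model():
    # Made-up training data following y = 2x.
    xs = [1.0, 2.0, 3.0, 4.0]
    ys = [2.0, 4.0, 6.0, 8.0]
    return AnswerModel().fit(xs, ys)


def test_prediction_is_still_42():
    model = make_trained_model()
    # Tolerance instead of exact equality: a returned 84 would fail here.
    assert math.isclose(model.predict(21.0), 42.0, rel_tol=1e-6)


def test_prediction_meets_time_budget():
    model = make_trained_model()
    start = time.perf_counter()
    model.predict(21.0)
    elapsed = time.perf_counter() - start
    assert elapsed < 0.5  # assumed, deliberately generous budget in seconds
```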


Falling into the trap of alternative approaches is easier in machine learning, due to its nature and to the enthusiasm of data scientists who connect, train, and analyze data in no time.

The most common approach in the data science community is probably Test-Last, a.k.a. code now, test later. This approach can be extremely risky in ML model development, because even a trivial linear regression may have more moving parts than traditional software (e.g. UI, API calls, data streams, databases, preprocessing steps). The Test-First approach, by contrast, pushes developers to put the minimum amount of code into the modules that depend on such moving parts (e.g. UIs and databases) and to implement the logic in the testable section of the codebase.
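One way to picture this separation is sketched below: the logic lives in a pure function that is trivial to test, and the code touching a moving part is reduced to a thin adapter. The function names and the choice of min-max normalization as the preprocessing step are hypothetical.

```python
def normalize(values):
    """Pure preprocessing logic: scale values into the range [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


def load_and_normalize(fetch_rows):
    """Thin adapter around a moving part (database, API, stream, ...).

    `fetch_rows` is any zero-argument callable returning numbers. In
    production it would wrap the real data source; in tests it is a fake.
    """
    return normalize(list(fetch_rows()))


def test_normalize_scales_to_unit_range():
    assert normalize([10.0, 20.0, 30.0]) == [0.0, 0.5, 1.0]


def test_adapter_works_with_a_fake_source():
    def fake_source():
        return [5, 5, 5]

    assert load_and_normalize(fake_source) == [0.0, 0.0, 0.0]
```

Because the adapter contains almost no logic, nearly everything worth testing sits in `normalize`, which needs neither a database nor a network connection.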

One important pitfall to avoid is developer bias. Tests created in a Test-First environment are usually written by the same developer who writes the code being tested. This can be a problem if, for example, the developer does not think of checking certain input parameters. In that case, neither the test nor the code will verify such parameters. There is a reason why, in traditional software development, testing engineers and developers are usually not the same individuals.

TFML anti-patterns

Below are some anti-patterns in TFML.

Test dependence

Tests should be standalone. Tests that depend on other tests can lead to cascading failures, or to successes, that are outside the developer’s control.
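A simple way to keep tests standalone is to let each one build its own data rather than share mutable state, as in the sketch below. The helper `make_training_data` and its contents are made up for the example. With pytest, a fixture would typically play the same role.

```python
def make_training_data():
    """Build a fresh, made-up data set so no test sees another's changes."""
    return {"features": [1.0, 2.0, 3.0], "labels": [2.0, 4.0, 6.0]}


def test_dropping_a_sample_keeps_features_and_labels_aligned():
    data = make_training_data()
    data["features"].pop()
    data["labels"].pop()
    assert len(data["features"]) == len(data["labels"]) == 2


def test_training_data_has_three_samples():
    # Passes whether or not the test above ran first, and in any order.
    data = make_training_data()
    assert len(data["features"]) == 3
```

Had both tests shared a single module-level data set, the second would pass or fail depending on whether the first had already removed a sample.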

Test model precisely

As in traditional software engineering, testing exact execution behavior, timing, or performance leads to brittle tests that fail for reasons unrelated to correctness. In machine learning, it is even more important to use soft constraints, because models can be probabilistic. Moreover, the ranges of output variables and input data can change. Such dynamic and sometimes loosely defined behavior is the norm rather than the exception in ML.
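A soft constraint can be expressed as a range or a tolerance instead of an exact value. The sketch below assumes a hypothetical `NoisyModel` whose answer is 42 plus Gaussian noise; the bounds and tolerances are illustrative and would need to be agreed on for a real model.

```python
import random
import statistics


class NoisyModel:
    """Hypothetical probabilistic model: the answer is about 42, plus noise."""

    def __init__(self, seed=None):
        self._rng = random.Random(seed)

    def predict(self, features):
        return 42.0 + self._rng.gauss(0.0, 0.5)


def test_predictions_stay_within_soft_bounds():
    model = NoisyModel(seed=0)  # fixed seed keeps the test reproducible
    predictions = [model.predict([1.0]) for _ in range(200)]
    # Assert ranges and aggregate behavior, not one exact number.
    assert all(38.0 <= p <= 46.0 for p in predictions)
    assert abs(statistics.mean(predictions) - 42.0) < 0.5
```

An assertion such as `model.predict([1.0]) == 42.0` would fail on almost every run of this model, even though the model behaves as intended.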

Test model’s mathematical details

Testing model implementation details, such as statistical and mathematical soundness, is not part of the TFML strategy. Such details should be tested separately and are specific to the family of the model under consideration.

Large testing unit

The testing surface should always be minimal for the functionality under test. Keeping the testing unit small gives more control to the developer. Larger testing units should be broken down into smaller tests, specialized in one particular aspect of the models to be tested.

Conclusion

The TFML approach forces developers to invest time up front in defining the testing strategy for their models. This in turn facilitates the integration of such models into the bigger picture of complex engineering systems where larger teams are involved. It has been observed that programmers who write more tests tend to be more productive. Testing code is as important as developing the software’s core functionality, and it should be produced and maintained with the same rigor as production code. In ML, all of this becomes even more critical, due to the heterogeneity of the systems and of the people involved in ML projects.