Anti-Pattern: Blobs of test data

April 29, 2015 by nicolas, tagged testing and programming, filed under projects

During the development of automated tests, test data is sometimes represented as blobs stored in central repositories. These blobs are often shared across automated tests and help set them up. The repositories can take the form of code (constructing a complete tree of objects), files or even relational databases.

A shared repository of test data is often introduced because creating and setting up test data is difficult or costly, at development time, at execution time, or both. Some reasons:

1) the domain objects and their collaborators are hard to construct, fake or wire in a test,
2) the domain is itself very complex and test writers have to master many aspects of the domain to create the correct test data at runtime,
3) the creation of these objects takes time to execute.

Check whether those reasons really apply to your software project. Is 2) inherent to your domain? Can 1) and 3) be remedied? Could they even be a result of applying this anti-pattern?

Personal experience

I have worked on a project that had shared data in the form of a centralized database against which every unit and acceptance test suite was run.

The database had been created at some point in the past, then updated over time: sometimes by hand (handwritten SQL), sometimes by code, and also by standard database schema migrations.

When a test failed, it was either because the behavior of the code had changed (intentionally or not) or because the test data had not been migrated. Finding out the nature of the test data was also difficult. Did the author write this test against object O because it had a precise, intended setup, or because it merely happened to have some property the author liked? Those aspects were almost never documented. In effect, a test would often not document what it was constructed against.

The test data also only ever grew, because modifying an existing item meant risking breaking tests that you had no idea how to fix.

Why is a blob of test-data a unit-test anti-pattern?

A good unit-test is fast, precise, readable and isolated. It gives confidence in the working state of the system under test.

Tests become hard to read, imprecise and poorly isolated

Unit tests written against a blob of test data tend to be hard to read, poorly isolated and imprecise.

When a unit-test refers to the entire blob, or even to part of it, it potentially depends on the entire tree rather than isolating only a part of the system.

When the test cherry-picks one particular item of the test data blob, the precise setup that the test is using is barely described. One must read the data to find out what the test is actually doing.

When creating a new test, it is very tempting to just look around and reuse a piece of data that someone else has written. This becomes a liability if that item is modified later, and it implicitly couples the two tests (i.e. their failures are correlated).

It also means the test in question can never really state what its starting state is.

Even if each test cherry-picks exactly the right data within the blob, in practice each test ends up with its own test data inside the blob, which means that the blob grows with the number of tests and never shrinks.
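To make the coupling concrete, here is a minimal sketch. Everything in it is hypothetical and only stands in for a real project's code: the `sharedTestData` blob, the `countNotes` function under test and the `verify` helper are assumptions made for illustration.

```javascript
// Minimal assertion helper (an assumption; any test framework's assert would do).
function verify(condition) {
    if (!condition) {
        throw new Error("verification failed");
    }
}

// Hypothetical system under test.
function countNotes(song) {
    return song.notes.length;
}

// Hypothetical shared blob, maintained somewhere far away from the tests.
const sharedTestData = {
    songs: [
        { id: "song-1", notes: [{ midiPitch: 64, startTime: 7.0 }] },
        { id: "song-2", notes: [] },
    ],
};

// Blob style: nothing here says which property of songs[0] the test relies on.
function test_blobStyle_countsNotes() {
    verify(countNotes(sharedTestData.songs[0]) === 1);
}

// Own starting state: the one-note setup is visible in the test itself.
function test_countsNotesOfASongWithOneNote() {
    const song = { notes: [{ midiPitch: 64, startTime: 7.0 }] };
    verify(countNotes(song) === 1);
}
```

If someone later adds a second note to `song-1` for the benefit of another test, `test_blobStyle_countsNotes` starts failing although `countNotes` did not change, while the self-contained test is unaffected.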

Tests become hard to trust

Unit tests written against a blob of test data also tend to be harder to trust.

In the long run, as the application changes, so must the test data. When the test data is not correctly versioned or updated, it becomes difficult to trust. Code-generated data is superior in this respect, because it can at least be made to use the basic operations of the data model, which leads to well-formed test data. In practice, however, the data is always a bit of a mixture of static and generated data.

Tests are still slow

Finally, regarding performance: these blobs are often brought in to solve performance issues with setting up the tests. However, if the test data is mutable, all modifications made to the blob must be rolled back to keep each test isolated. This may undermine the expected performance benefits of the shared data.

It goes further: when the test data repository is actually a shared resource such as a database, then it is inefficient under heavy parallel testing, making the unit test suite run slowly.

Why is a blob of test-data an acceptance-test anti-pattern as well?

While a unit test tests a system, an acceptance test tests a product.

A good acceptance test embodies the specification of the product in user terms.

When written against a blob of test data, an acceptance test becomes poorly specified. It starts depending on implicit properties of the test data.

Suggestions & Example

Write tests which directly construct their own starting state.

Unit-Test Example: specifications-based setup

A concrete alternative is to write your unit-test in this way:

a setup phase that constructs the objects out of a concise specification (a compressed version of your test data),
a test phase that operates on the resulting domain objects and verifies its expectations,
an unwind phase where the domain objects are destroyed.

An example in JavaScript:

function test_thatNotesCanBeDeletedWithADoubleClick() {
    withMidiEditorOnNotes(
        // specification for this test's data:
        [
            { midiPitch: 64, startTime: 7.0 },
        ],
        function (midiEditor, midiNotes) {
            doubleClick(midiEditor, timeToX(7.0), midiToY(64));
            verify(midiNotes.isEmpty());
        }
    );
}
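The helper `withMidiEditorOnNotes` is not shown in the example above. One way it could be structured is sketched below, assuming simple in-memory stand-ins for the domain objects. `createMidiNotes`, `createMidiEditor`, `deleteNoteAt`, `dispose` and `isDisposed` are hypothetical names invented for this sketch; a real editor would react to GUI events such as the double click instead.

```javascript
// Hypothetical stand-in for the note collection of the domain model.
function createMidiNotes(noteSpecs) {
    // Copy the specification so the test's literal is never mutated.
    const notes = noteSpecs.map((spec) => ({ ...spec }));
    return {
        isEmpty: () => notes.length === 0,
        removeAt: (midiPitch, startTime) => {
            const index = notes.findIndex(
                (n) => n.midiPitch === midiPitch && n.startTime === startTime
            );
            if (index >= 0) {
                notes.splice(index, 1);
            }
        },
    };
}

// Hypothetical stand-in for the editor operating on those notes.
function createMidiEditor(midiNotes) {
    let disposed = false;
    return {
        deleteNoteAt: (midiPitch, startTime) =>
            midiNotes.removeAt(midiPitch, startTime),
        dispose: () => {
            disposed = true;
        },
        isDisposed: () => disposed,
    };
}

function withMidiEditorOnNotes(noteSpecs, testBody) {
    // Setup phase: build the domain objects from the concise specification.
    const midiNotes = createMidiNotes(noteSpecs);
    const midiEditor = createMidiEditor(midiNotes);
    try {
        // Test phase: the test operates on freshly built objects.
        testBody(midiEditor, midiNotes);
    } finally {
        // Unwind phase: runs even when the test body throws.
        midiEditor.dispose();
    }
}
```

Putting the unwind phase in a `finally` block means a failing test cannot leak its objects into the next one.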

Commentary on suggestion

For unit-tests this means constructing the smallest number of domain objects necessary for the system under test.

For acceptance tests this means dedicated setup code to move the product into a desired state via domain object manipulation. It is acceptable here to use dedicated shortcuts (using model operations) to bring the product efficiently into this state.

All in all, creating well-formed domain objects should not be an afterthought anyway. Types with good specifications and defaults that create well-formed values allow the creation of domain object values that can be used directly by tests.

This translates into domain objects that can be created anywhere (in C++: on the stack or on the heap) and that can live standalone, without being part of a complex network of other objects. In other words, these are properties of a modular code base.
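As an illustration of a type with defaults that create well-formed values, here is a minimal sketch of a standalone note factory. The name `makeMidiNote`, the `duration` field and the default values are assumptions made for this sketch; only `midiPitch` and `startTime` come from the example above. The pitch range 0..127 is the standard MIDI note range.

```javascript
// Hypothetical value type for a note; the default values are assumptions.
function makeMidiNote(overrides = {}) {
    const note = { midiPitch: 60, startTime: 0.0, duration: 1.0, ...overrides };

    // Reject ill-formed values at construction time, so that every
    // note a test gets hold of is well-formed.
    if (!Number.isInteger(note.midiPitch) || note.midiPitch < 0 || note.midiPitch > 127) {
        throw new RangeError("midiPitch must be an integer in 0..127");
    }
    if (!(note.startTime >= 0)) {
        throw new RangeError("startTime must be >= 0");
    }
    if (!(note.duration > 0)) {
        throw new RangeError("duration must be > 0");
    }

    // Frozen, standalone value: no network of collaborators required.
    return Object.freeze(note);
}

// A test only spells out the properties that matter to it.
const defaultNote = makeMidiNote();
const lateNote = makeMidiNote({ midiPitch: 64, startTime: 7.0 });
```

Because such a value needs no collaborators, a test can create it inline and state only the fields it depends on.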