During the development of automated tests, test data is sometimes
represented as blobs stored in central repositories. These blobs are
often shared across automated tests and help set them up. The
repositories can take the form of code (constructing a complete tree
of objects), files or even relational databases.
A shared repository of test data is often introduced because creating
and setting up test data is difficult or costly, both at development
and at execution time. Some reasons:
1. the domain objects and their collaborators are hard to construct,
fake or wire up in a test,
2. the domain itself is very complex, and test writers have to master
many aspects of it to create the correct test data at runtime,
3. the creation of these objects takes time to execute.
Check whether these reasons really apply to your software project. Is
2) inherent to your domain? Can 1) and 3) be remedied? Are they
perhaps themselves a result of applying this anti-pattern?
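When 1) and 3) stem from awkward construction code rather than from
the domain itself, a test data builder can often remedy them. Here is
a minimal sketch in JavaScript; the note shape and its defaults are
invented for illustration:

```javascript
// A minimal test data builder: each test states only the fields it
// cares about; everything else gets a well-formed default.
// (The note shape and its default values are made up for this sketch.)
function aNote(overrides) {
  const defaults = { midiPitch: 60, startTime: 0.0, duration: 1.0 };
  return Object.assign({}, defaults, overrides);
}

// Each test constructs exactly the data it needs, at the moment it
// needs it -- no shared repository involved.
const note = aNote({ midiPitch: 64 });
```

Because the builder runs in-process and produces plain values, it is
usually cheap enough to call in every test.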
Personal experience
I have worked on a project that had shared data in the form of a
centralized database against which every unit + acceptance test suite
would be run.
The database had been created at a certain date, then updated over
time by hand (handwritten SQL), by code, and through standard
database schema migrations.
When a test failed, it would be either because the behavior of the
code had changed (intentionally or not) or because the test data had
not been migrated. Finding out the nature of the test data was also
difficult. Did that person write this test against object O because
it was an object with a precise, intended setup, or because it
happened to have some property that the writer of the test liked?
These aspects were almost never documented. In effect, the test would
often not document the state it was constructed against.
The test data also kept growing, because modifying existing items
meant risking breaking tests that you had no idea how to fix.
Why is a blob of test-data a unit-test anti-pattern?
A good unit test is fast, precise, readable and isolated. It brings
confidence in the working state of the system under test.
Tests become hard to read, imprecise and poorly isolated
Unit tests written against a blob of test data tend to be hard to
read, poorly isolated and imprecise.
When a unit test refers to the entire blob, or even part of it, it
potentially depends on the entire tree rather than isolating only the
part of the system under test.
When the test cherry-picks one particular item of the test data blob,
the precise setup that the test is using is barely described. One must
read the data to find out what the test is actually doing.
When creating a new test, it is very tempting to just look around and
reuse a piece of data that someone else has written. This becomes a
liability if that item is later modified, and it couples the two
tests implicitly (i.e. their failures are correlated).
It also means the test in question can never really state what its
starting state is.
And even if one cherry-picks the correct data within the blob, in
practice each test ends up with its own test data inside it, which
means that the blob grows with the number of tests and never shrinks.
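The implicit coupling can be made concrete with a small sketch (the
data and test names here are invented for illustration):

```javascript
// Shared blob: both tests quietly depend on the same item.
const sharedTestData = {
  noteO: { midiPitch: 64, startTime: 7.0 },
};

// Test A was written because noteO starts at exactly 7.0...
function testA() {
  return sharedTestData.noteO.startTime === 7.0;
}

// ...while Test B merely borrowed noteO because it was already there.
function testB() {
  return sharedTestData.noteO.midiPitch === 64;
}

// Editing noteO for Test A's sake now risks breaking Test B as well:
// the two tests fail together, even though they test different things.
```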
Tests become hard to trust
Unit tests written against a blob of test data also tend to be harder
to trust.
In the long run, as the application changes, so must the test data.
When the test data is not correctly versioned or updated, it becomes
difficult to trust. Code-generated data is superior in this respect,
because it can at least be made to use the basic operations of the
data model, leading to well-formed test data; in practice, though,
the blob is always a bit of a mixture of static and generated data.
Tests are still slow
Finally, performance-wise: although these blobs are often introduced
to solve performance issues with test setup, if the test data is
mutable, every modification made to the blob must be rolled back to
keep each test isolated. This may undermine the expected performance
benefits of the shared data.
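The rollback cost can be sketched like this; a deep-copy snapshot
stands in for what would be a transaction rollback against a real
database:

```javascript
// To keep tests isolated over shared mutable data, every modification
// must be undone after each test. Here the "rollback" is a snapshot
// taken before the test and restored afterwards.
function withIsolatedData(sharedData, testBody) {
  const snapshot = JSON.parse(JSON.stringify(sharedData)); // pre-test state
  try {
    testBody(sharedData);
  } finally {
    // roll back: restore the shared blob to its pre-test state
    Object.keys(sharedData).forEach((key) => delete sharedData[key]);
    Object.assign(sharedData, snapshot);
  }
}

// A test that mutates the shared blob...
const shared = { notes: [{ midiPitch: 64, startTime: 7.0 }] };
withIsolatedData(shared, (data) => data.notes.pop());
// ...leaves it untouched afterwards, but only at the price of a
// copy-and-restore on every single test.
```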
It goes further: when the test data repository is actually a shared
resource such as a database, it becomes a point of contention under
heavy parallel testing, making the unit test suite run slowly.
Why is a blob of test-data an acceptance-test anti-pattern as well?
While a unit test tests a system, an acceptance test tests a product.
A good acceptance test embodies the specification of the product in
user terms.
When written against a blob of test data, an acceptance test becomes
poorly specified. It starts depending on implicit properties of the
test data.
Suggestions & Example
Write tests which directly construct their own starting state.
Unit-Test Example: specifications-based setup
A concrete alternative is to structure your unit test in this way:
- a setup phase that constructs the domain objects from a concise
specification (a compressed version of your test data),
- a test phase that operates on the resulting domain objects and
verifies expectations,
- an unwind phase in which the domain objects are destroyed.
An example in javascript:
function test_thatNotesCanBeDeletedWithADoubleClick() {
  withMidiEditorOnNotes(
    // specification for this test's data:
    [
      { midiPitch: 64, startTime: 7.0 },
    ],
    function (midiEditor, midiNotes) {
      doubleClick(midiEditor, timeToX(7.0), midiToY(64));
      verify(midiNotes.isEmpty());
    }
  );
}
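The helper carrying the three phases could be implemented along these
lines; this is a sketch, with plain objects standing in for the real
editor and note collection:

```javascript
// One possible shape for the withMidiEditorOnNotes helper:
// setup -> test body -> unwind, all in one place.
function withMidiEditorOnNotes(noteSpecs, testBody) {
  // setup phase: construct domain objects from the concise specification
  const notes = noteSpecs.map((spec) => ({ ...spec }));
  const midiNotes = {
    removeAt: (index) => notes.splice(index, 1),
    isEmpty: () => notes.length === 0,
  };
  const midiEditor = { notes: midiNotes };
  try {
    // test phase: hand the freshly built objects to the test body
    testBody(midiEditor, midiNotes);
  } finally {
    // unwind phase: destroy the objects so nothing leaks between tests
    notes.length = 0;
  }
}
```

Each test thereby starts from a state it fully states itself, and the
specification doubles as documentation of that state.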
Commentary on suggestion
For unit tests, this means constructing the smallest set of domain
objects necessary for the system under test.
For acceptance tests, this means dedicated setup code that moves the
product into a desired state via domain object manipulation. It is
acceptable here to use shortcuts (using model operations) to bring
the product efficiently into this state.
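Such an acceptance-test shortcut could look like the sketch below;
the `app` model layer is a stand-in invented for this example:

```javascript
// Bring the product into the desired starting state via model
// operations rather than by driving the user interface.
function givenAProjectWithNotes(app, noteSpecs) {
  const project = app.newProject(); // model operation, not a UI action
  noteSpecs.forEach((spec) => project.addNote(spec));
  return project;
}

// Stand-in model layer, just to make the sketch runnable:
const app = {
  newProject: () => {
    const notes = [];
    return {
      addNote: (note) => notes.push(note),
      noteCount: () => notes.length,
    };
  },
};

const project = givenAProjectWithNotes(app, [
  { midiPitch: 64, startTime: 7.0 },
]);
```

The test then exercises the product through the user interface, while
the setup stays fast and explicit.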
All in all, creating well-formed domain objects should not be an
afterthought anyway. Types with good specifications and defaults that
produce well-formed values allow the creation of domain objects that
can be used directly by tests.
This translates into domain objects that can be created anywhere (in
C++: on the stack or on the heap) and that can live standalone
without being part of a complex network of other objects, i.e.
properties of a modular code base.