I'm a big advocate of unit testing and typically aim for 100% code coverage on anything that's going to see real use.
Aside from improving stability, reliability, and maintainability, the test-writing process forces me to ask some hard-hitting questions.
Does this belong here? Am I doing the same thing elsewhere? Is this consistent with the rest of the code?
- Me, while testing
It's kind of like a code pre-review. It improves the codebase and my understanding of it.
Unit Testing Woes
You're probably thinking, "Great! Sounds like this software quality thing is just about licked." Unfortunately, such is not the case.
After writing many thousands of unit tests and reaching 100% code coverage across many codebases, the same problems keep cropping up.
The lines of code for unit tests typically outnumber those belonging to the module under test. Here is an example from the Jest documentation.
That's about as simple as it gets. The corresponding test is about the same size yet far from comprehensive.
Unit tests tend to get obliterated in refactoring. That makes sense: when you refactor, you're changing the units. In theory, many test cases should carry over into the new tests. However, in practice, it's often easier to rewrite them.
Another problem arises specifically when testing with code coverage.
You may also find yourself testing implementation details just so you can make sure you get that one line of code that's hard to reproduce in a test environment. You really want to avoid testing implementation details because it doesn't give you very much confidence that your application is working and it slows you down when refactoring.
- Kent C. Dodds
I once read that software is either in one of two states: constant refactoring or dying. Neither bodes particularly well for unit tests.
Designing for Test
Writing code that is readily testable takes more time and effort than just getting something working. When you're prototyping, designing for test may not make sense and returning to make your code testable later can be expensive.
We tend to treat test code as second-class and don't hold it to the same quality standard as production code. For example, I normally aim to keep my functions short and files under 200 lines whenever possible yet flagrantly ignore these guidelines when testing. Why? Test code is still code.
I could live with all of the above if no bugs ever made it to production. Many are caught, some aren't, and some seem embarrassingly obvious.
These issues combined seem like a high price to pay for a partial solution. After accepting that no test suite could prove my program's correctness, I started thinking about a more pragmatic approach to testing and some of the qualities of an ideal test suite.
Optimist's Rendition of Ideal Test Suite:
Catches bugs and improves understanding, both initially when written and continuously as changes are made.
Quick to run, the faster the better. If you're tempted to do something else while your tests run, you start to pay context switching costs.
Quick to write. We prefer to spend our time changing the world with production code after all.
Short. Software bugs are strongly correlated with lines of code and there's really no reason to expect test code to be different.
High quality. Tests should be held to the same quality standard as the rest of our code. We want to be able to tell at a glance what's being tested and how.
Decoupled. Tests that don't depend on code being written in a particular way are more resilient to changes and let you focus on expressing intent.
Reliable. No false positives or negatives. It's hard to think of anything more frustrating for a coder than debugging an intermittent test failure that doesn't affect users.
Targeted. Introducing a bug causes a single test to fail, clearly indicating what the problem is.
What about E2E?
So the brute force unit testing approach isn't enough. What about end-to-end testing?
Anyone who's done any UI automation knows that special brand of frustration known as the flaky test. In 2014, I was consulting on a MeteorJS project that had hundreds of E2E tests using Selenium and PhantomJS. Here's the git log as I remember it.
d8eb580 fix flaky test
f744d31 fix flaky test
f88dd2d fix flaky test
0c27836 fix flaky test
b6f8bfa fix flaky test
3e42121 fix flaky test
5e414d7 game changing feature
Reliability is a major concern with E2E testing; investigating false positives can get costly.
Difficulty isolating failures is another concern. When a test does fail, how easy is it to identify the reason?
A final concern is speed. Bringing up browsers and loading apps inevitably takes time.
Fortunately, native browser support for UI automation and modern tools like Playwright (Trace Viewer FTW 👏) have come a long way toward improving reliability and identifying failures.
So what can we do about speed?
Zero to One (hundred percent coverage)
A thing we can do is reduce the number of tests. All the way down. Like, one test.
Here's a test exercising all of the front-end code for VenueTube showing real-time code coverage.
Creating the test was relatively painless with Playwright, a bit of hacking, and the following process.
1. Run the test with a custom test runner (see below). Initially, just navigate to the home page.
2. Start the Playwright Inspector (Code Generator) by appending page.pause() to the end of the test. The test runner shown below does so automatically.
3. View code coverage. This is the hardest part of the process; the easiest way is to write the coverage output to disk and generate HTML reports from it.
4. Look for uncovered code that can be easily exercised from the current state. Start recording with the Inspector and perform the necessary actions manually.
5. Copy the recorded steps into the test. Give complex locators clear variable names and look for reuse opportunities.
6. Run again, ensuring the new steps work and cover the targeted lines.
7. If coverage is still under 100%, repeat the cycle until everything is covered.
Here's a simplified version of the custom test runner used in the above process.
The hacking part has to do with aggregating the test coverage and dealing with sourcemaps. It's omitted as it could easily fill another post.
None of these should surprise experienced E2E testers, but if you spend most of your time on unit tests, you'll have some new challenges to consider.
Because VenueTube uses passwordless authentication, signing in took a bit of work. I used the nodemailer package to create an ethereal.email account that's API accessible, then used another package to retrieve the sign-in email and mailparser to extract the sign-in link.
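Roughly, the moving parts look like this (a sketch with invented function names; nodemailer's createTestAccount() and mailparser's simpleParser() are real APIs, the rest is glue):

```javascript
// Pull the first http(s) link out of the email body; assumes the
// sign-in email contains exactly one such link.
function extractSigninLink(html) {
  const match = /https?:\/\/[^\s"'<>]+/.exec(html);
  return match ? match[0] : null;
}

// Creates a throwaway ethereal.email inbox reachable over IMAP/POP3.
async function createInbox() {
  const nodemailer = require('nodemailer'); // lazy: makes a network call
  return nodemailer.createTestAccount();
}

// Parse a raw RFC 822 message and dig out the sign-in link.
async function signinLinkFromRawEmail(raw) {
  const { simpleParser } = require('mailparser');
  const mail = await simpleParser(raw);
  return extractSigninLink(mail.html || mail.text || '');
}

module.exports = { extractSigninLink, createInbox, signinLinkFromRawEmail };
```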
Firebase Authentication is one of the external APIs that the application sends requests to. Playwright provides a way to intercept requests and provide your own response. I didn't because:
- I don't intend to run the test in CI (yet)
- Hitting the real thing is low consequence
- It's more realistic this way
I did have to simulate a webhook to complete one workflow.
Exercising error handlers took some understanding of failure modes and a bit of tampering with reality. For example, to test the Sentry integration, I injected a script into the page to throw an exception (obviously none of my actual code would 😉).
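One way to do that kind of tampering (my sketch, not the post's code): evaluate a script that throws from a setTimeout callback, so the error escapes evaluate's own promise and surfaces as an uncaught exception that window.onerror, and therefore Sentry, will see.

```javascript
// Hypothetical helper: make the page raise a genuinely uncaught error.
async function injectError(page) {
  await page.evaluate(() => {
    // Throwing directly inside evaluate() would just reject the evaluate
    // promise back in Node; throwing from a timer callback instead reaches
    // window.onerror, where Sentry's global handler is listening.
    setTimeout(() => { throw new Error('synthetic uncaught error'); }, 0);
  });
}

module.exports = { injectError };
```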
Here's what the test looked like early on in the writing process.
So how does this solution stack up against our ideal test suite?
In the course of writing the test, I caught a number of bugs, found dead code, and identified some really bad UX, none of which I would have found with unit testing. That said, there are almost certainly still bugs that unit tests would catch.
Whether the test catches regressions as features are added remains to be seen.
It takes about 90s which is close to the upper limit of what I'm likely to tolerate. We'll discuss opportunities to improve in a bit.
Correcting for one-time costs (Playwright learning curve and writing the test runner), it took about a day to actually write the test.
It's short. Around 250 lines.
Because the test was so short, I wasn't too concerned about quality. I did assign non-obvious element selectors to clearly named variables and group common sequences into reusable functions.
I didn't change any application code for the sake of the test although there were a few places where a different structure would be mutually beneficial. This indicates that the test should be resilient to application refactoring.
There is definitely still some flakiness but Playwright's Trace Viewer really simplifies finding the cause (typically poor selector choice).
If a Playwright command fails, it prints a debugging message and I explicitly log each step. However, if the failure is a real problem, it could take some digging through the app code to find.
The main motivator for this project was to get the greatest amount of test coverage in the shortest amount of time. While that was a success, there were also some unexpected benefits.
VenueTube started as a pipedream script and evolved from there based on user feedback. There was never a formal plan or even workflow documentation.
Finding and eliminating uncovered code was an exercise in understanding workflows and a good opportunity to retroactively create a requirements document. Even without a separate document, a well written test can function as a passable specification.
The same is true of unit tests but at a lower level. An E2E test provides value to a wider subset of stakeholders.
Writing the test also helped catch cumbersome UX. Some workflows that seemed intuitive when written were challenging to test. In several cases those turned out to be real usability issues.
For example, I noticed some vestigial code that could only be exercised by signing in and navigating to a specific URL. I needed to remove the (effectively dead) code or find a way to make it more accessible.
User and Developer Experience Alignment
It's possible to have blazing fast unit tests and horrible application performance. That's not the case with E2E testing.
Testing the entire app makes performance a pain point during development. Improvements benefit developers and end-users alike.
VenueTube is a simple monolithic app with only a few external API integrations. I'm able to get through all existing workflows in about 90 seconds. Your mileage may vary.
Test Suite Adequacy
Is code coverage even a good measure of test suite adequacy? After all, this test yields 100% coverage for our sum function.
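Reconstructing that test (the same shape as the earlier Jest docs example, with the assertion left out on purpose):

```javascript
// The sum function from earlier, plus a "test" that merely calls it.
const sum = (a, b) => a + b;

function testSum() {
  sum(1, 2); // every line of sum executes: 100% coverage, zero assertions
}
testSum();
```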
The only information this test gives us is that calling sum doesn't throw an exception. Similarly, I didn't add many explicit assertions of my own. Instead, the test relies on the implicit assertions Playwright makes in order to proceed through the steps. There is nothing stopping me from adding assertions, but they won't improve code coverage.
Many forms of test adequacy criteria have been tried, and will be tried in this world of faulty software. No one pretends that code coverage is perfect or all-wise. Indeed, it has been said that code coverage is the worst form of test adequacy criteria except for all those other forms that have been tried from time to time...
- Winston Churchill, probably
Code coverage isn't perfect but I haven't found anything better. One non-obvious benefit is that there are two ways to increase it: add more tests or reduce the source lines that need testing. Targeting the latter can lead to a cleaner codebase.
The 100% mark is controversial and can be difficult to attain especially if you don't design for test from the start. I justify it with these two arguments:
- Any other number feels (more) arbitrary
- Ensuring that every line of code executes seems like a low bar
Since V8 actually returns coverage count per line, I'd even be inclined to go beyond 100% but need to give more thought to how that would work.
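As a sketch of what "beyond 100%" could mean (my interpretation, not something the post specifies): with per-range execution counts, code that is covered but only barely, say, executed exactly once, could be flagged for additional exercise.

```javascript
// Hypothetical helper over raw V8 coverage entries: list ranges whose
// count is exactly 1, i.e. code the test touched but never revisited.
function barelyCovered(entries) {
  const hits = [];
  for (const entry of entries) {
    for (const fn of entry.functions) {
      for (const range of fn.ranges) {
        if (range.count === 1) {
          hits.push({ url: entry.url, start: range.startOffset, end: range.endOffset });
        }
      }
    }
  }
  return hits;
}

module.exports = { barelyCovered };
```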
I'm certainly not advocating the use of a single test to provide all of your quality assurance. That would be crazy... right?
Here are some of the next steps I'll take to shore up my QA program.
Splitting the test into independent workflows could improve isolation and reduce runtime through parallelization.
Seeding the database could speed up certain workflows and reduce flakiness. In one case, my test started failing because randomly generated content contained the word "Holler" which is used in an element selector.
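The fix I'd reach for (illustrative, with invented selectors): scope the locator to the control's role and accessible name rather than bare page text, so user-generated content can't match.

```javascript
// Hypothetical step from the test, shown both ways.
async function clickHoller(page) {
  // Fragile: 'text=Holler' matches any element containing "Holler",
  // including randomly generated user content.
  // await page.click('text=Holler');

  // Sturdier: only a button whose accessible name is exactly "Holler".
  await page.getByRole('button', { name: 'Holler', exact: true }).click();
}

module.exports = { clickHoller };
```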
Complementing E2E tests with unit tests focused on especially complex or sensitive operations. Many low-level functions that are only lightly exercised here would make great unit testing candidates.
If you are early on in development of your app, I'd recommend trying out uni-testing™️. It's a low-barrier way to focus on UX and catch potentially embarrassing issues before your users do.
Once you get some traction, decompose the test suite into a proper test pyramid including integration and unit tests.
In the meantime, tell anyone who'll listen that your application has 100% code coverage.