Reliability Strategy
David Mohs
April 9, 2018
Updated May 7, 2018
The goal of an application's reliability strategy is to ensure that the application's users have the best possible experience at all times. Therefore:
- The system should be operational at all times that users desire to use the system.
- Only the production environment matters.
- When services our application depends on are down, the best experience our application can provide is very limited.
It is best if a bug is caught before reaching users. It is nearly as good if a bug that has reached users is caught and fixed quickly. If a particular feature breaks more than twice, it will likely cause users to lose confidence in the system.
Reliability must be considered within the context of the project's other priorities. A solution that hinders developer productivity slows down both feature development and bug fixes. This directly affects reliability and negatively affects user experience by providing a less-capable product.
We can consider two classes of tools to assist in creating a reliable system: tests and analytics. Tests help to ensure correctness proactively, allowing us to catch bugs before users see them. Analytics allow us to monitor the system for problems experienced by our users. While we prefer to catch bugs with tests, perfect testing is infeasible. Users routinely exercise a system in creative ways that developers cannot predict, so a strategy involving both tests and analytics is usually the best choice.
Testing is usually divided into unit tests and integration tests. Unit tests ensure the continuing correctness of a particular feature. Integration tests ensure the correctness of the system within the context of its operating environment. Because of the complexity of executing tests within the system's operational environment, unit tests are generally easier to write and faster to execute.
Any testing strategy must solve each of the following problems:
- creation of test data
- creation of the component (i.e., unit) within the test environment
- assertions about the result of execution
The following section discusses the Saturn team's strategy for each case.
The Saturn team unit-tests React components using a JavaScript testing framework called Jest. The developer writing the test will usually create test data by interacting with the API, capturing the response, and manually porting that response into data structures used to recreate it in the test environment. This is quite high-touch. A preferable strategy might be to generate this data from automated API interactions that run on a regular basis.
Component creation is handled by tools that allow us to render React components outside of a normal browser environment. We assert only that the rendered output matches previous renders, known as snapshot testing. This allows us to ensure that a component doesn't change unexpectedly (i.e., we catch regressions) without requiring the developer to manually assert the rendered output. Since the snapshots are committed to our Git repository, changes to a component's rendered output can be tracked over time.
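As a rough illustration of the snapshot idea, the sketch below compares a rendered string against a stored copy. It is not Jest's implementation: `renderGreeting`, `matchSnapshot`, and the in-memory `snapshots` map are hypothetical names chosen for this example, and a plain string stands in for a rendered React component.

```javascript
// Hypothetical stand-in for rendering a React component to a string.
function renderGreeting(props) {
  return `<div class="greeting">Hello, ${props.name}!</div>`;
}

// Compare output against the stored snapshot, writing it on the first run.
// Real tools persist snapshots to files committed to the repository;
// an in-memory Map keeps this sketch self-contained.
function matchSnapshot(snapshots, name, output) {
  if (!snapshots.has(name)) {
    snapshots.set(name, output);
    return { status: 'written' };
  }
  const expected = snapshots.get(name);
  return expected === output
    ? { status: 'match' }
    : { status: 'mismatch', expected, actual: output };
}

const snapshots = new Map();
const first = matchSnapshot(snapshots, 'greeting', renderGreeting({ name: 'Ada' }));
const second = matchSnapshot(snapshots, 'greeting', renderGreeting({ name: 'Ada' }));
const third = matchSnapshot(snapshots, 'greeting', renderGreeting({ name: 'Grace' }));
console.log(first.status, second.status, third.status); // prints: written match mismatch
```

The mismatch result carries both the expected and the actual output, so the developer can decide whether the change is a regression or an intended update to the snapshot.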
Besides the rendered output, it would also be desirable to test the logic that causes changes to state within a component (and, thus, to the rendered output). We intend to test this logic as needed using traditional unit testing methods within the same framework.
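One way such state logic might be tested is to extract it into a pure transition function and drive it with a mocked ajax call. The sketch below uses plain Node rather than Jest so that it stays self-contained. The event names, the `/api/workspaces` path, and every function name here are assumptions made for illustration, not part of the Saturn codebase.

```javascript
// Hypothetical component logic extracted as a pure state-transition function.
function nextState(state, event) {
  switch (event.type) {
    case 'LOAD_STARTED':
      return { ...state, loading: true, error: null };
    case 'LOAD_SUCCEEDED':
      return { loading: false, error: null, workspaces: event.workspaces };
    case 'LOAD_FAILED':
      return { ...state, loading: false, error: event.message };
    default:
      return state;
  }
}

// The ajax function is injected so that a test can supply a mock.
// Returns the sequence of states the component would pass through.
async function loadWorkspaces(ajax, state) {
  const states = [nextState(state, { type: 'LOAD_STARTED' })];
  try {
    const workspaces = await ajax('/api/workspaces'); // hypothetical endpoint
    states.push(nextState(states[0], { type: 'LOAD_SUCCEEDED', workspaces }));
  } catch (err) {
    states.push(nextState(states[0], { type: 'LOAD_FAILED', message: err.message }));
  }
  return states;
}

const initialState = { loading: false, error: null, workspaces: [] };
const mockAjaxOk = async () => ['ws-1'];
const mockAjaxFail = async () => { throw new Error('service unavailable'); };

const okRun = loadWorkspaces(mockAjaxOk, initialState);
const failRun = loadWorkspaces(mockAjaxFail, initialState);
okRun.then((states) => console.log(states[1]));
failRun.then((states) => console.log(states[1]));
```

Because the transitions are pure, each state in the returned sequence can be asserted on directly, and the same states can feed a snapshot of the rendered output.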
The APIs we currently access are relatively stable. We do not test these interactions on a regular basis. Instead, they are tested in production manually by our own team without a defined frequency. We will revisit this strategy as our needs change.
We use New Relic to gather data on the health of the platform. New Relic supports custom events, which let us track specific user activity, create funnels, etc. A member of the team reviews these reports daily.
This strategy gives the Saturn development team a reasonably high level of confidence that changes to the code will not cause regressions while imposing a minimal burden on each team member's productivity. Test data creation remains ad hoc, and the team will periodically evaluate the need to formalize or automate this process. Similarly, the team will use analytics to catch changes to dependent services that break the application in production, and will use what it finds to determine whether integration testing is warranted.
In my past experience, two projects I worked on had excellent reliability. The first was On the Fly, an iOS app for finding the lowest fare on a flight. On the Fly was manually tested by the development team prior to each release by running through some typical scenarios. In between releases, the code was engineered so that most bugs could be caught by inspection during code review.
The other project was Lollipuff, an e-commerce site. Features were tested manually before release. Regressions were caught by the team using analytics: if the historic usage of any feature of the site appeared to have dropped, the team investigated to see if the feature had been broken or degraded. Our analytics were good enough to detect bugs that only affected, for example, Internet Explorer users.
"Test data" is useful outside the context of tests. When developing any feature or fixing a bug, a developer must put the software in various states. This usually requires generating data in the system, which can be done by driving the real system in a particular way or by creating fake data and mocking the necessary interactions.
Mocking data and interactions can be hazardous. It is quite easy to create mock data and interactions in such a way as to nearly duplicate the functionality and complexity of the external system, thus substantially decreasing productivity. Additionally, mock data hides changes to the unmocked data or service, which can cause surprises during integration. Any mocking strategy would benefit from taking this into account by, for example, validating some portion of the mocked data with periodic integration tests.
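One lightweight form of this validation is to compare the structure of a mocked payload with the structure of a response captured from the real service. The sketch below assumes JSON-like data. `shapeOf`, `sameShape`, and the workspace objects are hypothetical, and the "live" responses are hard-coded here where a periodic integration test would fetch them.

```javascript
// Reduce a JSON-like value to a description of its structure.
function shapeOf(value) {
  if (value === null) return 'null';
  if (Array.isArray(value)) {
    // Describe an array by the shape of its first element, if it has one.
    return value.length > 0 ? [shapeOf(value[0])] : [];
  }
  if (typeof value === 'object') {
    const shape = {};
    // Sort keys so that key order does not affect the comparison.
    for (const key of Object.keys(value).sort()) {
      shape[key] = shapeOf(value[key]);
    }
    return shape;
  }
  return typeof value;
}

// True when the mock and the live response have the same structure.
function sameShape(mock, live) {
  return JSON.stringify(shapeOf(mock)) === JSON.stringify(shapeOf(live));
}

const mockWorkspace = { name: 'demo', owners: ['a@example.com'], public: false };
const liveWorkspace = { public: true, name: 'real', owners: ['b@example.com'] };
const driftedWorkspace = { name: 'real', owners: ['b@example.com'], isPublic: true };

console.log(sameShape(mockWorkspace, liveWorkspace));    // true
console.log(sameShape(mockWorkspace, driftedWorkspace)); // false
```

A check like this only detects structural drift, such as a renamed field. It says nothing about whether the mocked values are realistic.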
Questions To Answer:
- How do we ensure that code pushed to production is unlikely to be broken in a significant way?
- How does a developer create the data they need to start work on a bug or feature?
- How do we learn that a user is having difficulty with the application?
- How do we learn that the production application is broken?
Proposal for Analytics:
- For all users who visit the site, some percentage should sign in.
- For all users who sign in, some percentage should create a workspace.
- No JavaScript errors should ever be triggered.
- etc.
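These expectations could be expressed as checks over event counts. The following is a sketch only: the event names, the counts, and the minimum conversion rates are invented for illustration, and it does not use New Relic's API.

```javascript
// Hypothetical daily event counts; in practice these would come from the analytics tool.
const dailyCounts = { visit: 1000, signIn: 400, createWorkspace: 100, jsError: 3 };

// Each funnel step names the event it starts from, the event it leads to,
// and the minimum acceptable conversion rate (assumed values).
const funnelSteps = [
  { from: 'visit', to: 'signIn', minRate: 0.3 },
  { from: 'signIn', to: 'createWorkspace', minRate: 0.5 },
];

// Return a list of human-readable problems found in the counts.
function checkFunnel(counts, steps) {
  const problems = [];
  for (const { from, to, minRate } of steps) {
    // A step with no starting events is treated as a 0% conversion.
    const rate = counts[from] > 0 ? counts[to] / counts[from] : 0;
    if (rate < minRate) {
      problems.push(
        `${from} -> ${to}: ${(rate * 100).toFixed(1)}% is below ${(minRate * 100).toFixed(1)}%`
      );
    }
  }
  if (counts.jsError > 0) {
    problems.push(`${counts.jsError} JavaScript error(s) reported`);
  }
  return problems;
}

const funnelProblems = checkFunnel(dailyCounts, funnelSteps);
console.log(funnelProblems);
```

With the sample counts, the sign-in step passes (40% against a 30% minimum), while the workspace-creation step (25% against 50%) and the JavaScript error count are reported.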
Proposal for Developer Data (a.k.a. Test Data):
- A developer must become familiar with the API calls/external dependencies for their feature.
- The development environment will allow record and replay for all external dependencies.
- Exceptionally difficult interactions may be saved for later.
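Record and replay could take roughly the following form: a wrapper calls the real dependency the first time it sees a request and returns the saved response afterwards. This is a minimal sketch under stated assumptions. `recordReplay`, `fetchWorkspace`, and the `tape` map are hypothetical names, and the "external" call is simulated locally.

```javascript
// Wrap an async function so that each distinct call is recorded once and
// replayed afterwards. `tape` is a Map that could be serialized to disk so
// that recordings survive between development sessions.
function recordReplay(realCall, tape = new Map()) {
  return async function (...args) {
    const key = JSON.stringify(args);
    if (tape.has(key)) {
      return tape.get(key); // replay: the dependency is not contacted
    }
    const response = await realCall(...args);
    tape.set(key, response); // record for later replays
    return response;
  };
}

// Hypothetical external dependency; the counter shows how often it is really called.
let realCallCount = 0;
async function fetchWorkspace(name) {
  realCallCount += 1;
  return { name, createdBy: 'someone@example.com' };
}

const tape = new Map();
const cachedFetchWorkspace = recordReplay(fetchWorkspace, tape);

async function demoRecordReplay() {
  const recorded = await cachedFetchWorkspace('demo'); // reaches the dependency
  const replayed = await cachedFetchWorkspace('demo'); // served from the tape
  return { recorded, replayed, realCallCount };
}

const demoResult = demoRecordReplay();
demoResult.then((result) => console.log(result.realCallCount)); // prints 1
```

An interaction that is exceptionally difficult to reproduce could be kept by saving the contents of `tape` alongside the code.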
React Component Unit Testing:
I thought this article was very good at describing how to write tests for React components. My judgement after reading it was that it may not be worth writing these kinds of tests.
Alternatively, it might be exactly the right approach to solve the Developer Data problem by way of test-driven development. I could envision:
- Mock ajax calls to test all state transitions.
- Regression test the component by capturing its rendered output for each relevant state.
My main concern when asserting against rendered output is that, as mentioned in the article, one ends up writing code that looks a lot like the render method. Instead, I'd prefer to see the output captured and diffed against a known-good version. The diff can be inspected to ensure it looks like what the developer expected given the changes they made.
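To make the capture-and-diff idea concrete, here is a small line-based diff built on a longest-common-subsequence table. It is an illustrative sketch, not the diff that any particular testing tool produces. `diffLines` and the sample markup are hypothetical.

```javascript
// Produce a line-by-line diff. Lines prefixed with "- " exist only in the
// known-good output, "+ " only in the new output, and "  " in both.
function diffLines(knownGood, current) {
  const a = knownGood.split('\n');
  const b = current.split('\n');
  // lcs[i][j] holds the length of the longest common subsequence of a[i..] and b[j..].
  const lcs = Array.from({ length: a.length + 1 }, () => new Array(b.length + 1).fill(0));
  for (let i = a.length - 1; i >= 0; i--) {
    for (let j = b.length - 1; j >= 0; j--) {
      lcs[i][j] = a[i] === b[j]
        ? lcs[i + 1][j + 1] + 1
        : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
    }
  }
  // Walk both inputs, following the table to keep as many common lines as possible.
  const out = [];
  let i = 0;
  let j = 0;
  while (i < a.length && j < b.length) {
    if (a[i] === b[j]) {
      out.push(`  ${a[i]}`);
      i++;
      j++;
    } else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
      out.push(`- ${a[i]}`);
      i++;
    } else {
      out.push(`+ ${b[j]}`);
      j++;
    }
  }
  while (i < a.length) out.push(`- ${a[i++]}`);
  while (j < b.length) out.push(`+ ${b[j++]}`);
  return out;
}

const knownGoodRender = '<ul>\n  <li>Workspaces</li>\n  <li>Sign out</li>\n</ul>';
const currentRender = '<ul>\n  <li>Workspaces</li>\n  <li>Profile</li>\n  <li>Sign out</li>\n</ul>';
console.log(diffLines(knownGoodRender, currentRender).join('\n'));
```

For the sample input, the only marked line is the added `Profile` item, which is what a developer who had just added that menu entry would expect to see.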