
Introduction

Whenever we want to add a new feature to the LMS or otherwise change the website, we should split test it before rolling out a full implementation to all of our users. Also called A/B testing, split testing can tell us what effect a new feature will have on our core metrics before we deploy it fully. In this way, we can avoid rolling out a new feature that fails to provide a significant boost to our metrics, or, worse, actively hurts them.

For example, split testing would have been useful before we deployed inline score displays for each problem. Do users who are shown their problem scores within the courseware leave the courseware as often as users who are not? Is the progress page viewed less frequently by students who are shown their scores inline? We asked these questions before implementing the inline score display feature, but the only way we could answer them was to implement the feature and observe its impact. Split testing would have allowed us to answer these questions before rolling out the feature to our entire user base.

In short, having the ability to perform split testing of new features before their rollout allows us to make smarter decisions.

Requirements

At the most basic level, we need a way to do the following:

  1. Divide our users into groups
  2. Control which Django views or templates are exposed to these groups
  3. Monitor and compare the behavior of these groups of users

We also want our method of split testing to be flexible. That is, we want to be able to apply it across the platform to changes ranging from cosmetic alterations to entirely new features. In addition, we would prefer an approach which does not require a large volume of new code. Even better, we would like an approach which does not require pushing code to turn features on or off.

Implementation Details

There are several open-source A/B testing frameworks for Django. These include Django-lean, Django-experiments, Django-AB, Splango, Django-mini-lean, and Leaner. However, after doing some research, we decided to stay away from these. The majority are not compatible with Django 1.4, are no longer maintained, and are either poorly documented or entirely undocumented. What is more, in an attempt to abstract the process of split testing, these frameworks are often too rigid for our broad variety of potential use cases. Most are designed with the express purpose of tracking user signups or purchases. A more flexible and lightweight approach is to use a Django feature flipper in conjunction with Mixpanel.

Overview of Waffle

Waffle is a clean and simple feature flipper for Django. See the slides from this talk for more information. Using Waffle requires the creation of three new database tables, used to keep track of Waffle's Flags, Switches, and Samples. Note that Waffle does not keep tables of users belonging to different testing groups: flags are tracked per user with cookies, switches are named booleans stored in the database, and samples are named percentages stored in the database.
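For orientation, here's a minimal sketch of how the three primitives are checked from Python code, assuming django-waffle's standard helpers; the flag, switch, and sample names are made up:

```python
# A rough sketch of Waffle's three primitives; all of the names here are hypothetical.
from django.http import HttpResponse
from waffle import flag_is_active, switch_is_active, sample_is_active

def example_view(request):
    flag_on = flag_is_active(request, 'my-experiment')    # per-user, cookie-backed
    switch_on = switch_is_active('maintenance-banner')    # named boolean, same for everyone
    sampled = sample_is_active('verbose-logging')         # active for a configured percentage of checks
    return HttpResponse("flag=%s switch=%s sample=%s" % (flag_on, switch_on, sampled))
```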

We will use Waffle's flags in Django views and Mako templates to toggle features, cosmetic and otherwise. Flags are maintained by using cookies. This means more savvy users could bypass them by deleting or editing their cookies. As such, we don't want to use flags for features we want to force users into, but they work well for the purposes of split testing: they can be assigned to all users, a group of users, or an arbitrary percentage of users. Given appropriate permissions to the Flags model, flags can be created, toggled, and deleted from the Django admin site without involving DevOps. Changes made to flags go into effect immediately without the need to push any code.
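As a sketch of the pattern (the flag name, view, and template path are invented, and a real LMS view would use the platform's Mako render helper rather than django.shortcuts), a view can check the flag and hand a boolean to its template:

```python
# Sketch only: gate a cosmetic change on a flag and let the Mako template branch on it.
# The flag name, view, and template path are hypothetical.
from django.shortcuts import render
from waffle import flag_is_active

def course_tabs(request, course_id):
    show_treatment = flag_is_active(request, 'courseware-tab-treatment')
    # The Mako template can then branch on the boolean, e.g.:
    #   % if show_courseware_treatment:
    #       ... experimental markup ...
    #   % endif
    return render(request, 'courseware/tabs.html', {
        'course_id': course_id,
        'show_courseware_treatment': show_treatment,
    })
```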

We will keep track of which users are associated with which Waffle flags by adding a property to Segment.io tracking calls containing a string indicating which flags are active for the user. Segment.io will route this data to Mixpanel where we can segment the data, allowing us to compare the behavior of users in different flag groups.
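A hedged sketch of what this could look like using Segment.io's analytics-python library; the active_flags helper, the event name, and the wiring are assumptions for illustration, not the platform's actual tracking code, though "Active Flags" is the property name referenced in the Analysis section below:

```python
# Illustration only, not the platform's actual tracking code: attach the names of the
# Waffle flags active for this request as a property on a Segment.io track call.
import analytics                      # Segment.io's analytics-python library
from waffle import flag_is_active
from waffle.models import Flag

def active_flags(request):
    """Hypothetical helper: comma-separated names of flags active for this request."""
    return ', '.join(
        flag.name for flag in Flag.objects.all() if flag_is_active(request, flag.name)
    )

def track_courseware_view(request, user_id):
    # 'Viewed Courseware' is a made-up event name; 'Active Flags' is the property we
    # segment on in Mixpanel.
    analytics.track(user_id, 'Viewed Courseware', {'Active Flags': active_flags(request)})
```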

Waffle's Cookie Use

As mentioned above, Waffle's flags make use of cookies to determine what a particular user should be shown. Cookies generated by Waffle have a default lifetime (WAFFLE_MAX_AGE) of one month. If Waffle is instructed to fall back to a percentage of users and sees that one of its cookies is set, that cookie's value is used and then re-set to the same value. Since a cookie's lifetime is also re-set on every request that uses it, WAFFLE_MAX_AGE does not need to be set very high; it just needs to be high enough that a typical returning user won't flip back and forth between different groups. In light of this, I've changed the lifetime of Waffle's cookies to two weeks. If a user does not return within two weeks, their cookie will expire and the value of the flag will be looked up in the database again on their next visit. It also means users won't have cookies from us sitting in their browsers for months. If two weeks is not long enough, we can easily extend the cookie lifetime.

Cookies corresponding to Waffle flags share a common name prefix. By default, a flag's cookie is named dwf_[flag name]; for clarity, I've changed this prefix to waffle_flag_[flag name]. To temporarily deactivate experiments, set cookies whose names start with "waffle_flag_" to False. The easiest way to do this is with a browser extension such as Edit this Cookie for Chrome or Cookies Manager+ for Firefox.
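In Django settings, these two changes look roughly like the following (WAFFLE_MAX_AGE is given in seconds, and WAFFLE_COOKIE is the cookie name pattern):

```python
# Waffle cookie settings (in Django settings; WAFFLE_MAX_AGE is in seconds).
WAFFLE_MAX_AGE = 1209600            # two weeks, down from the one-month default (2592000)
WAFFLE_COOKIE = 'waffle_flag_%s'    # cookie name pattern; the default is 'dwf_%s'
```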

How To Perform Split Testing on the edX Platform

The following is a step-by-step guide to performing split testing on the LMS. Note that the following method requires a Mixpanel account. For help using Waffle in particular, refer to its documentation.

Design

Decide on a particular facet of the platform to test. This could be a UI change or a new feature; smaller is better. For example, I chose to experiment with the high-level LMS course tabs. Find a goal or metric which will help to show that one version of your change is better than another version. For my example, this metric was the number of students who visit the courseware.

Once you've done this, plan the different experiments you want to run. Try to limit each experiment to one or two changes; simpler is better. We want to isolate variables so that we know which changes are most likely responsible for the data we collect. You can run as many experiments as you want at the same time. In my example, I tried applying a visual treatment to the "Courseware" tab, and as a separate test, merging the "Courseware" and "Course Info" tabs into a new "Course Content" tab.

Consider using a split testing significance calculator such as KISSmetrics' to figure out what you need to see in order to know if your split test is statistically significant.

Execution

Implement your experiments. Hide the code required to run each experiment behind a conditional dependent on an appropriately named flag being active. Name your flags well; try to use underscores or hyphens instead of spaces. Instructions for interacting with Waffle's flags from within your code can be found here.
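For example, the experiment that merges two tabs might be guarded like this (a sketch; the flag name 'merge_course_tabs' and the returned labels are purely illustrative):

```python
# Sketch: one experiment hidden behind a conditional on a well-named flag.
# The flag name and tab labels are illustrative only.
from waffle import flag_is_active

def get_course_tab_labels(request):
    if flag_is_active(request, 'merge_course_tabs'):
        # Experimental variant: "Courseware" and "Course Info" merged into one tab.
        return ['Course Content', 'Progress', 'Discussion']
    # Control: the existing tab layout.
    return ['Courseware', 'Course Info', 'Progress', 'Discussion']
```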

Once you've done this, verify your experiments locally and then on stage by using the Django admin site to create your flags and clicking through your experiments. Access to the Django admin site on stage and production requires that you be added to the "Split Testers" permissions group, which grants access to the model governing Waffle flags in the database; if you're running locally, you can grant this permission to yourself. Access to the flags model lets you manage flags from the admin site.

When creating a flag, you can turn it on for everyone or a certain percentage of users, or turn it off for everyone. Note that flags which haven't been created yet evaluate as being turned off. Once your code is running on production, you need to again create Waffle flags for the experiments you want to run, since your local environment, stage, and production all use different databases.
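For reference, the admin form is editing fields on Waffle's Flag model; creating the same kind of percentage flag from a Django shell would look roughly like this (the flag name is made up):

```python
# Roughly what the admin form saves; the flag name is hypothetical.
from waffle.models import Flag

# Show the experiment to roughly half of users; assignments stick via the flag's cookie.
Flag.objects.create(name='merge_course_tabs', percent=50)

# Later, force the flag off (or on) for everyone by setting the 'everyone' field.
Flag.objects.filter(name='merge_course_tabs').update(everyone=False)
```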

Analysis

Now comes the fun part. Log in to Mixpanel and select the appropriate project. From here, you can segment your data to compare the behavior of users in the different groups. You could build a funnel relevant to the behavior you're testing; this might be something like navigating from the student Dashboard to any course content. Scrolling down to the "Overview" section of your funnel will allow you to select the "Active Flags" property from the "Properties" drop-down. The displayed rows will contain flag names and show you how users behaved along the funnel with different flags active. Make sure you select an appropriate date range when creating a funnel. Mixpanel has good support and instructional videos if you get lost while working with your data. Use a split testing significance calculator such as KISSmetrics' to determine whether the results of your split test are statistically significant.
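If you'd rather check significance in code than with an online calculator, a two-proportion z-test does the same job; this sketch uses statsmodels, and the counts are made-up placeholders rather than real experiment data:

```python
# A two-proportion z-test as an alternative to an online significance calculator.
# The numbers below are placeholders, not real experiment data.
from statsmodels.stats.proportion import proportions_ztest

conversions = [420, 480]   # users who reached the courseware in control vs. variant
assigned = [5000, 5000]    # users assigned to each group

z_stat, p_value = proportions_ztest(count=conversions, nobs=assigned)
print("z = %.2f, p = %.4f" % (z_stat, p_value))   # p < 0.05 is the usual bar for significance
```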

Clean Up

Once you've run your experiment to completion, use your data to determine which version of your change is best. Now you can touch up this "best" version, test it thoroughly, and roll it out to all users by making it a permanent part of the codebase. Remember to remove the code corresponding to your other tests; we don't want unused conditionals sitting around. Don't forget to use the Django admin site to delete your now-unused flags!
