Presenters: David Cramer from DISQUS
PyCon 2012 presentation page: https://us.pycon.org/2012/schedule/presentation/12/
Slides:
Video: http://www.youtube.com/watch?v=QGfxLXoMpPk
Video running time: 41:20
Shipping new code as soon as it's ready
- Reviewed by peers
- Passes automated tests -- continuous integration is essential
- Some level of QA
┌────────────────┐   ┌────────────────┐
|     Review     | ← |     Commit     |
└────────────────┘   └────────────────┘
        ↓                    ↑
┌────────────────┐   ┌────────────────┐
|  Integration   | → |  Failed build  |
└────────────────┘   └────────────────┘
        ↓
┌────────────────┐   ┌────────────────┐
|     Deploy     | → |   Reporting    |
└────────────────┘   └────────────────┘
        ↓
┌────────────────┐
|    Rollback    |
└────────────────┘
Continuous deployment does not necessarily mean that you deploy all the time or every 5 minutes.
You can deploy as often as you want. The important thing is that you can deploy whenever you want.
- Develop features incrementally
- Release frequently
- Smaller doses of QA
- Culture shock
- Stability depends on test coverage
- Initial time investment
- DISQUS is a company of about 30 people.
- Spent the last 2 years working on infrastructure - automated testing tools, reporting, etc.
- We have a guy dedicated to tests and a guy dedicated to releasing.
- You really have to instill this in your culture.
- Automate testing of complicated processes and architecture
- Simple can be better than complete
- Especially for local development
python setup.py {develop,test}
- Puppet, Chef, Buildout, Fabric, etc.
Automated testing is a requirement. Continuous integration is the basis for all of this.
David feels that packaging your app is essential, because you need things to be repeatable.
- Simplify local setup
git clone dcramer@disqus:disqus.git
make
python manage.py runserver
- Need to test dependencies?
- VirtualBox + Vagrant
vagrant up
We actively use early versions of features before public release.
At DISQUS, we do about 12,000 to 15,000 requests/second and we peak much higher than that.
It's important that a feature doesn't take the site down. We want to slowly release features.
(9:32) Feature flippers or switches
Deploy features to portions of a user base at a time to ensure smooth, measurable releases
They use a platform called Gargoyle -- currently very Django-specific, but trying to generalize it to be Django-agnostic and maybe even language-agnostic.
Example:
- Only enable this new feature for internal users.
- OK, now turn it on for 1% of our base.
- Keep bumping up until we know it's scalable.
Early adopters are free QA
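from django.http import HttpResponse
from gargoyle import gargoyle

def my_view(request):
    # Serve the new code path only when the 'awesome' switch is active for this request.
    if gargoyle.is_active('awesome', request):
        return HttpResponse('new happy version :D')
    return HttpResponse('old sad version :(')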
New users can check a box to volunteer to test bleeding edge features.
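The "internal users first, then 1%, then keep bumping" rollout described above can be sketched with stable hash bucketing. This is an illustrative sketch only, not how Gargoyle is implemented; the function name in_rollout and its parameters are hypothetical.

```python
import hashlib


def in_rollout(switch_name, user_id, percent, internal_users=frozenset()):
    """Return True if user_id should see the feature behind switch_name.

    Hypothetical helper: internal users always get the feature; everyone
    else is placed in a stable bucket from 0 to 99 and sees the feature
    when that bucket is below `percent`.
    """
    if user_id in internal_users:
        return True
    # Hash the switch and user together so each switch buckets users independently.
    key = "{}:{}".format(switch_name, user_id).encode("utf-8")
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 100
    return bucket < percent
```

Because the bucket depends only on the switch name and the user id, a user who sees the feature at 1% keeps seeing it as the percentage is raised, which keeps the experience consistent during a rollout.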
Phabricator - a code review tool open-sourced by Facebook.
(12:40) When you do a code review, it's done through a commit - friendly for developers. Don't have to use the Web UI.
arc diff runs a set of lints and runs your tests for you.
They've released a plugin for nose called quickunit.
- Developers must know when they've broken something
- IRC, Email, IM
- Support proper reporting
- XUnit, Pylint, Coverage.py
- Painless setup
apt-get install jenkins
It's important for developers to know right away when stuff is broken so they can ideally fix it before they've context switched to something else.
False positives
- Reporting isn't accurate
- Services fail (even a third party service)
- Bad tests
Test coverage
- Regressions on untested code
Feedback delay
- Integration tests vs. unit tests
- Rerun tests several times on failure
- Report continually failing tests
- Replace external service tests with a functional test suite
- Raise awareness with reporting
- Fail/alert when coverage drops on a build
- Commit tests with code
- Coverage against commit diff for untested regressions
- Utilize code review
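Two of the fixes above, rerunning tests on failure and reporting tests that keep failing, can be sketched as a small retry wrapper. This is a minimal sketch under assumptions: run_with_retries and classify are hypothetical names, not part of nose, Jenkins, or any DISQUS tooling.

```python
def run_with_retries(test_func, attempts=3):
    """Run test_func up to `attempts` times.

    Returns (passed, failures), where failures holds the message of
    every failed attempt that happened before a pass (or all of them).
    """
    failures = []
    for _ in range(attempts):
        try:
            test_func()
        except Exception as exc:  # a flaky service can raise anything
            failures.append(str(exc))
        else:
            return True, failures
    return False, failures


def classify(passed, failures):
    """Label a result so intermittent tests can be reported separately."""
    if passed and not failures:
        return "pass"
    if passed:
        return "flaky"  # passed eventually: report it, but do not fail the build
    return "fail"
```

Keeping the "flaky" label separate from "pass" is what makes continually failing tests visible instead of silently retried away.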
This is where almost all of our time has gone.
At one point our test suite took 40 minutes to an hour.
- Write unit tests
- vs. slower integration tests
- Mock external services
- Distributed and parallel testing
- Matrix builds
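Mocking an external service keeps a test fast and independent of the network. The sketch below uses the standard library's unittest.mock; the client object, its get_thread method, and comment_count are hypothetical stand-ins for illustration, not DISQUS code.

```python
from unittest import mock


def comment_count(client, thread_id):
    """Return how many comments the (hypothetical) service client reports."""
    response = client.get_thread(thread_id)
    return len(response["comments"])


def test_comment_count():
    # Replace the real client with a mock so no network call is made.
    fake_client = mock.Mock()
    fake_client.get_thread.return_value = {"comments": ["a", "b", "c"]}

    assert comment_count(fake_client, 42) == 3
    # Verify the code under test called the service the way we expect.
    fake_client.get_thread.assert_called_once_with(42)
```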
<You> Why is mongodb-1 down?
<Ops> It's down? Must have crashed again.
- Rate of traffic (not just hits!)
- Business vs. system
- Response time (database, web)
- Exceptions
- Social media
Beyond Nagios and PagerDuty.
Graphite tracks and graphs metrics.
We send it response times, counters, disk space usage
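Response times, counters, and gauges such as disk usage are commonly emitted in the statsd line format over UDP (statsd is one of the tools named in the Q&A). The sketch below assumes a statsd daemon on its default port 8125; the helper names and the metric names in the usage note are hypothetical.

```python
import socket


def format_timing(name, millis):
    """statsd timer line, e.g. a response time in milliseconds."""
    return "{}:{}|ms".format(name, millis)


def format_counter(name, delta=1):
    """statsd counter line, incremented by delta."""
    return "{}:{}|c".format(name, delta)


def format_gauge(name, value):
    """statsd gauge line, e.g. current disk space usage."""
    return "{}:{}|g".format(name, value)


def send_metric(payload, host="127.0.0.1", port=8125):
    """Fire-and-forget UDP send, so metrics never block a request."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(payload.encode("ascii"), (host, port))
    finally:
        sock.close()
```

For example, send_metric(format_timing("web.response", 120)) would record one 120 ms response under an assumed metric name.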
Sentry is designed to receive exceptions and track them.
You can now use it even if you're not using Django.
Deployment - the least important part of continuous deployment. Everyone solves it differently.
What DISQUS does: ship a relocatable virtualenv as a tarball.
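Shipping each release as a self-contained directory pairs well with the rollback approach mentioned later in the Q&A (swap the symlink and restart the servers). The sketch below shows only the symlink swap on a POSIX system; the function name and the "current" link layout are assumptions, and the restart step is left out.

```python
import os


def point_current_at(release_dir, current_link):
    """Atomically repoint the symlink current_link at release_dir.

    A new symlink is created under a temporary name and then renamed
    over the old one, so readers never observe a missing link.
    """
    tmp_link = current_link + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)  # clean up after an interrupted earlier swap
    os.symlink(release_dir, tmp_link)
    os.replace(tmp_link, current_link)  # rename replaces the old link itself
```

Deploying and rolling back are then the same call: point the link at the new release directory, or back at the previous one.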
- Package your app
- Value code review
- Ease deployment, fast rollbacks
- Set up automated tests
- Gather some easy metrics
- Build an immune system -- automatically rolls back if some metric goes down -- very interesting, but very risky
- Automate deploys, rollbacks (maybe)
- Adjust to your culture
- There is no "right way"
- SOA == great success
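The "immune system" idea above (roll back automatically if some metric goes down) comes down to comparing a metric before and after a deploy. This is a minimal sketch with a hypothetical function name and an assumed 20% threshold; a real system would also need to account for noise and traffic patterns, which is part of why the talk calls the approach risky.

```python
def should_roll_back(before, after, max_drop=0.2):
    """Return True if the metric fell by more than max_drop (a fraction).

    `before` and `after` are readings of the same metric, for example
    comments posted per minute, taken before and after a deploy.
    """
    if before <= 0:
        return False  # no baseline to compare against
    return after < before * (1 - max_drop)
```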
Code reviews: Before Phabricator, DISQUS used GitHub pull requests, but they found that this did not scale.
(31:49) Selenium tests -- we deleted all our Selenium tests. We're reimplementing some of them, but simpler.
(32:50) How many times a day do you deploy? At minimum, once a day. Lately, it's been no more than half a dozen times per day.
(33:55) Why do you roll back? Why not fix it and move forward? Sometimes it might take a while to fix it.
(34:30) What do you do about database changes? Especially for rollbacks. Search for "DISQUS schema changes" or "David Cramer schema changes".
(35:52) Any code review policies? Maximum # of lines or maximum amount of time until review. The current standard is that at the start of the day and the end of the day, you must clean your slate. Even this kind of sucks, because you may have to wait a day to get your change reviewed. What we really want is to give a max of 20 minutes, and if a change isn't reviewed by then, it automatically gets assigned to someone else.
(37:23) Numbers of production servers? 200ish. 4 billion pageviews.
(38:00) How long does it take to deploy? Ashamed to admit it. All of our servers are in one location, although we push a lot of stuff to Akamai.
(39:00) One monolithic deploy or many? We're moving towards SOA and away from monolithic.
(40:15) Can you tell us about your rollback process? At one point, it was just swap the symlink and restart the servers.
(40:33) Business metrics measurements - what tools? Graphite, statsd, porkchop
(41:10) Done.