Merge pull request #116 from gravitational/evan/sre-challenge-update

SRE challenge refresh
gravitational · Nov 13, 2024 · 4a037ea
2 parents 28ec1be + 03395e3
commit 4a037ea

Showing 3 changed files with 68 additions and 184 deletions.
diff --git a/challenges/cloud/README.md b/challenges/cloud/README.md
diff --git a/challenges/cloud/sre.md b/challenges/cloud/sre.md
diff --git a/challenges/sre/challenge.md b/challenges/sre/challenge.md
@@ -1,105 +1,77 @@
-# Summary
+# SRE
 
-Build, deploy, and monitor a service that provides an API for scaling Kubernetes deployments.
+The interview process is divided into two main sections: a walkthrough and a challenge. In the walkthrough, led by the hiring manager, you’ll have an opportunity to share your work history, review our leveling matrix, and discuss the upcoming challenge. Following the walkthrough, the challenge will assess your skills and fit for this position.
 
-# Rationale
+We will use the challenge to evaluate your skills in the following areas:
+* Translating high-level requirements into a simple, functional design.
+* Writing production-level code that does not rely on extensive or insecure dependencies.
+* Understanding Go, Kubernetes, release engineering, and security.
+* Communicating with the team and handling feedback.
 
-This exercise has two goals:
+We believe this technique is not only better but also more fun compared to
+whiteboard/quiz interviews so common in the industry. It’s not without
+downsides: it can take longer than traditional interviews. That said, it's
+our view that this type of challenge gives us a more accurate assessment of your
+ability to work well on the types of projects we’re working on day-to-day here
+at Teleport. [Some of the best teams use coding
+challenges](https://sockpuppet.org/blog/2015/03/06/the-hiring-post/). We
+appreciate your time and are looking forward to hacking on this project
+together.
 
-* It helps us to understand what to expect from you as a Site Reliability Engineer,
-  how you reason about a production service, how you reason about system design, your
-  opinions on automation & tooling, and how you communicate when trying to solve problems.
-* It helps you get a feel for what it would be like to work at Teleport, as this
-  exercise aims to simulate our day-as-usual and expose you to the type of work
-  we're doing here.
+Please come prepared to the walkthrough having reviewed:
+* [Site Reliability Engineering (SRE) Levels](../../levels/sre.pdf)
+* [Challenge documentation](../sre/challenge.md)
 
-We believe this technique is not only better, but also is more fun compared to
-whiteboard/quiz interviews so common in the industry. It's not without the
-downsides - it could take longer than traditional interviews.
+# Challenge
 
-[Some of the best teams use coding challenges.](https://sockpuppet.org/blog/2015/03/06/the-hiring-post/)
+In this challenge, you will create a Go server that interacts with a Kubernetes cluster, incorporating automated builds, containerization, deployment, and testing.
 
-We appreciate your time and are looking forward to hacking on this project together.
+## Getting Started
 
-# Requirements
+First, create a new private GitHub repository, invite the interview panel as contributors, and share the link with the team in Slack.
 
-In this challenge you will create a Go server that interacts
-with a Kubernetes cluster. You'll also add automation to build the server,
-create a container image, run the server, and execute tests.
+> [!IMPORTANT]
+> You’ll have up to 2 weeks from the agreed start date to complete the challenge. Please allow 24-48 hours for each PR submission to give the team ample time for review and feedback during business hours.
 
-The requirements vary depending on the level you are applying to. This
-challenge covers 5 engineering levels at Teleport. Level 6 is only for internal
-promotions. Check [Site Reliability Engineering (SRE) Levels](../../levels/sre.pdf) for more details.
+### PR Submissions
+You will use this repository for every PR submission during the challenge. For each submission, please ensure the following:
 
-Start by creating a new GitHub repository. Then let the interview panel know the
-repository's location by pasting a link in your interview Slack channel. Invite
-interview panel participants as contributors to the new repository if you prefer
-to keep your submission private.
-
-* Your submission will need to meet the requirements of the level you are applying for.
-* Split the submission into 2-3 pull requests for us to review. We will review
-  every pull request and provide feedback. Open a single pull request at a time
-  and wait for at least 2 approvals before merging.
-* Share a link with the interview panel on Slack each time you open a new pull request.
-* The interview panel will clone your repository and test it after the last pull
-  request is merged.
+* Each submission meets the requirements of the level you are applying for.
+* Split the work into ~3 pull requests: one for the design document, one for the server code, and one for deployment & automation.
+* After each submission, the team will review it and provide feedback. Open one PR at a time and wait for two approvals before submitting the next one. You can keep working locally while you wait for feedback on the current PR.
 
 ## Design Doc
+The first pull request must be a brief design document that describes how you plan to implement the solution. At Teleport, we prefer Markdown for [our designs](https://github.com/gravitational/teleport/blob/master/rfd/0000-rfds.md).
 
-The first pull request must be a brief design document that describes how you plan
-to implement the solution. At Teleport, we prefer Markdown for [our designs](https://github.com/gravitational/teleport/blob/master/rfd/0000-rfds.md).
-
-Be sure to cover the following in your design:
+### Requirements
 
+Please be sure to cover the following design topics:
 * API structure
 * Developer workflow
   * Ease of contributing to the project from a fresh clone
   * Ease of building, running and testing the server
-* Level 3+: Build, Release
-* Level 4+: Caching, mTLS, and Delivery
+* Level 3+: Build and Release
+* Level 4+: Caching and mTLS
 * Level 5+: Reconciliation, Conflicts, and Automation
 
-A few notes about the design document:
-
-* We expect the design document to be completed roughly within the first week.
-  This is to ensure you have enough time to work on the implementation.
-* Avoid writing an overly detailed design document. 500-1500 words is
-  sufficient.
-* Avoid sending us draft design documents, because it is difficult to evaluate which
-  parts are draft and which parts are complete. Instead we encourage asking
-  questions in Slack.
-
-## Server & Automation
+### Suggestions
+* Aim to complete the design document within roughly the first week, so you have enough time left for the implementation.
+* Avoid writing an overly detailed design document. 500-1500 words is a good target.
+* We encourage you to ask questions in Slack.
+* Avoid sending draft PRs for feedback, because it is difficult to tell which parts are drafts and which are complete.
 
-Once the design document is approved, begin working on adding features to the
-server implementation that includes automation. A key feature of this challenge
-is to produce a self-contained GitHub project that automates away as much as
-possible. This should include code quality, testing, build, and deployment.
+## Server, Automation, & Deployment
+Once the design document is approved, the next PR should include your server implementation, along with build automation to simplify testing. The interview panel will clone the repository and run Make targets to build and test your server code. Be sure to include a couple of high-quality tests that cover both happy and unhappy scenarios, and minimize external dependencies.
 
-Expect the interview panel to clone the repository and execute one or more
-Make targets that build and test a working solution. Add a couple of high quality
-tests that cover happy and unhappy scenarios. Keep the number of external
-dependencies low.
+> [!NOTE]
+> For level 4+, please research Kubernetes controllers and the recommended Go client libraries carefully. Understanding how controllers cache resources is key to implementing a straightforward solution.
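To make the caching idea concrete, the sketch below shows the pattern informers follow, using only the standard library: watch events keep an in-memory map current, and read-only lookups (such as the "read-only requests should not each trigger a request to the cluster" requirement) are served from that map. All type and method names here are hypothetical. In a real solution, client-go's shared informers and listers, or controller-runtime's cache-backed client, provide this for you, including the watch itself.

```go
package main

import (
	"fmt"
	"sync"
)

// eventType mirrors the add/update/delete notifications a watch delivers.
type eventType int

const (
	added eventType = iota
	updated
	deleted
)

// deploymentInfo is a hypothetical, trimmed-down cache entry.
type deploymentInfo struct {
	Namespace string
	Name      string
	Replicas  int32
}

// deploymentCache keeps the latest known state of each deployment in memory.
type deploymentCache struct {
	mu    sync.RWMutex
	items map[string]deploymentInfo
}

func newDeploymentCache() *deploymentCache {
	return &deploymentCache{items: make(map[string]deploymentInfo)}
}

// cacheKey uses the same namespace/name form Kubernetes tooling commonly uses.
func cacheKey(namespace, name string) string { return namespace + "/" + name }

// handle applies one watch event to the cache (the write path).
func (c *deploymentCache) handle(ev eventType, d deploymentInfo) {
	c.mu.Lock()
	defer c.mu.Unlock()
	switch ev {
	case added, updated:
		c.items[cacheKey(d.Namespace, d.Name)] = d
	case deleted:
		delete(c.items, cacheKey(d.Namespace, d.Name))
	}
}

// replicas answers a read-only request from memory, never from the cluster.
func (c *deploymentCache) replicas(namespace, name string) (int32, error) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	d, ok := c.items[cacheKey(namespace, name)]
	if !ok {
		return 0, fmt.Errorf("deployment %s not found in cache", cacheKey(namespace, name))
	}
	return d.Replicas, nil
}
```

A read-write mutex fits here because reads vastly outnumber watch events; the trade-off is that reads may briefly see slightly stale data, which is inherent to any informer-style cache.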
 
-Do not try to achieve full test coverage. This will take too long. Write enough
-to exercise the different key components to show they are working as intended 
-while demonstrating your approach to automation.
+After the server code is reviewed, submit a final PR with the remaining automation and deployment work, plus any outstanding success criteria. A key aspect of this challenge is to produce a self-contained GitHub project that automates as much as possible, including build artifacts, code testing, and deployment.
 
-For level 4+, please research Kubernetes controllers and the recommended Go client
-libraries carefully. Understanding how controllers commonly cache resources
-is key to implementing a straightforward solution.
+### Tooling
+For evaluation purposes, your solution should be written in Go and be deployable to a local Kubernetes cluster. The choice of which local Kubernetes cluster is up to you, but please ensure compatibility with both macOS and Linux. We suggest [KIND](https://kind.sigs.k8s.io/).
 
-## Deployment
-
-For evaluation purposes, your solution should be deployable to a local
-Kubernetes cluster. The choice of which Kubernetes cluster is up to you, but
-please choose one that has the ability to run on macOS and Linux.
-We suggest [KIND](https://kind.sigs.k8s.io/).
-
-## Key Dependencies
-
-You are welcome to solve this challenge using tools you are most familiar with,
-but, at a minimum, please include the following:
+You may use additional external dependencies, but make sure that detecting or installing them is straightforward for the reviewer. At a minimum, your solution should include the following dependencies:
 
 * Go
 * make
@@ -188,8 +160,7 @@ but, at a minimum, please include the following:
   Read-only requests should not each trigger a request to the cluster.
   It is acceptable to use either client-go or controller-runtime to implement this.
 * Secure connections between the gRPC API and caller with mTLS
-* Server must store the desired state in a CRD (per-deployment) and reconcile the deployment to that state.
-  (gRPC API endpoints only need to read the real, current value.) 
+* Server must store the desired state in a CRD (per-deployment) and reconcile the deployment to that state. (gRPC API endpoints only need to read the real, current value.)
 * One or two tests that cover happy and unhappy scenarios
 
 ### Automation
@@ -205,78 +176,20 @@ but, at a minimum, please include the following:
 
 # Guidance
 
-## Interview process
-
-The interview team will join the Slack channel. The team consists of the
-engineers who will be working with you. Ask them about the engineering culture,
-work and life balance, or anything else that you would like to learn about
-Teleport.
-
-Before writing the actual code, create a brief design document in markdown in GitHub
-and share with the team.
-
-Split your code submission using pull requests and give the team an opportunity
-to review the PRs. A good "rule of thumb" to follow is that the final PR
-submission is a formality adding a small feature set - it means that the team
-had an opportunity to contribute the feedback during multiple well defined
-stages of your work.
-
-Our team will do their best to provide a high quality review of the submitted
-pull requests in a reasonable time frame. You are spending your time on this, we
-are going to contribute our time too.
-
-After the final submission, the interview team will assemble and vote using a
-"+1, -2" anonymous voting system: +1 is submitted whenever a team member accepts
-the submission, -2 otherwise.
-
-In case of a positive result, we will connect you to our HR and recruiting
-teams, who will work out the details and present an offer.
-
-In case of a negative score result, hiring manager will contact you and share a
-list of the key observations from the team that affected the result.
-
-## Code and project ownership
-
-This is a test challenge and we have no intent of using the code you've
-submitted in production. This is your work, and you are free to do whatever you
-feel  is reasonable with it. In the scenario where you don't pass, you can open
-source it with any license and use it as a portfolio project.
-
 ## Areas of focus
 
-Teleport focuses on simplicity, automation and robustness.
-
-These are the areas we will be evaluating in the submission:
-
-* Use consistent coding style. We follow
-  [Go Coding Style](https://github.com/golang/go/wiki/CodeReviewComments) for
-  the Go language.
-* At the minimum, create tests for reading and writing, and an unhappy/error scenario.
-* Make sure builds are reproducible. Pick any vendoring/packaging system that
-  will allow us to get consistent build results.
-* Ensure error handling and error reporting is consistent. The system should
-  report clear errors and not crash under non-critical conditions.
-* Production readiness. Once completed, the code itself, even if incomplete, should
-  be sufficiently solid and robust to make it to a real production cluster.
-* API design. Please include your proposed HTTP API or gRPC API in the design doc.
-  For the gRPC API, you should include a complete proto file in the design doc.
-* Security. Describe your mTLS setup in the design doc, including chosen cipher suites.
-  Ensure that your implementation is secure.
-
-The primary factor in the team's decision is overall code quality. We are looking for
-the highest possible quality with the smallest possible scope that meets the requirements
-of the challenge.
-
-## Trade-offs
-
-Write as little code as possible, otherwise this task will consume too much time
-and code quality will suffer.
+The primary factor in the team's decision is overall code quality. We are looking for the highest possible quality with the smallest possible scope that meets the requirements of the challenge.
 
-Please cut corners, for example configuration tends to take a lot of time, and
-is not important for this task.
+* Use consistent coding style. Internally we follow [Go Coding Style](https://github.com/golang/go/wiki/CodeReviewComments).
+* Make sure builds are reproducible, so that building the same commit twice yields consistent results.
+* Ensure error handling and error reporting are consistent. The system should report clear errors and not crash under non-critical conditions.
+* Production readiness. Once completed, the code itself, even if incomplete, should be sufficiently solid and robust to make it to a real production cluster.
+* API design. Please include your proposed HTTP API or gRPC API in the design doc. For the gRPC API, you should include a complete proto file in the design doc.
+* Security. Describe your mTLS setup in the design doc, including chosen cipher suites. Ensure that your implementation is secure.
+* Project management and scope. Manage your time wisely and ensure that the project scope aligns with the criteria for the level you're applying for. Avoid unnecessary complexity.
 
-Use hardcoded values as much as possible and simply add TODO items showing your
-thinking, for example:
+## Trade-offs
+Write as little code as possible; otherwise the project will consume too much time and code quality can suffer. Cut corners where appropriate; for example, avoid complex configuration and use hardcoded values. Add TODO items to indicate future enhancements or considerations, such as:
 
 ```
   // TODO: Add configuration system.
@@ -285,25 +198,16 @@ thinking, for example:
   // TODO: Add retry logic
 ```
 
-Comments like this one are really helpful to us. They save yourself a lot of
-time and demonstrate that you've spent time thinking about this problem and
-provide a clear path to a solution.
-
-Consider making other reasonable trade-offs. Make sure you communicate them to
-the interview team.
-
-Do not implement a system that scales outside of the scope of this challenge.
-For example, it is not necessary to deploy a shared cached or support deployment to multiple regions or AZs.
+Comments like this one are really helpful. They save you time, demonstrate that you've thought about the problem, and provide a clear path toward a longer-term solution.
 
-## Pitfalls and Gotchas
+Consider making other reasonable trade-offs. Make sure you communicate them to the interview team.
 
+### Pitfalls and Gotchas
 To help you out, we've composed a list of things that previously resulted in a no-pass from the interview team:
 
 * Scope creep. Candidates have tried to implement too much and ran out of time
   and energy. To avoid this pitfall, use the simplest solution that will work.
-  Avoid writing too much code. We've seen candidates' code introducing caching
-  and making many mistakes in the caching layer validation logic. Not having
-  caching would have solved this problem.
+  Avoid writing too much code. For example, it is not necessary to deploy a shared cache or support deployment to multiple regions or AZs.
 * Overly complex designs. Keep things simple and try and eliminate as many moving
   parts as possible. This is not only going to help in reviewing the solution,
   but is also often a way to distill a design to its essential parts.
@@ -323,34 +227,16 @@ questions to ask and questions we expect candidates to figure out on their own.
 Here is a great question to ask:
 
 > Is it OK to assume there will be only a single target kubernetes cluster for
-  this service? I will add a note on how support for multiple clusters could be
-  implemented, but it feels like an unnecessary complexity.
+this service? I will add a note on how support for multiple clusters could be
+implemented, but it feels like an unnecessary complexity.
 
 This is the question we expect candidates to figure out on their own:
 
 > What version of Go should I use? What dependency manager should I use? What
-  framework/tool should I use to automate testing and deployment ?
+framework/tool should I use to automate testing and deployment?
 
 Unless specified in the requirements, pick the solution that works best for you.
 
-# Tools
-
-This task should be implemented in Go and should work on a x86 64-bit Linux or MacOS
-machine.
-
-It's safe to assume a working Docker environment will be available locally as well.
-
-Additional external dependencies are acceptable,
-but please ensure detecting or installing the required/missing dependencies
-is as low friction as possible for the user/reviewer.
-
-# Timing
-
-You can split coding over a couple of weekdays or weekends and find time to ask
-questions and receive feedback.
-
-Once you join the Slack channel, you have up to 2 weeks to complete the
-challenge.
+## Code and project ownership
 
-Within this time frame, we don't give higher scores to challenges submitted more
-quickly. We only evaluate the quality of the submission.
+This is a test challenge, and we have no intention of using the code in production. The work is yours, and you’re free to handle it as you see fit. If you don’t pass, you’re welcome to open-source it under any license and use it as a portfolio project.