Amendments to course
* Remove top paragraph, as it doesn't provide any value
* Correct grammar and definition in Step 7
* Amend names of metrics in Grafana
* Fix grammatical error in step 10
Jessica White authored Sep 2, 2018
1 parent f69218f commit b3d792b
Showing 5 changed files with 11 additions and 13 deletions.
2 changes: 1 addition & 1 deletion graphite-stack/step10.md
@@ -27,6 +27,6 @@ Latency is useful in a number of ways:

Again, this suggests that something has gone awry and that the calls are at risk of timing out.

- The ideal situation is a short latency for both errors (if there are any) and successes, a high success rate and low no not error rate. There are numerous behaviours that can suggest things going wrong in the API just by these few baseline metrics.
+ The ideal situation is a short latency for both errors (if there are any) and successes, a high success rate and low error rate. There are numerous behaviours that can suggest things going wrong in the API just by these few baseline metrics.

You can now set up the baseline metrics for measuring the health of your APIs.
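
To make these baseline metrics concrete, here is a minimal sketch of how an error-rate series could be pulled straight from Graphite's render API. The `localhost` address and port are assumptions for illustration, and the sketch reuses the `stats.timers.response-api.code.*` naming from the dashboard steps:

```python
import requests

# Assumed Graphite render endpoint; adjust for your environment.
GRAPHITE = "http://localhost:8080"

# asPercent() divides the 5XX request count by the total request count,
# yielding an error-rate series; transformNull() fills gaps with zero.
target = (
    "asPercent("
    "sumSeries(transformNull(stats.timers.response-api.code.5*.count)),"
    "sumSeries(transformNull(stats.timers.response-api.code.*.count)))"
)

resp = requests.get(
    f"{GRAPHITE}/render",
    params={"target": target, "from": "-1h", "format": "json"},
)
for series in resp.json():
    print(series["target"], series["datapoints"][-5:])
```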
2 changes: 0 additions & 2 deletions graphite-stack/step3.md
@@ -1,7 +1,5 @@
## Let's get visualising

- Though Graphite does have the functionality for creating visualisations, we will be using [Grafana](https://grafana.com/)

## Run Grafana

Now that we have the ability to record our metrics, we need to be able to display them in a useful format. This is where Grafana comes in. Grafana is a dashboarding / visualisation UI that can be provided with a variety of data sources.
12 changes: 6 additions & 6 deletions graphite-stack/step7.md
@@ -1,25 +1,25 @@
# RED and The Four Golden Signals

- So far, we have set up StatsD, Graphite and Grafana, ran an API that returns given responses and now we have come to the point where we need to build our dashboard.
+ So far, we have set up StatsD, Graphite and Grafana and run an API that returns given responses. Now we have come to the point where we need to build our dashboard.

- As we have set up an API that is you can manipulate the response of, we can demonstrate some useful ways we can monitor the behaviour of API's and why.
+ As we have set up an API that you can manipulate the response of, we can demonstrate some useful ways we can monitor the behaviour of APIs and why.

- RED is discussed in [this blog written by Peter Bourgon](https://peter.bourgon.org/blog/2016/02/07/logging-v-instrumentation.html) is a mnemonic coined by Tom Wilkie. Much like [Brendan's Gregg USE methods for measuring system resources](http://www.brendangregg.com/usemethod.html), RED is suggested as baseline measurements for API's.
+ RED, which is discussed in [this blog written by Peter Bourgon](https://peter.bourgon.org/blog/2016/02/07/logging-v-instrumentation.html), is a mnemonic coined by Tom Wilkie. Much like [Brendan Gregg's USE method for measuring system resources and services](http://www.brendangregg.com/usemethod.html), RED is suggested as a set of baseline measurements for APIs.

- **R**ate - a count of requests over time

- **E**rror - a count of errors over time

- **D**uration - the time between request and response.

- These principles are echoed in the [Google Four Golden Signals mentioned in the Distributed Systems chapter of their SRE book](https://landing.google.com/sre/book/chapters/monitoring-distributed-systems.html).
+ These principles are echoed in the [Google Four Golden Signals mentioned in the Monitoring Distributed Systems chapter of their SRE book](https://landing.google.com/sre/book/chapters/monitoring-distributed-systems.html).

1. Latency - the time between request and response but with a focus on the difference between successful and erroring requests.

2. Traffic - requests per second. A measure of the load on the API.

3. Errors - same as before, a count of errors.

- 4. Saturation - measures of the systems utilisation. Is the memory, I/O or CPU reaching capacity for example.
+ 4. Saturation - what percentage of your system's resources is currently in use.

- In this next section, we shall build a dashboard reflecting the above principles focusing on the principles that overlap between these two frameworks. Saturation is difficult to set up for the current set up and within the timeframe of this demo.
+ In this next section, we shall build a dashboard reflecting the above principles, focusing on those that overlap between the two frameworks. Saturation is difficult to demonstrate accurately with the setup and timeframe of this demo. Ideally you would need to have conducted load testing to find out the capacity of your systems, and set alerts or visual cues for when you are getting close enough to that capacity that you would need to take action.
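
As an aside, here is a minimal sketch of where the RED measurements come from on the emitting side, using the Python `statsd` client. The handler, address and metric prefix are assumptions for illustration; the course's demo API already reports equivalent timers under `response-api`:

```python
import time
import statsd  # pip install statsd

# Assumed StatsD address and prefix for this sketch.
client = statsd.StatsClient("localhost", 8125, prefix="response-api")

def do_work():
    return 200  # stub handler: pretend the request succeeded

def handle_request():
    start = time.time()
    status = do_work()  # hypothetical handler returning an HTTP status code
    elapsed_ms = (time.time() - start) * 1000
    # A single timer per status code gives StatsD/Graphite everything RED
    # needs: the timer's count is the Rate (and, for 5XX codes, the Errors),
    # and its mean is the Duration. This produces the
    # stats.timers.response-api.code.<status>.{count,mean} shape that the
    # dashboard queries later.
    client.timing(f"code.{status}", elapsed_ms)

handle_request()
```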
4 changes: 2 additions & 2 deletions graphite-stack/step8.md
@@ -10,13 +10,13 @@ Select `graphite` as your Data Source.

First, let's display the rate and latency of successful calls. We will put the response times and count of successful calls in the same graph.

- The first query would be `stats` `timers` `response-api` `code` `200` `count`. In the functions for the same query select `transformNull()` `sumSeries()` `alias(count)` `response time
+ The first query would be `stats` `timers` `response-api` `code` `2*` `count`. In the functions for the same query select `transformNull()` `sumSeries()` `alias(2XX requests count)`

Ensure that the time scale at the top is set to the past hour. You should see a count of the number of times you clicked the Success Code command in the previous step.

Now add the erroring calls.

- For the second query select `stats` `timers` `response-api` `code` `500` `count`. In the functions for the same query select `transformNull()` `sumSeries()` `alias(errors)` `response time,
+ For the second query select `stats` `timers` `response-api` `code` `5*` `count`. In the functions for the same query select `transformNull()` `sumSeries()` `alias(5XX requests count)`

You should see a count of the number of times you clicked the command to return Internal Server Errors in the previous step.

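For reference, the selections above compose into raw Graphite target expressions, which can also be queried directly. A sketch, again assuming a local render endpoint:

```python
import requests

# Assumed Graphite render endpoint for this sketch.
GRAPHITE = "http://localhost:8080"

# The point-and-click selections above compose into these raw targets:
targets = [
    "alias(sumSeries(transformNull(stats.timers.response-api.code.2*.count)), '2XX requests count')",
    "alias(sumSeries(transformNull(stats.timers.response-api.code.5*.count)), '5XX requests count')",
]

resp = requests.get(
    f"{GRAPHITE}/render",
    params=[("target", t) for t in targets] + [("from", "-1h"), ("format", "json")],
)
for series in resp.json():
    print(series["target"], series["datapoints"][-5:])
```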
4 changes: 2 additions & 2 deletions graphite-stack/step9.md
@@ -2,9 +2,9 @@

We have yet to measure the duration from when each request is made to when a response is returned.

- First, to measure the duration of the successful response in the query select `stats` `timers` `response-api` `code` `200` `mean`. In the Functions for the same query select `transformNull()` `averageSeries()` `alias(success response time)`
+ First, to measure the duration of the successful response in the query select `stats` `timers` `response-api` `code` `2*` `mean`. In the Functions for the same query select `transformNull()` `averageSeries()` `alias(2XX response time)`

- The erroring responses are very similar. For this query select `stats` `timers` `response-api` `mean` `500` `count`. In the Functions for the same query select `transformNull()` `averageSeries()` `alias(error response time)`
+ The erroring responses are very similar. For this query select `stats` `timers` `response-api` `code` `5*` `mean`. In the Functions for the same query select `transformNull()` `averageSeries()` `alias(5XX response time)`

Go to the _display_ tab and select series overrides. The first override will be for the successful response. Use the alias _2XX response time_ followed by `Lines: true` `Line fill:2` `Y-axis:2`

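As with the rate graph, these duration queries are plain Graphite targets underneath. A sketch that fetches both and prints the average latency over the last hour, under the same local-endpoint assumption:

```python
import requests

# Assumed Graphite render endpoint, as in the earlier sketches.
GRAPHITE = "http://localhost:8080"

# Raw targets equivalent to the two duration queries built above.
targets = {
    "2XX response time": "averageSeries(transformNull(stats.timers.response-api.code.2*.mean))",
    "5XX response time": "averageSeries(transformNull(stats.timers.response-api.code.5*.mean))",
}

for name, target in targets.items():
    resp = requests.get(
        f"{GRAPHITE}/render",
        params={"target": target, "from": "-1h", "format": "json"},
    )
    # Each datapoint is a [value, timestamp] pair; skip empty buckets.
    points = [v for v, _ in resp.json()[0]["datapoints"] if v is not None]
    avg = sum(points) / len(points) if points else 0.0
    print(f"{name}: {avg:.1f} ms average over the last hour")
```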