---
sidebar_position: 6
---

Problem Determination

It is important to be able to investigate problems that occur in production. Team members have typically used one of these implementations:

  • an existing Application Performance Management (APM) offering
  • a custom solution leveraging available platform tools (for example, those provided by a cloud provider or OpenShift)

Irrespective of the implementation, the team's experience is that it is important to plan for problem determination up front and to use metrics to identify when problems occur and need to be investigated. You can read more about the suggested approach for capturing metrics in the metrics section.

Once your metrics have identified that there is a problem, it will often fall into one of a few common categories.

The team's recommendations and guidance are based on these implementations and the typical categories of problems that we have seen.

Recommended Components

N/A

Guidance

Common problems

Memory Leak

  • Symptoms

    • metrics report that application is slowly becoming less responsive
    • metrics report heap size is continuously growing despite stable load on the application
    • metrics report increasing time spent running the garbage collector
  • Approach

    • generate a sequence of heap dumps at 1-minute intervals and compare them using Chrome DevTools to see what is growing
    • enable the heap profiler and look at where allocations are taking place (a sketch follows at the end of this section)

Generating heap dumps will have a performance impact on the process both in terms of memory and execution when the dump is being generated. Some of the team's suggestions for limiting the impact:

  • only enable heap dumps for one process/instance of the application, leaving the others to better serve customers
  • try to minimize the size of the heap dumps. For example, if you increased the memory allocated to the application as part of trying to investigate the issue, revert that change. If possible, limit the instance of the application to using less memory than you normally would.
  • make sure you have enough additional memory available on the machine running the process. The dump may need double the size of heap while it is being generated.
  • make sure to disable Kubernetes liveness checks (may require new deployment with that setting) or the periodic liveness check may kill your app before it finishes creating the heap dump.
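
As a rough sketch of the heap profiler approach mentioned above (the sampling interval, the 60-second window, and the output path are placeholders to adapt to your environment), the V8 sampling heap profiler can be driven from inside the application through the inspector module:

```js
// Sketch: record where allocations happen using the V8 sampling heap
// profiler via the inspector module. Output path and duration are placeholders.
const inspector = require('inspector');
const fs = require('fs');

const session = new inspector.Session();
session.connect();

// Start sampling allocations; samplingInterval is the average number of
// bytes between samples.
session.post('HeapProfiler.startSampling', { samplingInterval: 65536 });

// After a period of interest, stop sampling and write the profile to disk.
setTimeout(() => {
  session.post('HeapProfiler.stopSampling', (err, result) => {
    if (err) throw err;
    // Load the resulting .heapprofile in Chrome DevTools (Memory tab ->
    // Load profile) to see which call sites are allocating the most.
    fs.writeFileSync('/tmp/allocations.heapprofile',
                     JSON.stringify(result.profile));
    session.disconnect();
  });
}, 60000);
```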

Hangs or slow performance

  • Symptoms

    • process health checks fail
    • metrics report longer than expected request times
    • metrics report higher than expected CPU usage
  • Approach

It may be difficult to investigate performance issues in the production system. Once you have identified that there is a problem, being able to recreate it in another environment that is similar to production makes it much easier to investigate. The team has had success using goreplay as well as Autocannon to generate load in test environments in order to reproduce issues seen in production.
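
For example, a minimal sketch of generating load with Autocannon's programmatic API against a test environment (the URL, connection count, and duration below are placeholders to tune so that they approximate production traffic):

```js
// Sketch: drive steady load against a test instance with Autocannon.
// The URL and options are placeholders - adjust them to approximate the
// traffic pattern seen in production.
const autocannon = require('autocannon');

async function run() {
  const result = await autocannon({
    url: 'http://test-instance.example.com/api/orders', // hypothetical endpoint
    connections: 50, // concurrent connections
    duration: 60     // seconds
  });
  console.log(autocannon.printResult(result));
}

run();
```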

If you must investigate in production, you may have to disable or extend the health check period so that you can generate diagnostic reports and/or flame graphs. Ensure that you don't allow more than one process to be in the hung/slow state while investigating in order to limit impact on end users.
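
If you do need diagnostics from a running instance, one option (a sketch only; the trigger and output directory are choices you will need to make for your deployment) is to write a Node.js diagnostic report on demand:

```js
// Sketch: write a Node.js diagnostic report when the process receives
// SIGUSR2. Assumes Node.js 14 or later where process.report is available
// by default; /tmp/reports is a placeholder directory.
// (A similar result can be achieved with the --report-on-signal option.)
process.report.directory = '/tmp/reports';

process.on('SIGUSR2', () => {
  // The report is a JSON file containing JavaScript and native stack
  // traces, resource usage, and platform information.
  const file = process.report.writeReport();
  console.log(`diagnostic report written to ${file}`);
});
```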

Application failures

  • Symptoms

    • metrics report failure responses
  • Approach

    • Review application logs, turn on additional levels of logging if necessary
    • Review transaction traces

If you write your application in a language like TypeScript as opposed to plain JavaScript, raw stack traces may not map directly to your source code. In these cases, tools typically support the generation of source maps, which can help map back to the original source code.

Source maps should be generated and stored with the original source code for each release. For front-end JavaScript, teams should validate the process of using offline source maps in their typical problem investigation workflow. This article has some suggestions for doing that.
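
As a small sketch (assuming Node.js 14.18/16.6 or later, where process.setSourceMapsEnabled is available, and a build that emits source maps alongside the compiled output), source maps can also be applied to stack traces at runtime:

```js
// Sketch: have Node.js apply source maps to stack traces so that errors
// from transpiled TypeScript point at the original .ts files. Assumes the
// build emits source maps (for example, "sourceMap": true in tsconfig.json).
// The same effect can be had by starting the process with
// node --enable-source-maps.
process.setSourceMapsEnabled(true);

try {
  throw new Error('boom');
} catch (err) {
  // With source maps enabled, the stack frames reference the original
  // sources rather than the generated JavaScript.
  console.error(err.stack);
}
```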

Unhandled promise rejections or exceptions

  • Symptoms

    • process terminates reporting an unhandled rejection or exception
  • Approach

    • review logs to find the stack trace, then review the code indicated in the stack trace to look for the cause
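
A minimal sketch of capturing those stack traces before the process exits (recent Node.js versions already terminate on unhandled rejections by default; the handlers below simply make sure the failure is logged first):

```js
// Sketch: log the stack trace for unhandled rejections and uncaught
// exceptions before exiting, so the failing code path is recorded even when
// the error happens deep inside async code.
process.on('unhandledRejection', (reason) => {
  const detail = reason instanceof Error ? reason.stack : reason;
  console.error('Unhandled rejection:', detail);
  process.exit(1); // fail fast and let the platform restart the instance
});

process.on('uncaughtException', (err) => {
  console.error('Uncaught exception:', err.stack);
  process.exit(1);
});
```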

Resource Leaks

  • Symptoms

    • process fails reporting lack of resources (for example, sockets)
  • Approach

Network Issues

  • Symptoms

    • external health checks fail, while internal metrics look ok
  • Approach

    • use curl to check connectivity/responsiveness from different parts of the network (a Node.js-based alternative is sketched after this list)
    • if using Istio, check its logs
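
Where curl is not available in a minimal container image, a throwaway Node.js check can serve the same purpose (the URL below is a placeholder for the service being tested):

```js
// Sketch: check connectivity and responsiveness to a service from a given
// point in the network. The URL is a placeholder.
const http = require('http');

const url = 'http://my-service.my-namespace.svc.cluster.local:8080/health';
const start = Date.now();

http.get(url, (res) => {
  res.resume(); // drain the response body
  res.on('end', () => {
    console.log(`status ${res.statusCode} in ${Date.now() - start} ms`);
  });
}).on('error', (err) => {
  console.error(`request failed after ${Date.now() - start} ms: ${err.message}`);
});
```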

Native crashes

  • Symptoms

    • Unexpected process crash/restart; logs indicate a native crash (SIGSEGV or equivalent)
  • Approach

    • Enable core dump collection
    • Analyze core dump with llnode

Implementation

Application Performance Management Solutions

If you are operating a service where you own all deployments, an existing Application Performance Management (APM) solution can make sense if budget is not an issue (they can become expensive for large deployments). They offer the advantage of requiring less up-front investment and leveraging a solution designed to help investigate problems. The team has had success using Dynatrace, Instana, and New Relic.

If you develop applications that will be operated by customers, adding a dependency on a specific APM for problem determination is not recommended. The cost may be an issue for some customers, while others may already have standardized on a specific APM. Larger organizations in particular are likely to want to leverage their APM of choice across applications. It is recommended that whenever possible you leverage standard clients that will be able to feed data to different APMs and/or customer solutions.

Typically, APMs will be able to help you generate and consume logs, metrics, and traces, as well as generate heap dumps, in order to support problem determination for the common problems outlined above.

If you rely on an APM, it is important to ensure you have enough licences so that you can use the APM for local development/debugging in addition to production deployments.

Custom solution leveraging platform tools

Platform tools will provide some support for generating key metrics as well as storing logs and metrics. The Node.js Performance Timing API, and more specifically performance.now(), may be useful in capturing the time at which events occur.
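
For example, a small sketch of timing a key operation with performance.now() (doWork below is a placeholder for whatever is being measured):

```js
// Sketch: capture how long a key operation takes using the Performance
// Timing API; the measurement can then be fed to your metrics client.
const { performance } = require('perf_hooks');

// Placeholder for the operation being timed.
function doWork() {
  return new Promise((resolve) => setTimeout(resolve, 100));
}

async function handleRequest() {
  const start = performance.now();
  await doWork();
  const durationMs = performance.now() - start;
  console.log(`request handled in ${durationMs.toFixed(1)} ms`);
}

handleRequest();
```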

The level of support available will depend on the platform, with cloud platforms and Kubernetes-based platforms (for example, OpenShift) providing more built-in capabilities than a bare operating system. The team has had success in using Prometheus and Grafana to store, graph, and analyse metrics and logs on platforms that do not have integrated support.

You will need to plan for how you instrument the application in order to capture logs, metrics, and traces, with our recommendations outlined in those sections.

You will also need to plan for how to:

  • generate and extract heap snapshots
  • generate and extract core dumps
  • dynamically change log levels

Cloud platforms also often provide templates for these activities, so if you are on a specific cloud platform, look for the corresponding templates.

A key consideration is how to make the additional components required for problem determination available. Ephemeral containers may help in the future, but today you most likely need to include the components in your deployment or pre-plan/approve the process to add them to the production environment when needed.

Generating and extracting heap snapshots

The team has used a few approaches to generate and extract heap snapshots:

  • Approach 1
  1. Use the --heapsnapshot-signal=SIGUSR2 command line option to enable heap dump generation on SIGUSR2
  2. Add a heap-dump script in the application's package.json that sends SIGUSR2 with kill
  3. Use kubectl exec to run the heap-dump script and then kubectl cp to copy out the dump (or the equivalent in non-Kubernetes environments)
  • Approach 2
  1. Add a hidden REST call to the application to trigger a heap dump (only accessible internally)
  2. When the REST call is invoked, generate the heap dump into a shared mount (a sketch of this approach follows the list)
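
A minimal sketch of Approach 2 (the port, route, and mount path are placeholders, and the endpoint should only be reachable from inside your network):

```js
// Sketch: an internal-only endpoint that writes a heap snapshot to a shared
// mount. DUMP_DIR, the route, and the port are placeholders.
const http = require('http');
const path = require('path');
const v8 = require('v8');

const DUMP_DIR = '/dumps'; // hypothetical shared mount

http.createServer((req, res) => {
  if (req.method === 'POST' && req.url === '/internal/heap-dump') {
    const file = path.join(DUMP_DIR, `heap-${Date.now()}.heapsnapshot`);
    // writeHeapSnapshot blocks the event loop while the snapshot is written.
    const written = v8.writeHeapSnapshot(file);
    res.end(`snapshot written to ${written}\n`);
  } else {
    res.statusCode = 404;
    res.end();
  }
}).listen(9901, '127.0.0.1'); // hypothetical admin port, bound to localhost only
```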

Generating and extracting core dumps

The team has successfully used core-dump-handler for generating and extracting core dumps in Kubernetes environments.

Dynamically change log levels

The team has used a hidden REST call in the application which allows changing the log level. See Node.js Reference Architecture, Part 2: Logging in Node.js for more details.
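
A minimal sketch of that pattern, assuming the pino logger (the port and route are placeholders, and the endpoint should only be reachable internally):

```js
// Sketch: an internal-only endpoint that changes the log level at runtime.
// Assumes pino; the same pattern applies to other loggers.
const http = require('http');
const pino = require('pino');

const logger = pino({ level: 'info' });

http.createServer((req, res) => {
  const url = new URL(req.url, 'http://localhost');
  if (req.method === 'POST' && url.pathname === '/internal/log-level') {
    const level = url.searchParams.get('level');
    if (level && logger.levels.values[level] !== undefined) {
      logger.level = level; // for example, switch from info to debug while investigating
      res.end(`log level set to ${level}\n`);
    } else {
      res.statusCode = 400;
      res.end('unknown level\n');
    }
    return;
  }
  res.statusCode = 404;
  res.end();
}).listen(9902, '127.0.0.1'); // hypothetical admin port, bound to localhost only
```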

Further Reading

Introduction to the Node.js reference architecture: How to investigate 7 common problems in production