---
sidebar_position: 6
---
It is important to be able to investigate problems that occur in production. Team members have typically used one of these implementations:
- existing Application Performance Management offering
- custom solution leveraging available platform tools (for example, those provided by a Cloud provider or OpenShift)
Regardless of the implementation, the team's experience is that it is important to plan for problem determination up front and to use metrics to identify when problems occur and need to be investigated. You can read more about the suggested approach for capturing metrics in the metrics section.
Once your metrics have identified that there is a problem, it will often fall into one of a number of common categories. The team's recommendations and guidance below are based on the implementations and typical categories of problems that we have seen.
### Memory leaks

**Symptoms**
- metrics report that application is slowly becoming less responsive
- metrics report heap size is continuously growing despite stable load on the application
- metrics report increasing time spent running the garbage collector
**Approach**
- generate a sequence of heap dumps at one-minute intervals and compare them using the Chrome DevTools to see what is growing
- enable the heap profiler and look at where allocations are taking place
Generating heap dumps will have a performance impact on the process, both in terms of memory and execution time, while the dump is being generated. Some of the team's suggestions for limiting the impact:
- only enable heap dumps for one process/instance of the application, leaving the others to better serve customers
- try to minimize the size of the heap dumps. For example, if you increased the memory allocated to the application as part of trying to investigate the issue, revert that change. If possible, limit the instance of the application to using less memory than you normally would.
- make sure you have enough additional memory available on the machine running the process. The dump may need double the size of the heap while it is being generated.
- make sure to disable Kubernetes liveness checks (this may require a new deployment with that setting) or the periodic liveness check may kill your app before it finishes creating the heap dump.
### Poor performance or hangs

**Symptoms**
- process health checks fail
- metrics report longer than expected request times
- metrics report higher than expected CPU usage
**Approach**
- review application logs, turning on additional levels of logging if necessary
- if you have a log level that includes profiling logs, turn it on
- review transaction traces
- generate a diagnostic report and look for red flags
- generate a flame graph. The team has had success with 0x, bubbleprof, and Flame
It may be difficult to investigate performance issues in the production system. Once you have identified that there is a problem, being able to recreate it in another environment that is similar to production makes it much easier to investigate. The team has had success using goreplay as well as Autocannon to generate load in test environments in order to reproduce issues seen in production.
If you must investigate in production, you may have to disable or extend the health check period so that you can generate diagnostic reports and/or flame graphs. Ensure that you don't allow more than one process to be in the hung/slow state while investigating in order to limit impact on end users.
### Unexpected failure responses

**Symptoms**
- metrics report failure responses
**Approach**
- review application logs, turning on additional levels of logging if necessary
- review transaction traces
If you write your application in a language like TypeScript as opposed to plain JavaScript, raw stack traces may not map directly to your source code. In these cases, tools typically support the generation of source maps, which can help map back to the original source code.
Source maps should be generated and stored with the original source code for each release. For front-end JavaScript, teams should validate the process of using offline source maps in their typical problem-investigation workflow. This article has some suggestions for doing that.
### Crashes due to unhandled errors

**Symptoms**
- process terminates reporting an unhandled rejection or exception
**Approach**
- review the logs to find the stack trace, then review the code indicated in the stack trace to look for the cause
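To make sure the stack trace actually reaches the logs, handlers for unhandled errors can log before the process exits. A minimal sketch for unhandled promise rejections (the exit behavior is an assumption about your restart policy):

```javascript
// Record the reason so it can be logged; in production the process
// would then exit so the platform can restart it cleanly.
let lastRejection;

process.on('unhandledRejection', (reason) => {
  lastRejection = reason;
  console.error(
    'Unhandled rejection:',
    reason instanceof Error ? reason.stack : reason
  );
  // In production you would typically exit here and let the
  // platform restart the process:
  // process.exit(1);
});

// Simulate a promise rejection with no catch handler attached.
Promise.reject(new Error('simulated failure'));
```

A similar handler can be registered for `'uncaughtException'`; in both cases the process should be treated as being in an unknown state and restarted after logging.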
### Resource exhaustion

**Symptoms**
- process fails reporting lack of resources (for example sockets)
**Approach**
- generate a sequence of diagnostic reports
- compare the resources reported between the reports using report-toolkit
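When comparing reports, the `libuv` section lists the open handles (sockets, timers, and so on). A sketch of counting handle types from within the process using `process.report.getReport()`, which yields the same data without writing a file:

```javascript
// Group the libuv handles in a diagnostic report by type; comparing
// these counts across two reports taken some time apart highlights
// resources that are growing without being released.
function handleCounts() {
  const counts = {};
  for (const handle of process.report.getReport().libuv) {
    counts[handle.type] = (counts[handle.type] || 0) + 1;
  }
  return counts;
}

console.log(handleCounts());
```

For example, a steadily growing `tcp` count between two reports under stable load points at sockets that are opened but never closed.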
### Connectivity issues

**Symptoms**
- external health checks fail, while internal metrics look ok
**Approach**
- use curl to check connectivity/responsiveness from different parts of the network
- if you are using Istio, check its logs
### Native crashes

**Symptoms**
- unexpected process crash/restart, with logs indicating a native crash (SIGSEGV or equivalent)
**Approach**
- enable core dump collection
- analyze the core dump with llnode
## Application Performance Management offerings

If you are operating a service where you own all deployments, an existing Application Performance Management (APM) solution can make sense if budget is not an issue (they can become expensive for large deployments). APMs offer the advantage of requiring less up-front investment and of leveraging a solution designed to help investigate problems. The team has had success using Dynatrace, Instana, and New Relic.
If you develop applications that will be operated by customers, adding a dependency on a specific APM for problem determination is not recommended. The cost may be an issue for some customers, while others may already have standardized on a specific APM. Larger organizations in particular are likely to want to leverage their APM of choice across applications. It is recommended that, whenever possible, you leverage standard clients that will be able to feed data to different APMs and/or custom solutions.
Typically, APMs will be able to help you generate and consume logs, metrics, and traces, as well as generate heap dumps, in order to support problem determination for the common problems outlined above.
If you rely on an APM, it is important to ensure you have enough licenses so that you can use the APM for local development/debugging in addition to production deployments.
## Platform tools

Platform tools will provide some support for generating key metrics as well as storing logs and metrics. The Node.js Performance Timing API, and more specifically performance.now(), may be useful for capturing the time at which events occur.
The level of support available will depend on the platform, with cloud platforms and Kubernetes-based platforms (for example, OpenShift) providing more built-in capabilities than a plain operating system. The team has had success using Prometheus and Grafana to store, graph, and analyze metrics and logs on platforms that do not have integrated support.
You will need to plan for how you instrument the application in order to capture logs, metrics, and traces, with our recommendations outlined in those sections.
You will also need to plan for how to:
- generate and extract heap snapshots
- generate and extract core dumps
- dynamically change log levels
Cloud platforms also often provide templates for these activities so if you are on a specific cloud platform look for the corresponding templates.
A key consideration is how to make the additional components required for problem determination available. Ephemeral containers may help in the future, but today you most likely need to include the components in your deployment or pre-plan/approve the process to add them to the production environment when needed.
### Generating and extracting heap snapshots
The team has used a few approaches to generate and extract heap snapshots:
- Approach 1
  - use the `--heapsnapshot-signal=SIGUSR2` command line option to enable heap dump generation on `SIGUSR2`
  - add a heap-dump script in the application's package.json to trigger `SIGUSR2` with `kill`
  - use `kubectl exec` to run the heap-dump script and then `kubectl cp` to copy out the dump (or the equivalent in non-Kubernetes environments)
- Approach 2
  - add a hidden REST call to the application to trigger a heap dump (only accessible internally)
  - when the REST call is invoked, generate the heap dump into a shared mount
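A sketch of the package.json script mentioned in Approach 1; the `pgrep` usage assumes a container image that provides `pgrep` and runs a single Node.js process, and the script name is illustrative:

```json
{
  "scripts": {
    "heap-dump": "kill -USR2 $(pgrep -o node)"
  }
}
```

With `--heapsnapshot-signal=SIGUSR2` on the Node.js command line, running `kubectl exec <pod> -- npm run heap-dump` would then trigger the snapshot.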
### Generating and extracting core dumps
The team has successfully used core-dump-handler for generating and extracting core dumps in Kubernetes environments.
### Dynamically changing log levels
The team has used a hidden REST call in the application which allows changing the log level. See Node.js Reference Architecture, Part 2: Logging in Node.js for more details.