Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

stats/opentelemetry: Introduce Tracing API #7852

Open
wants to merge 21 commits into
base: master
Choose a base branch
from
Open

Conversation

aranjans
Copy link
Contributor

@aranjans aranjans commented Nov 18, 2024

This pull request adds the OpenTelemetry tracing API to the grpc-go opentelemetry plugin as outlined in proposal A72.

RELEASE NOTES:

  • stats/opentelemetry: Introduces an experimental API for enabling and configuring OpenTelemetry tracing within gRPC-go opentelemetry plugin under stats/opentelemetry/experimental. This includes the addition of TraceOptions in the Options struct to allow users to specify the TraceProvider and TextMapPropagator.

@aranjans aranjans added this to the 1.69 Release milestone Nov 18, 2024
@aranjans aranjans added Type: Feature New features or improvements in behavior Area: Observability Includes Stats, Tracing, Channelz, Healthz, Binlog, Reflection, Admin, GCP Observability labels Nov 18, 2024
Copy link

codecov bot commented Nov 18, 2024

Codecov Report

Attention: Patch coverage is 81.75182% with 25 lines in your changes missing coverage. Please review.

Project coverage is 82.05%. Comparing base (56a14ba) to head (5571e3b).
Report is 8 commits behind head on master.

Files with missing lines Patch % Lines
stats/opentelemetry/trace.go 75.00% 9 Missing and 3 partials ⚠️
stats/opentelemetry/client_metrics.go 85.71% 5 Missing and 2 partials ⚠️
stats/opentelemetry/client_tracing.go 72.72% 2 Missing and 1 partial ⚠️
stats/opentelemetry/server_tracing.go 78.57% 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7852      +/-   ##
==========================================
- Coverage   82.09%   82.05%   -0.04%     
==========================================
  Files         379      384       +5     
  Lines       38261    38652     +391     
==========================================
+ Hits        31409    31716     +307     
- Misses       5549     5611      +62     
- Partials     1303     1325      +22     
Files with missing lines Coverage Δ
...elemetry/experimental/grpc_trace_bin_propagator.go 87.50% <ø> (ø)
stats/opentelemetry/opentelemetry.go 76.51% <100.00%> (+0.64%) ⬆️
stats/opentelemetry/server_metrics.go 89.82% <100.00%> (+0.44%) ⬆️
stats/opentelemetry/client_tracing.go 72.72% <72.72%> (ø)
stats/opentelemetry/server_tracing.go 78.57% <78.57%> (ø)
stats/opentelemetry/client_metrics.go 86.47% <85.71%> (-1.46%) ⬇️
stats/opentelemetry/trace.go 75.00% <75.00%> (ø)

... and 25 files with indirect coverage changes

@aranjans aranjans requested a review from purnesh42H November 18, 2024 08:22
@purnesh42H purnesh42H self-assigned this Nov 19, 2024
examples/features/opentelemetry/client/main.go Outdated Show resolved Hide resolved
gcp/observability/go.mod Outdated Show resolved Hide resolved
clientconn.go Outdated Show resolved Hide resolved
interop/xds/go.sum Outdated Show resolved Hide resolved
@@ -119,22 +134,46 @@ func (h *clientStatsHandler) streamInterceptor(ctx context.Context, desc *grpc.S
}

startTime := time.Now()
ctx, span := h.createCallTraceSpan(ctx, method)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think we should just not call createCallTraceSpan if tracing is disabled

Copy link
Contributor Author

@aranjans aranjans Nov 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack

stats/opentelemetry/client_metrics.go Outdated Show resolved Hide resolved
stats/opentelemetry/client_metrics.go Outdated Show resolved Hide resolved
stats/opentelemetry/client_metrics.go Outdated Show resolved Hide resolved
stats/opentelemetry/trace.go Outdated Show resolved Hide resolved
stats/opentelemetry/opentelemetry.go Outdated Show resolved Hide resolved
@purnesh42H
Copy link
Contributor

@aranjans please update this with latest main branch to get rid of all go.mod and go.sum changes

@purnesh42H purnesh42H assigned aranjans and unassigned purnesh42H Nov 20, 2024
@aranjans aranjans force-pushed the a72 branch 3 times, most recently from 378ddd3 to 8cb8222 Compare November 21, 2024 17:44
@aranjans aranjans assigned purnesh42H and unassigned aranjans Nov 22, 2024
@aranjans
Copy link
Contributor Author

@purnesh42H Thanks for your review, I have addressed all your comments and this PR is ready for another pass.

Copy link
Contributor

@purnesh42H purnesh42H left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some more non-test comments. Will review tests in next pass

clientconn.go Outdated Show resolved Hide resolved
stats/handlers.go Outdated Show resolved Hide resolved
stats/opentelemetry/client_metrics.go Outdated Show resolved Hide resolved
stream.go Outdated Show resolved Hide resolved
stats/opentelemetry/client_metrics.go Outdated Show resolved Hide resolved
stats/opentelemetry/client_metrics.go Outdated Show resolved Hide resolved
stats/opentelemetry/client_metrics.go Outdated Show resolved Hide resolved
stats/opentelemetry/client_metrics.go Outdated Show resolved Hide resolved
stats/opentelemetry/client_metrics.go Outdated Show resolved Hide resolved
Copy link
Contributor

@purnesh42H purnesh42H left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some more non-test comments. Will review tests in next pass

@purnesh42H purnesh42H assigned aranjans and unassigned purnesh42H Nov 22, 2024
@purnesh42H purnesh42H changed the title Implement A72: OpenTelemetry Tracing stats/opentelemetry: Introduce Tracing API Nov 22, 2024
@purnesh42H
Copy link
Contributor

@aranjans you should link the gRFC and concise your description

@aranjans
Copy link
Contributor Author

aranjans commented Nov 23, 2024

@purnesh42H I have addressed all the comments, and updated the description to link the grfc proposal.
Feel free to close the thread which are resolved now.

@aranjans aranjans assigned purnesh42H and unassigned aranjans Nov 23, 2024
stats/opentelemetry/client_metrics.go Outdated Show resolved Hide resolved
stats/opentelemetry/client_metrics.go Outdated Show resolved Hide resolved
stats/opentelemetry/client_metrics.go Outdated Show resolved Hide resolved
stats/opentelemetry/client_metrics.go Outdated Show resolved Hide resolved
stats/opentelemetry/client_metrics.go Outdated Show resolved Hide resolved
stats/opentelemetry/opentelemetry.go Outdated Show resolved Hide resolved
stats/opentelemetry/server_metrics.go Outdated Show resolved Hide resolved
stats/opentelemetry/trace.go Outdated Show resolved Hide resolved
stats/opentelemetry/trace.go Outdated Show resolved Hide resolved
@aranjans
Copy link
Contributor Author

@purnesh42H I have addressed all your comments, and updated the PR. Kindly review the PR.

stream.go Outdated
@@ -213,13 +213,13 @@ func newClientStream(ctx context.Context, desc *StreamDesc, cc *ClientConn, meth
// Provide an opportunity for the first RPC to see the first service config
// provided by the resolver.
if err := cc.waitForResolvedAddrs(ctx); err != nil {
cc.nameResolutionDelayed = true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this trying to capture? I don't think this is what you want.

For example, if name resolution takes 30 seconds to occur, and there is no deadline on the RPC, then this will not ever be set.

Also, making this a boolean field on the ClientConn can't be correct.

Copy link
Contributor

@purnesh42H purnesh42H Dec 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its for capturing the delay in name resolution https://github.com/grpc/proposal/blob/master/A72-open-telemetry-tracing.md#tracing-information

I think we need to specifically look for context timeout error to set this? waitForResolvedAddrs doesn't directly calculate delay; it waits for resolved addresses, and if resolution takes too long, it will return a timeout error?

What is the expected behavior when there is no deadline?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The goal of the feature is to determine, if a name resolution delay occurred during the RPC's lifespan, how long that took.

So basically you want to add events to the lifecycle of the RPC when these things occur:

grpc-go/clientconn.go

Lines 682 to 684 in e957825

select {
case <-cc.firstResolveEvent.Done():
return nil

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @dfawley and @purnesh42H for your thoughts, I will create a separate PR for this as a bonus feature.

@aranjans
Copy link
Contributor Author

@dfawley As per our offline discussion to exclude changes of adding event for name resolution delay, and create a separate PR. I have updated the PR.

stream.go Outdated
@@ -215,7 +215,6 @@ func newClientStream(ctx context.Context, desc *StreamDesc, cc *ClientConn, meth
if err := cc.waitForResolvedAddrs(ctx); err != nil {
return nil, err
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please just revert this file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

Comment on lines 30 to 31
// TextMapPropagator propagates span context through text map carrier.
TextMapPropagator propagation.TextMapPropagator
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the user supposed to be able to override this?

We have a default for them, though, right? That should get documented.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah GRPCTraceBinPropagator is what user can set if they are in the migration path i.e. using opencensus but its not default. The recommendation is to use W3CContextPropagator provided by otel if they are not using opencensus

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this go into experimental/opentelemetry, or even experimental/stats/opentelemetry? Either way there should be comments indicating that it will be moved (and where it will be moved to, and when).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

Comment on lines 50 to 55
func init() {
otelinternal.SetPluginOption = func(o *Options, po otelinternal.PluginOption) {
o.MetricsOptions.pluginOption = po
}
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you move this back to where it was to avoid the code churn please?

@@ -66,6 +68,15 @@ func (h *serverStatsHandler) initializeMetrics() {
rm.registerMetrics(metrics, meter)
}

func (h *serverStatsHandler) initializeTracing() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this go into a new file? server_tracing.go?

@@ -171,6 +176,14 @@ func removeLeadingSlash(mn string) string {
return strings.TrimLeft(mn, "/")
}

func isMetricsDisabled(mo MetricsOptions) bool {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code now looks like:

if !isMetricsDisabled() {
}

It should instead be:

if isMetricsEnabled() {
}

to avoid a double negative.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

Comment on lines 76 to 77
otel.SetTextMapPropagator(h.options.TraceOptions.TextMapPropagator)
otel.SetTracerProvider(h.options.TraceOptions.TracerProvider)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, these are global settings. I don't think that's what we want... Only the user should ever be calling these kinds of function, not our library.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ping?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dfawley you're right, ideally user should set these variables. I have updated the PR, kindly check.

return mo.MeterProvider == nil
}

func isTracingDisabled(to experimental.TraceOptions) bool {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should discuss options, but requiring these to be called multiple times from the stats handler feels problematic. Maybe there's a tracing stats handler and a metrics stats handler and we combine them based on which ones are enabled?

At the very least, these should become methods so we don't have to keep reaching into our data structures when calling them to find the options fields.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dfawley I have updated the PR with the later approach, as it makes sense to me for this to be method of Options struct. Let me know if you think this can be improved.

}
ri := &rpcInfo{
ai.startTime = startTime
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you change how this is set?

Why are you setting a local 3 lines above to time.Now() and then assigning this field to that?

Copy link
Contributor Author

@aranjans aranjans Dec 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated now, thAnks.

Comment on lines 211 to 212
ai := &attemptInfo{}
startTime := time.Now()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar comments as above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@dfawley dfawley assigned aranjans and unassigned dfawley Dec 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area: Observability Includes Stats, Tracing, Channelz, Healthz, Binlog, Reflection, Admin, GCP Observability Type: Feature New features or improvements in behavior
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants