Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[OPIK-99] Implement batching for spans creation #298

Merged
merged 26 commits into from
Sep 30, 2024

Conversation

alexkuzmik
Copy link
Collaborator

@alexkuzmik alexkuzmik commented Sep 20, 2024

Details

Added batching for CREATE-SPAN requests.
Batches are flushed to the streamer queue automatically every second or whenever accumulated message items reach max batch size limit (at the moment - 1000 items).

Batching works only inside the SDK, if someone tries to do the following

import opik
client = opik.Opik()

... batching will not work for that client (and all clients derived from it).

Under the hood of the SDK we instantiate Opik client like that

client = opik.Opik(_use_batching=True)

Issues

Resolves the issue when SDK sends a lot of CREATE-SPAN requests to the backend independetly, loading it and utilizing more network resources.
Backend should be less loaded,

Testing

Added unit tests. E2E tests also pass, but this PR requires extra QA.

Documentation

@alexkuzmik alexkuzmik requested a review from a team as a code owner September 20, 2024 16:00
@alexkuzmik alexkuzmik self-assigned this Sep 20, 2024
Copy link
Collaborator

@andrescrz andrescrz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One comment, but mostly LGTM. I had maybe some comment about optimising the batcher lock, but I don't want to create additional noise.

Once this is reviewed and validated, from my side we're good to go.

japdubengsub
japdubengsub previously approved these changes Sep 24, 2024
thiagohora
thiagohora previously approved these changes Sep 24, 2024
Copy link
Collaborator

@andrescrz andrescrz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed, let's debug why span name is none with this change before proceeding with the merge.

Copy link
Collaborator

@andrescrz andrescrz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few more comments while the discussed items are implemented.

@@ -34,7 +34,7 @@ jobs:
- name: Run latest Opik server
run: |
cd deployment/docker-compose
docker compose up -d
docker compose up -d --build --force-recreate --no-deps
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's double check if these flags --force-recreate --no-deps were just a temporary fix or if we need it in the long run:

If the later, we need to propagate this everywhere.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

@thiagohora thiagohora Sep 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue is that --build only builds if the tag specified doesn't exist yet. However, all PR seems to run with the value.

https://github.com/comet-ml/opik/blob/main/deployment/docker-compose/docker-compose.yaml#L61

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This forces the rebuid of the image/tag

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will conflict with this PR:

.github/workflows/sdk-e2e-tests.yaml Show resolved Hide resolved
Comment on lines +65 to +66
args:
OPIK_VERSION: ${OPIK_VERSION:-latest}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same, let's double check if this was temporary.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@thiagohora ? I thought it's a part of the fix.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, shit seems to be missing in our docker image. We such a variable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Screenshot 2024-09-25 at 16 49 55

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@liyaka please see this thread above.

Copy link
Collaborator

@andrescrz andrescrz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's just wait on some input from @liyaka for the build fixes. Anything else LGTM.

@@ -34,7 +34,7 @@ jobs:
- name: Run latest Opik server
run: |
cd deployment/docker-compose
docker compose up -d
docker compose up -d --build --force-recreate --no-deps
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will conflict with this PR:

Comment on lines +65 to +66
args:
OPIK_VERSION: ${OPIK_VERSION:-latest}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@liyaka please see this thread above.

@andrescrz andrescrz force-pushed the OPIK-99-sdk-implement-batching-for-spans-creation branch from 6357040 to ed29744 Compare September 30, 2024 13:59
andrescrz

This comment was marked as outdated.

@andrescrz andrescrz force-pushed the OPIK-99-sdk-implement-batching-for-spans-creation branch from 9d5d062 to 944fa9d Compare September 30, 2024 14:11
Copy link
Collaborator

@andrescrz andrescrz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Undone the e2e infra changes as they conflict and they should tackled separately.

@andrescrz andrescrz merged commit 939accf into main Sep 30, 2024
21 checks passed
@andrescrz andrescrz deleted the OPIK-99-sdk-implement-batching-for-spans-creation branch September 30, 2024 14:16
dsblank pushed a commit that referenced this pull request Oct 4, 2024
* Batch processing draft

* Batching draft

* [DRAFT] Update backend emulator for tests, update message processing to use batching

* Fix lint errors

* Implement _process_create_span_batch_message method in MessageProcessor

* Remove assert, add debug log message

* Add base batcher unit test

* Make batch manager flush all batches if streamer.flush() was called, add tests for span batcher

* Add tests for batch manager

* Add test for flushing thread

* Add one more test for batch manager

* Fix lint errors

* Fix lint errors

* Fix bug when backend emulator didn't add feedback scores to traces

* Fix lint errors

* Rename flush_interval to flush_interval_seconds

* Enable debug logs for e2e tests. Update debug messages in message_processors.py

* Remove export statement from e2e workflow

* Enable backend build. Enable debug logs for e2e tests

* Update docker compose and e2e workflow file

* Make batching disabled by default in Opik client. It is now enabled manually in Opik clients created under the hood of SDK

* Add more unit tests for message processing

* Add docstring for _use_batching parameter

* Add missing _SECONDS suffix

* Rename constant

* Undone e2e tests infra changes

---------

Co-authored-by: Andres Cruz <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants