
[NA] Split bulk operations in smaller size and making single ClickHouse call per subset #176

Conversation

thiagohora
Contributor

@thiagohora thiagohora commented Sep 4, 2024

Details

Before:
Screenshot 2024-09-04 at 10 02 59

After:
Screenshot 2024-09-04 at 10 01 04

Resolves #

Testing

Documentation

@thiagohora thiagohora requested a review from a team as a code owner September 4, 2024 07:51
@thiagohora thiagohora self-assigned this Sep 4, 2024
Collaborator

@andrescrz andrescrz left a comment


This workaround certainly seems to improve the situation. On the other hand, it moves the responsibility for batching into our service layer, even though batching should be a pretty standard thing, especially considering that we're using ClickHouse, which is very well suited for it.

I have the feeling that the previous implementation worked, but it was simply not the right way to implement a batch insert. Let's first try the normal batch feature in the R2DBC client and see how it goes:

String sql = template.render();

var statement = connection.createStatement(sql);
return Flux.from(statement.execute());
Collaborator


We should try a createBatch call and work from there.
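For reference, a minimal sketch of what a `createBatch`-based approach could look like with the R2DBC SPI's `Batch` interface (illustrative only; `renderedStatements` is a hypothetical list of pre-rendered SQL strings):

```java
// Sketch using the R2DBC SPI Batch API (io.r2dbc.spi.Batch).
// Note: Batch.add(...) accepts only raw SQL strings; there is no
// parameter binding, which is the limitation discussed in this thread.
Batch batch = connection.createBatch();
for (String sql : renderedStatements) { // hypothetical pre-rendered SQL
    batch.add(sql);
}
return Flux.from(batch.execute());
```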

Contributor Author


As discussed over Slack, the problem with R2DBC's batch implementation is that it only accepts plain SQL strings. There is no parameter binding, and using string templates would open the door to SQL injection. That's why I implemented it like this.

Collaborator


Discussed offline. The batch support in R2DBC is very limited, as it only allows passing SQL statements as Strings, without the possibility of binding.

Collaborator

@andrescrz andrescrz left a comment


Given the current circumstances, it's OK to go with this as a temporary workaround.

I have the feeling that the problems might derive from the lack of some configuration in ClickHouse to favour bulk inserts, such as buffer tables, async inserts, etc.
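For reference, the asynchronous-insert behaviour alluded to here is controlled by real ClickHouse settings; the values below are illustrative, not a recommendation from this PR:

```sql
-- Let ClickHouse buffer small inserts server-side and flush them in bulk
-- (illustrative values; tune per workload).
SET async_insert = 1;
SET wait_for_async_insert = 1;
```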

) AS new
LEFT JOIN (
SELECT
*
FROM dataset_items
WHERE id = :id
Collaborator


This is going to be problematic as it will result in a full scan of the table.

Contributor Author


That is a good point. I will at least add the workspace to the filter.

@@ -379,47 +388,49 @@ public Mono<Long> save(@NonNull UUID datasetId, @NonNull List<DatasetItem> items
return Mono.empty();
}

return inset(datasetId, items)
.retryWhen(AsyncUtils.handleConnectionError());
return inset(datasetId, items);
Collaborator


Minor: typo here inset instead of insert.

}

private Mono<Long> inset(UUID datasetId, List<DatasetItem> items) {
return asyncTemplate.nonTransaction(connection -> {
List<List<DatasetItem>> batches = Lists.partition(items, bulkConfig.getSize());
Collaborator


This partitioning increases the chances of partial insertions, especially under error scenarios, and the problems derived from them.

Any idea how far it is from the ClickHouse query size limit with batches of 1000 items? (for dataset items, for experiment items, etc.)
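The partial-insertion risk raised here can be illustrated with a plain-`java.util` equivalent of the Guava `Lists.partition` call used in the PR:

```java
import java.util.ArrayList;
import java.util.List;

public class BatchPartition {
    // Plain-Java equivalent of Guava's Lists.partition: splits a list
    // into consecutive sublists of at most `size` elements each.
    static <T> List<List<T>> partition(List<T> items, int size) {
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < items.size(); i += size) {
            batches.add(items.subList(i, Math.min(i + size, items.size())));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Integer> items = List.of(1, 2, 3, 4, 5);
        System.out.println(partition(items, 2)); // [[1, 2], [3, 4], [5]]
        // If the insert of batch [3, 4] fails, batch [1, 2] is already
        // persisted: that is the partial-insertion scenario flagged above.
    }
}
```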

Contributor Author


Depending on the JSON size, we can increase the query size limit, but I believe we might need this in some way. I also thought about partial inserts, but then I checked, and transactions are still not supported in the R2DBC driver. Once they are, we could wrap all the requests in a transaction or create a proper pipeline to keep track of batch inserts.

Comment on lines +9 to +17
public static class QueryItem {
public final int index;
public final boolean hasNext;

public QueryItem(int index, boolean hasNext) {
this.index = index;
this.hasNext = hasNext;
}
}
Collaborator


Minor: use Lombok.
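For illustration, Lombok's `@Value` would generate the constructor and final fields on the original class; on Java 16+ a record gives the same brevity without the annotation processor. A sketch of the record form (field names taken from the original holder):

```java
public class QueryItemDemo {
    // Record equivalent of the hand-written QueryItem holder;
    // Lombok's @Value on the original class generates a similar
    // constructor and immutable fields.
    record QueryItem(int index, boolean hasNext) {}

    public static void main(String[] args) {
        QueryItem last = new QueryItem(2, false);
        System.out.println(last.index() + " " + last.hasNext()); // 2 false
    }
}
```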

}
}

public static List<QueryItem> getQueryItemPlaceHolder(Collection<?> items) {
Collaborator


Minor: only the size of the collection is really used.
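Following that observation, the method could take an `int` instead of a `Collection`. A hypothetical reworking (the placeholder logic shown is an assumption, not the PR's actual implementation):

```java
import java.util.ArrayList;
import java.util.List;

public class QueryItems {
    record QueryItem(int index, boolean hasNext) {}

    // Hypothetical reworking of getQueryItemPlaceHolder: only the item
    // count matters, so an int parameter suffices. Each placeholder knows
    // its position and whether another item follows it.
    static List<QueryItem> getQueryItemPlaceHolder(int size) {
        List<QueryItem> placeholders = new ArrayList<>(size);
        for (int i = 0; i < size; i++) {
            placeholders.add(new QueryItem(i, i < size - 1));
        }
        return placeholders;
    }

    public static void main(String[] args) {
        List<QueryItem> items = getQueryItemPlaceHolder(3);
        System.out.println(items.size());              // 3
        System.out.println(items.get(2).hasNext());    // false
    }
}
```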

@thiagohora thiagohora merged commit 417d40b into main Sep 4, 2024
2 checks passed
@thiagohora thiagohora deleted the thiagohora/split_bulk_operations_in_smaller_size_and_making_single_clickhose_call_per_subset branch September 4, 2024 11:38