
[NA] Split bulk operations in smaller size and making single ClickHouse call per subset #176

Conversation

thiagohora
Contributor

@thiagohora thiagohora commented Sep 4, 2024

Details

Before:
Screenshot 2024-09-04 at 10 02 59

After:
Screenshot 2024-09-04 at 10 01 04

Resolves #

Testing

Documentation

@thiagohora thiagohora requested a review from a team as a code owner September 4, 2024 07:51
@thiagohora thiagohora self-assigned this Sep 4, 2024
Collaborator

@andrescrz andrescrz left a comment


This workaround certainly seems to improve the situation. On the other hand, it moves the responsibility for batching into our service layer, even though batching should be a pretty standard thing, especially considering that we're using ClickHouse, which is very well suited for it.

I have the feeling that the previous implementation worked, but it was simply not the right way to implement a batch insert. Let's first try the normal batch feature in the R2DBC client and see how it goes:

String sql = template.render();

var statement = connection.createStatement(sql);
return Flux.from(statement.execute());
Collaborator


We should try a createBatch call and work from there.
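For reference, a minimal sketch of what a `createBatch`-based approach could look like with the R2DBC SPI's `Batch` interface (illustrative only; `renderedStatements` is a hypothetical list of pre-rendered SQL strings):

```java
// Sketch using the R2DBC SPI Batch API (io.r2dbc.spi.Batch).
// Note: Batch.add(...) accepts only raw SQL strings; there is no
// parameter binding, which is the limitation discussed in this thread.
Batch batch = connection.createBatch();
for (String sql : renderedStatements) { // hypothetical pre-rendered SQL
    batch.add(sql);
}
return Flux.from(batch.execute());
```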

Contributor Author


As discussed over Slack, the problem with R2DBC's batch implementation is that it only accepts plain SQL strings. There is no parameter binding, and using string templates would open the door to SQL injection. That's why I implemented it like this.

Collaborator


Discussed offline. The batch support in R2DBC is very limited, as it only allows passing SQL statements as Strings, without the possibility of binding.

Collaborator

@andrescrz andrescrz left a comment


Given the current circumstances, it's OK to go with this as a temporary workaround.

I have the feeling that the problems might derive from the lack of some configuration in ClickHouse to favour bulk inserts, such as buffer tables, async inserts, etc.
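For reference, the asynchronous-insert behaviour alluded to here is controlled by real ClickHouse settings; the values below are illustrative, not a recommendation from this PR:

```sql
-- Let ClickHouse buffer small inserts server-side and flush them in bulk
-- (illustrative values; tune per workload).
SET async_insert = 1;
SET wait_for_async_insert = 1;
```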

) AS new
LEFT JOIN (
SELECT
*
FROM dataset_items
WHERE id = :id
Collaborator


This is going to be problematic as it will result in a full scan of the table.

Contributor Author


That is a good point. I will at least add the workspace to the filter.

@@ -379,47 +388,49 @@ public Mono<Long> save(@NonNull UUID datasetId, @NonNull List<DatasetItem> items
return Mono.empty();
}

return inset(datasetId, items)
.retryWhen(AsyncUtils.handleConnectionError());
return inset(datasetId, items);
Collaborator


Minor: typo here inset instead of insert.

}

private Mono<Long> inset(UUID datasetId, List<DatasetItem> items) {
return asyncTemplate.nonTransaction(connection -> {
List<List<DatasetItem>> batches = Lists.partition(items, bulkConfig.getSize());
Collaborator


This partitioning increases the chances of partial insertions, especially under error scenarios, and the problems derived from them.

Any idea how far it is from the ClickHouse query size limit with batches of 1000 items? (for dataset items, for experiment items, etc.)
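The partial-insertion risk raised here can be illustrated with a plain-`java.util` equivalent of the Guava `Lists.partition` call used in the PR:

```java
import java.util.ArrayList;
import java.util.List;

public class BatchPartition {
    // Plain-Java equivalent of Guava's Lists.partition: splits a list
    // into consecutive sublists of at most `size` elements each.
    static <T> List<List<T>> partition(List<T> items, int size) {
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < items.size(); i += size) {
            batches.add(items.subList(i, Math.min(i + size, items.size())));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Integer> items = List.of(1, 2, 3, 4, 5);
        System.out.println(partition(items, 2)); // [[1, 2], [3, 4], [5]]
        // If the insert of batch [3, 4] fails, batch [1, 2] is already
        // persisted: that is the partial-insertion scenario flagged above.
    }
}
```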

Contributor Author


Depending on the JSON size, we can increase the query size limit, but I believe we might need this in some way. I also thought about partial inserts, but then I checked, and transactions are still not supported in the R2DBC driver. Once they are, we could wrap all the requests in a transaction or create a proper pipeline to keep track of batch inserts.

Comment on lines +9 to +17
public static class QueryItem {
public final int index;
public final boolean hasNext;

public QueryItem(int index, boolean hasNext) {
this.index = index;
this.hasNext = hasNext;
}
}
Collaborator


Minor: use Lombok.
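For illustration, Lombok's `@Value` would generate the constructor and final fields on the original class; on Java 16+ a record gives the same brevity without the annotation processor. A sketch of the record form (field names taken from the original holder):

```java
public class QueryItemDemo {
    // Record equivalent of the hand-written QueryItem holder;
    // Lombok's @Value on the original class generates a similar
    // constructor and immutable fields.
    record QueryItem(int index, boolean hasNext) {}

    public static void main(String[] args) {
        QueryItem last = new QueryItem(2, false);
        System.out.println(last.index() + " " + last.hasNext()); // 2 false
    }
}
```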

}
}

public static List<QueryItem> getQueryItemPlaceHolder(Collection<?> items) {
Collaborator


Minor: only the size of the collection is really used.
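Following that observation, the method could take an `int` instead of a `Collection`. A hypothetical reworking (the placeholder logic shown is an assumption, not the PR's actual implementation):

```java
import java.util.ArrayList;
import java.util.List;

public class QueryItems {
    record QueryItem(int index, boolean hasNext) {}

    // Hypothetical reworking of getQueryItemPlaceHolder: only the item
    // count matters, so an int parameter suffices. Each placeholder knows
    // its position and whether another item follows it.
    static List<QueryItem> getQueryItemPlaceHolder(int size) {
        List<QueryItem> placeholders = new ArrayList<>(size);
        for (int i = 0; i < size; i++) {
            placeholders.add(new QueryItem(i, i < size - 1));
        }
        return placeholders;
    }

    public static void main(String[] args) {
        List<QueryItem> items = getQueryItemPlaceHolder(3);
        System.out.println(items.size());              // 3
        System.out.println(items.get(2).hasNext());    // false
    }
}
```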

@thiagohora thiagohora merged commit 417d40b into main Sep 4, 2024
2 checks passed
@thiagohora thiagohora deleted the thiagohora/split_bulk_operations_in_smaller_size_and_making_single_clickhose_call_per_subset branch September 4, 2024 11:38