[NA] Split bulk operations into smaller subsets and make a single ClickHouse call per subset #176
Conversation
This workaround certainly seems to improve the situation. On the other hand, it moves the responsibility for batching, which should be a pretty standard thing, into our service layer, especially considering that we're using ClickHouse, which is very well suited for it.
I have the feeling that the previous implementation worked, but it was simply not the right way to implement a batch insert. Let's first try the normal batch feature in the R2DBC client and see how it goes:
return Flux.from(statement.execute());
String sql = template.render();

var statement = connection.createStatement(sql);
We should try a createBatch call and work from there.
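For reference, a minimal sketch of what a createBatch-based attempt could look like against the generic R2DBC SPI; the class, method, and variable names are illustrative, not code from this PR:

import java.util.List;

import org.reactivestreams.Publisher;

import io.r2dbc.spi.Batch;
import io.r2dbc.spi.Connection;
import io.r2dbc.spi.Result;

import reactor.core.publisher.Flux;

// Hedged sketch, not the PR's code: Batch groups several statements into a single
// round trip, but the SPI only accepts complete SQL strings here (no parameter binding).
class BatchInsertSketch {

    Publisher<? extends Result> executeBatch(Connection connection, List<String> renderedInserts) {
        Batch batch = connection.createBatch();
        // Each entry must already be a fully rendered INSERT statement.
        renderedInserts.forEach(batch::add);
        return Flux.from(batch.execute());
    }
}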
As discussed over Slack, the problem with the R2DBC batch implementation is that it only accepts plain SQL strings. There is no parameter binding, and building the statements with string templates would open the door to SQL injection. That's why I implemented it like this.
Discussed offline. The batch support in R2DBC is very limited, as it only allows passing SQL statements as Strings, without the possibility of binding parameters.
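For contrast, a hedged sketch of the binding route through the R2DBC Statement API; whether the ClickHouse driver honours multiple binding groups via add() is an assumption, and the column names and item accessors are illustrative:

// Hedged sketch: Statement accepts named binds, and per the R2DBC SPI add() starts a
// new binding group so several rows can reuse the same parameterised INSERT.
Publisher<? extends Result> insertItems(Connection connection, List<DatasetItem> items) {
    Statement statement = connection.createStatement(
            "INSERT INTO dataset_items (id, workspace_id) VALUES (:id, :workspace_id)");
    for (int i = 0; i < items.size(); i++) {
        DatasetItem item = items.get(i);
        statement.bind("id", item.id())                    // illustrative accessors
                .bind("workspace_id", item.workspaceId());
        if (i < items.size() - 1) {
            statement.add(); // start the binding group for the next row
        }
    }
    return statement.execute();
}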
Given the current circumstances, it's OK to go with this as a temporary workaround.
I have the feeling that the problems might stem from missing ClickHouse configuration that favours bulk inserts, such as buffer tables, asynchronous inserts, etc.:
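For illustration, one hedged sketch of the asynchronous-insert idea: ClickHouse exposes async_insert and wait_for_async_insert as query-level settings, which could be attached to the rendered INSERT via a SETTINGS clause. The table, columns, and exact placement of the settings here are assumptions, not code from this PR:

// Hedged sketch: ask ClickHouse to buffer small INSERTs server-side and flush them in
// larger blocks. In ClickHouse INSERT syntax the SETTINGS clause sits before VALUES.
String sql = """
        INSERT INTO dataset_items (id, workspace_id)
        SETTINGS async_insert = 1, wait_for_async_insert = 1
        VALUES (:id, :workspace_id)
        """;
var statement = connection.createStatement(sql)
        .bind("id", itemId)               // illustrative bind values
        .bind("workspace_id", workspaceId);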
) AS new
LEFT JOIN (
    SELECT
        *
    FROM dataset_items
    WHERE id = :id
This is going to be problematic as it will result in a full scan of the table.
That is a good point. I will at least add the workspace to the filter.
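A hedged sketch of that follow-up, with the column and bind names assumed rather than taken from the PR:

// Hedged sketch: scope the lookup by workspace so ClickHouse can prune by the key
// columns instead of scanning the whole dataset_items table.
String existingItemsSql = """
        SELECT *
        FROM dataset_items
        WHERE id = :id
          AND workspace_id = :workspace_id
        """;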
@@ -379,47 +388,49 @@ public Mono<Long> save(@NonNull UUID datasetId, @NonNull List<DatasetItem> items
            return Mono.empty();
        }

        return inset(datasetId, items)
                .retryWhen(AsyncUtils.handleConnectionError());
        return inset(datasetId, items);
Minor: typo here, inset instead of insert.
    }

    private Mono<Long> inset(UUID datasetId, List<DatasetItem> items) {
        return asyncTemplate.nonTransaction(connection -> {
            List<List<DatasetItem>> batches = Lists.partition(items, bulkConfig.getSize());
This partitioning increases the chances of partial insertions, especially under error scenarios, and the problems derived from them.
Any idea how far batches of around 1000 items are from the ClickHouse query size limit (for dataset items, experiment items, etc.)?
Depending on the JSON size, we can increase the query size limit, but I believe we might need this in some way. I also thought about partial inserts, but then I checked, and transactions are still not supported in the R2DBC driver. Once they are, we could wrap all these requests in a transaction, or create a proper pipeline to keep track of batch inserts.
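To make the trade-off concrete, a hedged sketch of the subset-per-call flow; asyncTemplate, bulkConfig, and Lists.partition appear in the diff above, while insertBatch is a hypothetical helper that renders and executes the INSERT for one subset:

// Hedged sketch: split the payload and issue one ClickHouse call per subset. concatMap
// keeps the subsets sequential, so a failure stops before later subsets run, but earlier
// subsets may already be written (the partial-insert risk discussed here).
private Mono<Long> insert(UUID datasetId, List<DatasetItem> items) {
    return asyncTemplate.nonTransaction(connection -> {
        // Guava: consecutive sublists of at most bulkConfig.getSize() elements.
        List<List<DatasetItem>> batches = Lists.partition(items, bulkConfig.getSize());
        return Flux.fromIterable(batches)
                .concatMap(batch -> insertBatch(connection, datasetId, batch)) // hypothetical helper returning Mono<Long>
                .reduce(0L, Long::sum); // total rows written across all subsets
    });
}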
public static class QueryItem {
    public final int index;
    public final boolean hasNext;

    public QueryItem(int index, boolean hasNext) {
        this.index = index;
        this.hasNext = hasNext;
    }
}
Minor: use lombok.
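For instance, a sketch of the Lombok version of the class above (the annotation choice is a suggestion, not the PR's code):

// Hedged sketch: Lombok generates the constructor that QueryItem currently declares by hand.
@lombok.AllArgsConstructor
public static class QueryItem {
    public final int index;
    public final boolean hasNext;
}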
    }
}

public static List<QueryItem> getQueryItemPlaceHolder(Collection<?> items) {
Minor: only the size of the collection is really used.
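A hedged sketch of that simplification; the signature change is the suggestion, and the body is only illustrative of building the placeholders from the count alone:

// Hedged sketch: accept the size directly, since only the element count is needed.
public static List<QueryItem> getQueryItemPlaceHolder(int size) {
    return IntStream.range(0, size)
            .mapToObj(i -> new QueryItem(i, i < size - 1)) // hasNext is false only for the last item
            .toList();
}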
Details
Before:
After:
Resolves #
Testing
Documentation