Optimize to_history_dataset_association in create_datasets_from_library_folder #18970
Conversation
lib/galaxy/model/__init__.py
Outdated
@@ -6073,8 +6073,9 @@ def to_history_dataset_association(self, target_history, parent_id=None, add_to_
    hda.metadata = self.metadata
    if add_to_history and target_history:
        target_history.add_dataset(hda)
`add_dataset` also causes a flush; the modern way of handling this is `history.stage_addition` and then, at the end, `history.add_pending_items`.
With optimizations like this, it's usually a good idea to profile the changes so you can see a before/after. https://docs.galaxyproject.org/en/master/dev/finding_and_improving_slow_code.html has some pointers on that.
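The staging pattern described here can be sketched with a hypothetical mock (`MockHistory` and its internals below are illustrative only, not Galaxy's actual implementation): `stage_addition` queues datasets in memory, and `add_pending_items` materializes them with a single flush, instead of one flush per `add_dataset` call.

```python
# Hypothetical mock, not Galaxy's implementation: it only illustrates why
# batching reduces database round-trips.
class MockHistory:
    def __init__(self):
        self.datasets = []
        self._pending = []
        self.flush_count = 0  # stands in for DB round-trips

    def add_dataset(self, hda):
        # old path: every added dataset triggers its own flush
        self.datasets.append(hda)
        self._flush()

    def stage_addition(self, hda):
        # new path: just queue the dataset in memory
        self._pending.append(hda)

    def add_pending_items(self):
        # materialize everything that was staged with a single flush
        self.datasets.extend(self._pending)
        self._pending.clear()
        self._flush()

    def _flush(self):
        self.flush_count += 1


history = MockHistory()
for hda in ["hda1", "hda2", "hda3"]:
    history.stage_addition(hda)
history.add_pending_items()
print(history.flush_count)  # 1 flush for 3 datasets, instead of 3
```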
Importing a library folder with 26 datasets into the history, here are the profiling results before the optimization:
*** PROFILER RESULTS ***
_create (/home/arash/Galaxy/galaxy-arash/lib/galaxy/webapps/galaxy/api/history_contents.py:726)
function called 1 times
633147 function calls (621986 primitive calls) in 1.851 seconds
Ordered by: cumulative time, internal time, call count
List reduced from 1643 to 40 due to restriction <40>
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 1.859 1.859 history_contents.py:726(_create)
1 0.000 0.000 1.859 1.859 history_contents.py:508(create)
1 0.001 0.001 1.857 1.857 history_contents.py:1136(__create_datasets_from_library_folder)
26 0.001 0.000 1.051 0.040 __init__.py:6051(to_history_dataset_association)
26 0.000 0.000 0.753 0.029 base.py:760(serialize_to_view)
26 0.000 0.000 0.753 0.029 hdas.py:637(serialize)
5321 0.006 0.000 0.705 0.000 attributes.py:552(__get__)
2455/1851 0.004 0.000 0.701 0.000 attributes.py:1063(get)
888/837 0.002 0.000 0.689 0.001 attributes.py:1108(_fire_loader_callables)
26 0.000 0.000 0.604 0.023 datasets.py:777(serialize)
26 0.004 0.000 0.601 0.023 base.py:686(serialize)
733 0.006 0.000 0.600 0.001 strategies.py:867(_load_for_state)
1731/631 0.006 0.000 0.561 0.001 state_changes.py:95(_go)
27 0.000 0.000 0.550 0.020 session.py:1992(commit)
53/27 0.000 0.000 0.550 0.020 <string>:1(commit)
53/27 0.001 0.000 0.550 0.020 session.py:1306(commit)
603 0.009 0.000 0.512 0.001 session.py:2134(_execute_internal)
577 0.001 0.000 0.493 0.001 session.py:2301(execute)
681 0.001 0.000 0.489 0.001 elements.py:506(_execute_on_connection)
681 0.006 0.000 0.487 0.001 base.py:1590(_execute_clauseelement)
655 0.001 0.000 0.474 0.001 base.py:1374(execute)
577 0.002 0.000 0.455 0.001 context.py:295(orm_execute_statement)
53 0.000 0.000 0.449 0.008 base.py:2614(commit)
53 0.001 0.000 0.449 0.008 base.py:2731(_do_commit)
53 0.000 0.000 0.448 0.008 base.py:2706(_connection_commit_impl)
53 0.000 0.000 0.448 0.008 base.py:1131(_commit_impl)
53 0.000 0.000 0.448 0.008 default.py:701(do_commit)
53 0.447 0.008 0.447 0.008 {method 'commit' of 'sqlite3.Connection' objects}
337 0.007 0.000 0.433 0.001 loading.py:526(load_on_pk_identity)
394 0.011 0.000 0.404 0.001 strategies.py:994(_emit_lazyload)
26 0.001 0.000 0.393 0.015 __init__.py:3329(add_dataset)
26 0.002 0.000 0.342 0.013 __init__.py:3294(_next_hid)
681 0.006 0.000 0.328 0.000 base.py:1791(_execute_context)
681 0.007 0.000 0.294 0.000 base.py:1850(_exec_single_context)
1573 0.005 0.000 0.281 0.000 {built-in method builtins.next}
53/27 0.000 0.000 0.277 0.010 <string>:1(_prepare_impl)
53/27 0.000 0.000 0.277 0.010 session.py:1271(_prepare_impl)
26 0.001 0.000 0.276 0.011 session.py:4322(flush)
26 0.001 0.000 0.275 0.011 session.py:4371(_flush)
681 0.001 0.000 0.267 0.000 default.py:940(do_execute)
And the profiling results after the optimization:
*** PROFILER RESULTS ***
_create (/home/arash/Galaxy/galaxy-arash/lib/galaxy/webapps/galaxy/api/history_contents.py:726)
function called 1 times
485515 function calls (477225 primitive calls) in 0.814 seconds
Ordered by: cumulative time, internal time, call count
List reduced from 1734 to 40 due to restriction <40>
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.819 0.819 history_contents.py:726(_create)
1 0.000 0.000 0.819 0.819 history_contents.py:508(create)
1 0.000 0.000 0.814 0.814 history_contents.py:1136(__create_datasets_from_library_folder)
26 0.000 0.000 0.537 0.021 base.py:760(serialize_to_view)
26 0.000 0.000 0.537 0.021 hdas.py:637(serialize)
26 0.000 0.000 0.486 0.019 datasets.py:777(serialize)
5139 0.005 0.000 0.484 0.000 attributes.py:552(__get__)
26 0.004 0.000 0.483 0.019 base.py:686(serialize)
2003/1460 0.004 0.000 0.480 0.000 attributes.py:1063(get)
777/751 0.002 0.000 0.470 0.001 attributes.py:1108(_fire_loader_callables)
672 0.006 0.000 0.411 0.001 strategies.py:867(_load_for_state)
337 0.010 0.000 0.348 0.001 strategies.py:994(_emit_lazyload)
446 0.007 0.000 0.341 0.001 session.py:2134(_execute_internal)
420 0.001 0.000 0.330 0.001 session.py:2301(execute)
420 0.002 0.000 0.306 0.001 context.py:295(orm_execute_statement)
210 0.005 0.000 0.259 0.001 loading.py:526(load_on_pk_identity)
449 0.001 0.000 0.245 0.001 elements.py:506(_execute_on_connection)
449 0.004 0.000 0.244 0.001 base.py:1590(_execute_clauseelement)
423 0.001 0.000 0.236 0.001 base.py:1374(execute)
1 0.000 0.000 0.171 0.171 history_contents.py:1175(<listcomp>)
449 0.003 0.000 0.150 0.000 elements.py:668(_compile_w_cache)
52 0.000 0.000 0.136 0.003 base.py:654(url_for)
52 0.000 0.000 0.136 0.003 __init__.py:206(__call__)
52 0.000 0.000 0.136 0.003 applications.py:106(url_path_for)
26 0.001 0.000 0.136 0.005 __init__.py:6051(to_history_dataset_association)
52 0.049 0.001 0.135 0.003 routing.py:656(url_path_for)
26 0.000 0.000 0.115 0.004 datasets.py:731(serialize_creating_job)
130 0.001 0.000 0.115 0.001 __init__.py:5120(creating_job)
81 0.001 0.000 0.104 0.001 loading.py:1597(load_scalar_attributes)
81 0.001 0.000 0.099 0.001 loading.py:487(load_on_ident)
449 0.004 0.000 0.089 0.000 base.py:1791(_execute_context)
29432 0.054 0.000 0.086 0.000 routing.py:268(url_path_for)
420 0.004 0.000 0.081 0.000 context.py:565(orm_setup_cursor_result)
26 0.000 0.000 0.079 0.003 hdas.py:609(<lambda>)
461 0.002 0.000 0.078 0.000 langhelpers.py:1265(oneshot)
449 0.001 0.000 0.076 0.000 cache_key.py:411(_generate_cache_key)
449 0.003 0.000 0.075 0.000 cache_key.py:347(_generate_cache_key)
2883/449 0.035 0.000 0.072 0.000 cache_key.py:221(_gen_cache_key)
420 0.008 0.000 0.072 0.000 loading.py:78(instances)
26 0.000 0.000 0.066 0.003 hdas.py:617(<lambda>)
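For reference, reports like the two above can be produced with the standard-library `cProfile`/`pstats` modules (the Galaxy docs linked earlier describe project-specific tooling; the `workload` function below is just a placeholder):

```python
import cProfile
import io
import pstats


def workload():
    # placeholder for the code being measured
    return sum(i * i for i in range(10_000))


profiler = cProfile.Profile()
profiler.enable()
workload()
profiler.disable()

stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
# same ordering as the reports above: cumulative time first
stats.sort_stats("cumulative").print_stats(10)
report = stream.getvalue()
print(report)
```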
sa_session.commit()
target_history.stage_addition(hda)
if flush:
    target_history.add_pending_items()
target_history.add_pending_items()
flush = False
You know you're committing in there, so no need to do it again on the next line.
Can you check where this method is used and see whether not adding the dataset to a history combined with flush=True ever makes sense?
The `flush` argument for the `add_datasets` function is always set to `False` in the current context, which means it won't trigger a commit. Right?
Also, I noticed that in other places the datasets are committed after using this function, for example, in this part of the code.
Ah, right, I keep forgetting that. It's fine then. The second part still applies: can you check in which cases we do in fact want to commit?
I didn't get what I should look for exactly.
Flushing is different from committing: `session.commit()` will write to the database AND commit the current transaction, whereas `session.flush()` will write to the database transaction buffer without committing. What that means is that if we flush after adding or modifying a dataset, and on a subsequent commit there's any kind of db error that causes a rollback, whatever we flushed will be rolled back. If we commit, then the rollback on a subsequent commit won't affect the changes we have committed previously.
We have `flush` parameters in our code for historical reasons: pre-2.0, SQLAlchemy would commit on flush (with our `autocommit=True` setting), which is why the terms were used interchangeably. That's no longer the case, so my suggestion would be to use argument names consistent with what we actually do in the function (i.e., flush vs. commit).
And with that said, I am not at all sure we should commit in that method :) The code @arash77 referenced replaced the pre-2.0 `session.flush`: we did that to ensure there was no behavioral change (a commit stayed a commit). Otherwise, with hundreds of `session.flush` lines no longer committing, we'd never know what broke when moving to 2.0. HOWEVER, now that we have upgraded, we can take a hard look at all those commits and consider whether we really need them and, where possible, replace commits with flushes (unless we're optimizing like here, in which case we can still consider whether it's a commit that we need and not a flush).
Also, the code you referenced uses the `transaction` context manager to ensure that if there is no active transaction, we begin one before trying to commit. If there is a preceding `session.add` statement, there's guaranteed to be an active transaction.
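The flush-vs-commit distinction can be demonstrated with plain `sqlite3` from the standard library (a stand-in for the SQLAlchemy session, not Galaxy code: executing a statement writes into the open transaction, like a flush, and only a commit makes it durable against a later rollback):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dataset (name TEXT)")
conn.commit()

# "flush": the INSERT is written into the open transaction buffer...
conn.execute("INSERT INTO dataset VALUES ('buffered')")
# ...so a rollback before any commit discards it
conn.rollback()
count_after_rollback = conn.execute("SELECT COUNT(*) FROM dataset").fetchone()[0]

# "commit": the row is made durable...
conn.execute("INSERT INTO dataset VALUES ('durable')")
conn.commit()
# ...so a later rollback cannot undo it
conn.rollback()
count_after_commit = conn.execute("SELECT COUNT(*) FROM dataset").fetchone()[0]

print(count_after_rollback, count_after_commit)  # 0 1
```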
Thanks for the explanation! Just to clarify, for this case, should I change `flush` to `commit`?
> Thanks for the explanation! Just to clarify, for this case, should I change `flush` to `commit`?
Assuming you mean the name of the argument: if you intend to commit, then yes, please. If you think flushing might be sufficient (so no `session.commit()`), then keep it as flush. In other words, as long as the argument is consistent with the action.
Yes, I intend to commit here.
Thank you, then I will just rename the argument.
I think now it is fixed!
This PR was merged without a "kind/" label, please correct.
This pull request optimizes the `to_history_dataset_association` method in the `__create_datasets_from_library_folder` function. The method now takes an additional parameter, `flush`, which allows for more efficient handling of the transaction. This optimization improves the performance of creating datasets from a library folder.

How to test the changes?