Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement new pganalyze.explain_analyze() helper #655

Open
wants to merge 10 commits into
base: main
Choose a base branch
from

Conversation

lfittl
Copy link
Member

@lfittl lfittl commented Dec 19, 2024

This executes on-demand query runs for the Query Tuning feature via
a helper function that prevents reading the table data directly, as well
as granting the collector user unnecessary permissions.

This is now the only way that query runs can execute EXPLAIN ANALYZE,
direct execution without the helper is intentionally not supported.

@lfittl lfittl requested a review from a team December 19, 2024 02:57
@keiko713
Copy link
Contributor

I think overall this is looking good (I admit that I haven't checked the SQL part).
Are we going to retire the enable_query_runner flag in the different PR? Also, we will (well, the app will) want to know if the helper function exists or not, so it would be great if it's reported as a collector config.

@lfittl
Copy link
Member Author

lfittl commented Dec 19, 2024

Are we going to retire the enable_query_runner flag in the different PR? Also, we will (well, the app will) want to know if the helper function exists or not, so it would be great if it's reported as a collector config.

Yeah, I was going to remove the flag in a different PR.

And good point re: the helper - though technically we can also get that via the schema function data that already gets reported.

The main reason to stay with that existing flow that I could see (vs also including it in collector info), is that this is a per-database helper, like the Index Advisor helpers. Storing this information in collector info for many databases might not be ideal?

@keiko713
Copy link
Contributor

The main reason to stay with that existing flow that I could see (vs also including it in collector info), is that this is a per-database helper, like the Index Advisor helpers. Storing this information in collector info for many databases might not be ideal?

Ah, interesting, so you'll need to create a helper function to all databases (if you have many databases). Yeah in that case, sending out in the collector info doesn't make sense.
In that case, we may want to consider adding the function to https://github.com/pganalyze/collector/blob/main/runner/generate_helper_sql.go or providing some similar things for this.

I think I can workaround by checking the function definition on the app side 👍

@lfittl lfittl force-pushed the explain-function-helper branch 2 times, most recently from 0771dbe to c61e0e4 Compare December 24, 2024 04:52
TEST_DATABASE_URL: postgresql://postgres:postgres@localhost:5432/postgres?sslmode=disable
services:
postgres:
image: postgres:14
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using 14 as a minimum here since the EXPLAIN ANALYZE tests rely on being able to turn off compute_query_id in the output (and that was added in 14).

@lfittl
Copy link
Member Author

lfittl commented Dec 24, 2024

@keiko713 @seanlinsley FYI, this is ready for re-review. I ended up adding both parameter support and Postgres setting support here as well, since it was quick enough, and even if the server doesn't use it that works fine.

protobuf/server_message.proto Outdated Show resolved Hide resolved
util/sql_helpers.go Show resolved Hide resolved
input/postgres/explain_analyze.go Outdated Show resolved Hide resolved
input/postgres/explain_analyze.go Outdated Show resolved Hide resolved
input/postgres/explain_analyze.go Outdated Show resolved Hide resolved
main.go Outdated Show resolved Hide resolved
util/explain_analyze_helper.sql Outdated Show resolved Hide resolved
util/explain_analyze_helper.sql Outdated Show resolved Hide resolved
runner/query_run.go Show resolved Hide resolved
main.go Outdated Show resolved Hide resolved
This executes on-demand query runs for the Query Tuning feature via
a helper function that prevents reading the table data directly, as well
as granting the collector user unnecessary permissions.

This is now the only way that query runs can execute EXPLAIN ANALYZE,
direct execution without the helper is intentionally not supported.
Previously this was using "defer db.Close()" in a loop, which would
cause the database connections to only be released once the outside
function terminates, meaning connections were kept open too long.

To close each connection correctly when the run finishes, and to reduce
the duplication of some of the error handling, use a dedicated function
to process the parts of a query run that use a database connection.
@lfittl lfittl force-pushed the explain-function-helper branch from 839f4df to 7014175 Compare December 27, 2024 23:30
@lfittl lfittl force-pushed the explain-function-helper branch from 7014175 to c4de2a7 Compare December 27, 2024 23:48
Like "--generate-stats-helper-sql" this allows generating a SQL script
that installs the pganalyze.explain_analyze helper into all databases on
a server, as specified by the server section name passed as an argument.

Additionally, the "--generate-helper-explain-analyze-role" option allows
setting the role that the helper function should be owned by as a value,
defaulting to "pganalyze_explain".

Note that the role creation, as well as any GRANT statements for the role
must be taken care of separately.
@lfittl lfittl force-pushed the explain-function-helper branch from c4de2a7 to 8fcc1f9 Compare December 27, 2024 23:53
@lfittl lfittl force-pushed the explain-function-helper branch from dc15ca0 to f192625 Compare December 28, 2024 00:08
@lfittl
Copy link
Member Author

lfittl commented Dec 28, 2024

@seanlinsley @keiko713 This is ready for re-review, all feedback should be addressed.

runner/websocket.go Outdated Show resolved Hide resolved
runner/query_run.go Show resolved Hide resolved
@@ -130,6 +130,34 @@ func Run(ctx context.Context, wg *sync.WaitGroup, globalCollectionOpts state.Col
return
}

if globalCollectionOpts.GenerateExplainAnalyzeHelperSql != "" {
wg.Add(1)
testRunSuccess = make(chan bool)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't have to be part of this PR, but we may want to add test run output for the presence of the helper function and success on running a test explain.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that's a good point. I'll add this in a follow-up PR.

PERFORM 1 FROM pg_roles WHERE (rolname = current_user AND rolsuper) OR (pg_has_role(oid, 'MEMBER') AND rolname IN ('rds_superuser', 'azure_pg_admin', 'cloudsqlsuperuser'));
IF FOUND THEN
RAISE EXCEPTION 'cannot run: helper is owned by superuser - recreate function with lesser privileged user';
END IF;
Copy link
Member

@seanlinsley seanlinsley Dec 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This error occurs in local testing because my Postgres user is a superuser. Maybe the OR should be an AND instead?

For reference, these are the values for the different where clauses:

select rolsuper, pg_has_role(oid, 'MEMBER'), rolname IN ('rds_superuser', 'azure_pg_admin', 'cloudsqlsuperuser') from pg_roles where rolname = current_user;
 rolsuper | pg_has_role | ?column? 
----------+-------------+----------
 t        | t           | f

If the logic in this check is intentional, what should I do to my local Postgres install to work around it, and how should we communicate this issue in the docs?

Copy link
Member Author

@lfittl lfittl Dec 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you follow the instructions on setting up the separate user (preview) it should work as expected, even in a local setup. Its intentionally an OR, since we want to block users that run Postgres on a VM to use a real superuser for this.

server.QueryRunsMutex.Unlock()

for name, value := range query.PostgresSettings {
_, err = db.ExecContext(ctx, postgres.QueryMarkerSQL+"SET %s = %s", pq.QuoteIdentifier(name), pq.QuoteLiteral(value))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When testing this locally with statement_timeout set to 1s, this fails with pq: syntax error at or near "%". Looks like a fmt.Sprintf call is needed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yeah, strange, I must have dropped that in a refactoring by accident.

Copy link
Member Author

@lfittl lfittl Dec 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This now works as expected for your test case (that you had in the other PR).

That said, I do wonder if we should explicitly block changing statement_timeout through the parameters in the query run, or at least have some higher limit that's a hard limit? (i.e. can't be changed by the server side)

Maybe something like 1 minute as the default limit, can be raised to up to 10 minutes by the server side, but nothing higher?

Assuming we'd want to do that in the future, its likely easier to enforce by passing it as a separate value (not a free form parameter, i.e block it in postgres_settings), so we don't have to deal with 20min vs 20s parameter value parsing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed it makes sense to block it here and make it a standalone setting so we can validate it. But applying a max value is going to depend on the type of query run that's being performed; if in the future we're doing a concurrent reindex it should be fine for that to run for an arbitrary amount of time.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, agreed. I'll filter it out here for now and we can add customizing it in a separate PR.

Copy link
Member

@seanlinsley seanlinsley Dec 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually on second thought a separate field isn't really needed: we can validate the value in postgres_settings since the map key is always going to be named statement_timeout.

For now though, we can bypass this issue by making the statement timeout non-configurable. That can be done by moving the existing postgres.SetStatementTimeout call after this code, so the hardcoded 1 minute timeout is always used.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually on second thought a separate field isn't really needed: we can validate the value in postgres_settings since the map key is always going to be named statement_timeout.

The problem is not the key, but the value - since you can use different duration units (seconds vs minutes), we'd have to parse them in order to validate the value.

For now though, we can bypass this issue by making the statement timeout non-configurable. That can be done by moving the existing postgres.SetStatementTimeout call after this code, so the hardcoded 1 minute timeout is always used.

Sure, I'll move it down for now so it takes effect after the other settings.

Per the Postgres documentation at https://www.postgresql.org/docs/current/plpgsql-control-structures.html#PLPGSQL-ERROR-TRAPPING
"The special condition name OTHERS matches every error type except
QUERY_CANCELED and ASSERT_FAILURE.", thus we need to explicitly include
the QUERY_CANCELED in order to deallocate the statement in case of
statement timeouts. Since we don't use asserts, we do not need to catch
ASSERT_FAILURE.
Since we're currently not filtering which connection level parameters
get set on the collector side, this ensures that we enforce the
statement_timeout to 60 secs even if the server sends a different value.
@seanlinsley
Copy link
Member

This looks good overall, but I did run into one bug: the statement timeout doesn't seem to take effect. For example if I run a SELECT pg_sleep(90) query through the system, the collector responds back with JSON including "Execution Time": 90001.987.

@lfittl
Copy link
Member Author

lfittl commented Jan 2, 2025

This looks good overall, but I did run into one bug: the statement timeout doesn't seem to take effect. For example if I run a SELECT pg_sleep(90) query through the system, the collector responds back with JSON including "Execution Time": 90001.987.

Strange - I had tested that and it worked for me, but I can recheck tomorrow. It sounds like you got an actual result vs the EXPLAIN without ANALYZE that should happen when the query times out.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants