- Author(s): Rustin Liu
- Discussion PR: pingcap#49018
- Tracking Issue: pingcap#50132
- Priority Queue for Auto Analyze
Auto analyze is a background job that automatically collects statistics for tables. This design proposes a priority queue for auto analyze to improve the efficiency of auto analyze.
We have received numerous complaints in the past regarding the auto-analysis tasks. If users have many tables, we must assist them in analyzing the tables in the background to ensure up-to-date statistics are available for generating the best plan in the optimizer. There are some complaints from users:
- The automatic analysis retry mechanism has flaws.
- If one table analysis keeps failing, it will halt the entire auto-analysis process.
- Automatic analysis may encounter starvation issues.
- The analysis of small tables is being delayed due to the analysis of some large tables.
- Due to the random selection algorithm, some tables cannot be analyzed for a long time.
So we need to design a priority queue to solve these problems.
Before we design the priority queue, we need to know the current auto analyze process.
- During the TiDB bootstrap process, we initiate the load and update of the stats worker. session.go?L3470:15
- We spawn a dedicated worker to perform automated analysis tasks. domain.go?L2457:19
- For every stats lease, we will trigger an auto-analysis check: domain.go?L2469:17
Please note that we only run it on the TiDB owner instance.
- Check if we are currently within the allowed timeframe to perform the auto-analysis. autoanalyze.go?L149:15
- To prevent the process from getting stuck on the same table's analysis failing, randomize the order of databases and tables. autoanalyze.go?L155:5
- Try to execute the auto-analysis SQL. autoanalyze.go?L243:6
- If the statistics are pseudo, it means they have not been loaded yet, so there is no need to analyze them. autoanalyze.go?L247:5
- If the table is too small, it may not be worth analyzing. We should focus on larger tables. autoanalyze.go?L247:5
- If the table has never been analyzed, we need to analyze it. autoanalyze.go?L287:6
- If the table reaches the autoAnalyzeRatio, we need to analyze it. autoanalyze.go?L287:6
We use
modify_count/count
to calculate the ratio. autoanalyze.go?L301:40 - If the indexes have no statistics, we need to analyze them anyway. autoanalyze.go?L262:15
The above process is the current auto analyze process. We can see that the current auto analyze process is a random selection algorithm. We need to design a priority queue to solve the above problems.
Since we perform analysis tasks synchronously on a single node for each table, we need to carry out weighted sorting on the tables that need analysis.
- Only the total count of the table more than 1000 rows will be considered.
- We still use tidb_auto_analyze_ratio to determine if the table needs to be analyzed. The default value is 0.5. It means that auto-analyze is triggered when greater than 50% of the rows in a table have been modified.
- If a table is added to the auto-analysis queue, a weight must be assigned to it to determine the execution order.
The refresh time of the auto-analyze queue is not only determined by the frequency we set, but also depends on the execution time of the previous analysis task. Therefore, we can safely continue to use the old scheme of refreshing the queue every 3 seconds. This will only lead to excessive CPU consumption for auto-analyze checks when the entire cluster is idle, which is acceptable for us at the moment. In the future, we can completely solve this issue by updating the queue instead of rebuilding the entire queue.
Weight table:
Name | Meaning | Weight |
---|---|---|
Percentage of Change | The percentage of change since the last analysis. Note: For unanalyzed tables, we set the percentage of changes to 100%. | log10(1 + Change Ratio). Check the graph. Note: For unanalyzed tables, we set the percentage of changes to 100% |
Table Size | The size is equal to the number of rows multiplied by the number of columns in the table that are subject to analysis. The smaller tables should have a higher priority than the bigger ones. | Applying a logarithmic transformation, namely log10(1 + Table Size), and its 'penalty' is calculated as 1 - log10(1 + Table Size). Check the graph. |
Analysis Interval | Time since the last analysis execution for the table. The bigger interval should have a higher priority than the smaller interval. | Applying a logarithmic transformation, namely log10(1 + Analysis Interval). To further compress the rate of growth for larger values, we can consider taking the logarithmic square root of x'. The final formula is log10(1 + √Analysis Interval). Check the graph. |
Special Event | For example, the table has a new index but it hasn't been analyzed yet. | HasNewindexWithoutStats: 2 |
We need to design a formula that ensures the three variables (Change Ratio of the table, Size of the table, and the Time Interval since the last analysis) maintain specific proportions when calculating weights. Here is a basic idea:
- Table Change Ratio (Change Ratio): Accounts for 60%
- Table Size (Size): Accounts for 10%
- Analysis Interval (Analysis Interval): Accounts for 30%
The calculation formula is: priority_score = (priority_score = (change_percentage[size] * last_failed_time_weight[interval] * special_event[event])
The ratio mentioned above is determined based on our current sample data. We need more tests to ascertain a more accurate ratio. Furthermore, we will expose these ratios as some configurations, as more precise control and adjustments may be necessary in different scenarios.
For partitioned tables, if the pruning mode is static, then we don't need to merge the global statistics, so we can consider it as a normal table and calculate the weight for it. But if the pruning mode is dynamic, then we need to get all the partitions that need to be analyzed and calculate the average percentage of changes, and consider it as a single item in the priority queue.
Pseudocode:
function calculateAvgChangeForPartitions(partitionStats, defs, autoAnalyzeRatio):
totalChangePercent = 0
count = 0
partitionNames = []
for each def in defs:
tblStats = partitionStats[def.ID]
changePercent = calculateChangePercentage(tblStats, autoAnalyzeRatio)
if changePercent is 0:
continue
totalChangePercent += changePercent
append def.Name.O to partitionNames
count += 1
avgChange = totalChangePercent / count
return avgChange, partitionNames
function calculateChangePercentage(tblStats, autoAnalyzeRatio):
if tblStats.Pseudo or tblStats.RealtimeCount < AutoAnalyzeMinCnt:
return 0
if not TableAnalyzed(tblStats):
return 1
tblCnt = tblStats.RealtimeCount
if histCnt = tblStats.GetAnalyzeRowCount() > 0:
tblCnt = histCnt
res = tblStats.ModifyCount / tblCnt
if res > autoAnalyzeRatio:
return res
return 0
Sometimes we may encounter some problem when we analyze a table, we need to avoid analyzing the same table again and again. So after we select a table from the priority queue. We need to make sure that it is valid to be analyzed. We check the interval between now and the last failed analysis ends time to determine if we need to analyze the selected table.
The calculation rule is: if interval >= 2 * average automatic analysis interval then we thought it was a valid table to be analyzed. We only compute it after we get it from the priority queue, which would help us save a lot of resources. Because getting this information from TiKV is very expensive.
Pseudocode:
function IsValidToAnalyze(j):
if j.Weight is 0:
return false
lastFailedAnalysisDuration, err1 = getLastFailedAnalysisDuration(j.DBName, j.TableName, "")
if err1 is not nil:
return false
averageAnalysisDuration, err2 = getAverageAnalysisDuration(j.DBName, j.TableName, "")
if err2 is not nil:
return false
// Failed analysis duration is less than 2 times the average analysis duration.
// Skip this table to avoid too many failed analysis.
if lastFailedAnalysisDuration < 2 * averageAnalysisDuration:
return false
return true
Pick One Table From The Priority Queue (default: every 3s)
-
How many auto-analysis tasks can we run at the same time in the cluster?
Only one. We only execute the auto-analysis background worker and task on the owner node.
-
What happens if we analyze the same table multiple times at the same time?
It will use the most recent successful analysis result.
-
Why don't we simply separate the queues for large tables and small tables?
Because currently in our entire cluster, only the owner node can submit tasks on its own instance. Even if we divide into two queues, we will still encounter situations of mutual blocking. Unless we can submit tasks for both large and small tables simultaneously on the owner node. But I'm afraid this would put additional pressure on that node. Additionally, we cannot simply determine what constitutes a large table and what constitutes a small table. Therefore, weighting based on the number of rows might be more reasonable.
-
How to get the last failed automatic analysis time?
We can find the latest failed analysis from mysql.analyze_jobs. It has a fail_reason column.
-
How do we identify if a table has a new index?
During the bootstrap process, we load statistics for both indexes and columns and store them in a cache. This cache allows us to identify if a table has a new index without statistics.
-
How do we ascertain if a table has never been analyzed?
We utilize the cache to verify if there are no loaded statistics for any indexes or columns of a table. If the cache doesn't contain any statistics for a table, it indicates that the table has never been analyzed.
This feature requires a focus on both correctness and performance tests. The primary objective of the correctness tests is to validate the accuracy of priority calculations. Performance tests aim to ensure that the priority queue's operation doesn't negatively impact system performance.
These tests should cover all potential scenarios involving the priority queue. For instance, the priority queue should correctly handle situations such as:
- A table that has never undergone analysis.
- A table that has been analyzed, but its index lacks statistical data.
- Multiple tables experiencing significant changes, requiring the priority queue to assign appropriate priorities.
- A single table undergoing substantial changes, but analysis fails. In this case, the priority queue should assign the correct priority based on the last failed interval.
- Mix of the above scenarios.
This feature is designed to seamlessly integrate with all existing functionalities, ensuring that the introduction of the priority queue does not compromise the accuracy of the auto-analyze process.
To provide users with control, we will introduce a new system variable that allows the enabling or disabling of the priority queue. By default, this feature will be set to OFF
.
Following extensive testing and validation, we may consider setting the priority queue as the default option in future iterations.
Calculating the priority score for each table should not negatively impact the system's performance. We will perform extensive testing to ensure that the priority queue does not introduce any performance issues.
After the priority queue is enabled, the auto analyze process will be changed from random selection to weighted sorting. From the perspective of the user, the auto analyze process will be more reasonable.
Statistics are refreshed in the following cases:
- When there are no statistics.
- When it has been a long time since the last refresh, where "long time" is based on a moving average of the time across the last several refreshes.
- After a successful
IMPORT
orRESTORE
into the table. - After any schema change affecting the table.
- After each mutation operation (
INSERT
,UPDATE
, orDELETE
), the probability of a refresh is calculated using a formula that takes the cluster settings shown in the following table as inputs. These settings define the target number of rows in a table that must be stale before statistics on that table are refreshed. Increasing either setting will reduce the frequency of refreshes. In particular,min_stale_rows
impacts the frequency of refreshes for small tables, whilefraction_stale_rows
has more of an impact on larger tables.
Setting | Default Value | Details |
---|---|---|
sql.stats.automatic_collection.fraction_stale_rows |
0.2 | Target fraction of stale rows per table that will trigger a statistics refresh. |
sql.stats.automatic_collection.min_stale_rows |
500 | Target minimum number of stale rows per table that will trigger a statistics refresh. |
You can configure automatic statistics collection on a per-table basis.
- Each server will create a refresher to refresh the stats periodically. server_sql.go?L1126:2
- The refresher spawns a goroutine to try to trigger a refresh every minute. automatic_stats.go?L438:11
- Refreshers use a map to store all mutation counts and use it as affected rows to try to trigger a refresh. automatic_stats.go?L518:10
maybeRefreshStats
implements core logic.- Use the average full refresh time to check if too much time has passed since the last refresh. automatic_stats.go?L817:3
- Use
statsFractionStaleRows
andstatsMinStaleRows
to calculate target rows:targetRows := int64(rowCount*statsFractionStaleRows) + statsMinStaleRows
- Generate a non-negative pseudo-random number in the half-open interval
[0,targetRows)
to check if it needs to trigger a refresh. automatic_stats.go?L836:6 - Try to refresh the table. This function will execute SQL through the CRDB job framework:
CREATE STATISTICS %s FROM [%d] WITH OPTIONS THROTTLING %% AS OF SYSTEM TIME '-%s'
automatic_stats.go?L843:14 - If it meets
ConcurrentCreateStatsError
- If it must be refreshed, then set the
rowsAffected
to 0, so that we don't force a refresh if another node has already done it. - If it is not a must-be-refreshed table, we ensure that the refresh is triggered during the next cycle by passing a very large number(
math.MaxInt32
) to therowsAffected
.
- If it must be refreshed, then set the
- Clean up old mutation counts. automatic_stats.go?L540:7
-
Who can send the mutation count to the refresher?
NotifyMutation
is called by SQL mutation operations to signal the Refresher that a table has been mutated. Method NotifyMutation (automatic_stats.go?L729:21) -
How do they determine the analyzing conflicts?
They store all the analysis jobs in their database. So they can check whether there are any other CreateStats jobs in the pending, running, or paused status that started earlier than this one.
-
Which node will execute the analysis job?
Because CRDB has a scheduled job framework, it depends on the executor and scheduler.
For an auto-analysis job, it has an inline executor, it simply executes the job's SQL in a txn. Method ExecuteJob (executor_impl.go?L40:38)
The scheduler logic: In short, each node will start a scheduled job execution daemon to attempt to retrieve an executable task from the job queue for execution. Each node has a maximum number of runnable tasks.
The innodb_stats_auto_recalc variable, which is enabled by default, controls whether statistics are calculated automatically when a table undergoes changes to more than 10% of its rows. You can also configure automatic statistics recalculation for individual tables by specifying the STATS_AUTO_RECALC
clause when creating or altering a table.
- It uses a
recalc_pool
to store tables that need to be processed by background statistics gathering. dict0stats_bg.cc?L87:23 - Call
row_update_statistics_if_needed
when updating data. row0mysql.cc?L1101:20- It uses
stat_modified_counter
to indicate how many rows have been modified since the last stats recalc. When a row is inserted, updated, or deleted, it adds to this number. - If
counter
>n_rows
/ 10 (10%), then it pushes the table into therecalc_pool
. row0mysql.cc?L1119:9 - Call
dict_stats_recalc_pool_add
to add a table intorecalc_pool
. dict0stats_bg.cc?L117:6
- It uses
- A thread named
dict_stats_thread
is created to collect statistics in the background. dict0stats_bg.cc?L355:6 - The stats thread is notified by
dict_stats_event
, it is set bydict_stats_recalc_pool_add
. dict0stats_bg.cc?L137:16 - It also wakes up periodically even if not signaled. dict0stats_bg.cc?L365:5
- After it is notified, it calls
dict_stats_process_entry_from_recalc_pool
to get a table from the pool to recalculate the stats. dict0stats_bg.cc?L261:13- If there are a lot of small hot tables, it puts them back and picks another one in the next round.
-
How many auto-analysis tasks can we run at the same time on the server?
It only uses one stats thread, picking one table to run each time.
When the automatic update statistics option, AUTO_UPDATE_STATISTICS is ON, the Query Optimizer determines when statistics might be out-of-date and then updates them when they are used by a query.
Starting with SQL Server 2016 (13.x) and under the database compatibility level 130, the Database Engine also uses a decreasing, dynamic statistics recompilation threshold that adjusts according to the table cardinality at the time statistics were evaluated.
Table type | Table cardinality (n) | Recompilation threshold (# modifications) |
---|---|---|
Temporary | n < 6 | 6 |
Temporary | 6 <= n <= 500 | 500 |
Permanent | n <= 500 | 500 |
Temporary or permanent | n > 500 | MIN (500 + (0.20 n), SQRT(1,000 n)) |
For example, if your table contains 2 million rows, then the calculation is the minimum of 500 + (0.20 * 2,000,000) = 400,500
and SQRT(1,000 * 2,000,000) = 44,721
. This means the statistics will be updated every 44,721 modifications.
-
After adopting a time interval as the indicator, will it prevent us from implementing a strategy of updating the queue instead of rebuilding the entire queue in the future? Because after each change, we need to recalculate the time interval for all tables to determine their priority.
Perhaps we can tolerate a temporary delay, for example, by updating the entire queue only after encountering 100 updates.